Commit graph

69269 commits

Author SHA1 Message Date
Naohiro Aota 4eef29ef63 btrfs: zoned: do not use async metadata checksum on zoned filesystems
On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize
the metadata write IOs.

Even with this serialization, write bios sent from btree_write_cache_pages
can be reordered by async checksum workers as these workers are per CPU
and not per zone.

To preserve write bio ordering, disable async metadata checksumming on
zoned filesystems. This does not lower performance on HDDs, as a single
CPU core is fast enough to checksum a single zone write stream at the
maximum possible bandwidth of the device. If multiple zones are written
simultaneously, HDD seek overhead lowers the achievable maximum
bandwidth, so again the per-zone checksum serialization does not affect
performance.
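
A rough sketch of the submit-side decision (helper names and signatures
simplified, not the exact patch):

  /* Sketch: checksum metadata inline on zoned filesystems so that bios
   * leave the submitter in order; offload to workers otherwise. */
  static blk_status_t btree_submit_bio(struct btrfs_fs_info *fs_info,
                                       struct bio *bio)
  {
          blk_status_t ret;

          if (btrfs_is_zoned(fs_info)) {
                  /* Synchronous checksum keeps submission order. */
                  ret = btree_csum_one_bio(bio);
                  if (ret)
                          return ret;
                  return btrfs_map_bio(fs_info, bio, 0);
          }
          /* Regular: per-CPU async workers may reorder completions. */
          return btrfs_wq_submit_bio(fs_info, bio, btree_submit_bio_start);
  }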

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 24c0a7227f btrfs: zoned: wait for existing extents before truncating
When truncating a file, file buffers that have already been allocated
but not yet written may be truncated. Truncating these buffers could
break the sequential write pattern in a block group if the truncated
blocks are, for example, followed by blocks allocated to another file.
To avoid this problem, always wait for the write out of all unwritten
buffers before proceeding with the truncation.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 0bc09ca129 btrfs: zoned: serialize metadata IO
We cannot use zone append for writing metadata, because B-tree nodes
reference each other using logical addresses. Without knowing the
addresses in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.

We cannot add a mutex around allocation and submission because metadata
blocks are allocated in an earlier stage to build up B-trees.

Add a zoned_meta_io_lock and hold it during metadata IO submission in
btree_write_cache_pages() to serialize IOs.

Furthermore, add a per-block-group metadata IO submission pointer,
"meta_write_pointer", to ensure sequential writing, which can break when
attempting to write back blocks in an unfinished transaction. If the
write out fails because of such a hole and it is for data integrity
(WB_SYNC_ALL), return EAGAIN.

A caller like fsync() code should handle this properly e.g. by falling
back to a full transaction commit.
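
A condensed sketch of the serialization (the iterator is illustrative,
not a real helper; locking and error paths trimmed):

  /* Sketch: submit dirty metadata ebs strictly in write-pointer order. */
  mutex_lock(&fs_info->zoned_meta_io_lock);
  while ((eb = next_dirty_eb(mapping))) {    /* illustrative iterator */
          if (cache->meta_write_pointer != eb->start) {
                  /* Hole before this eb: cannot submit out of order. */
                  if (wbc->sync_mode == WB_SYNC_ALL)
                          ret = -EAGAIN;     /* caller falls back to commit */
                  break;
          }
          write_one_eb(eb, wbc);
          cache->meta_write_pointer = eb->start + eb->len;
  }
  mutex_unlock(&fs_info->zoned_meta_io_lock);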

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 42c0110009 btrfs: zoned: introduce dedicated data write path for zoned filesystems
If more than one IO is issued for one file extent, these IOs can be
written to separate regions on a device. Since we cannot map one file
extent to such separate areas on a zoned filesystem, we need to follow
the "one IO == one ordered extent" rule.

The normal buffered, uncompressed and not pre-allocated write path (used
by cow_file_range()) sometimes does not follow this rule. It can write
only a part of an ordered extent when a region to write is specified,
e.g. when it is called from fdatasync().

Introduce a dedicated (uncompressed, buffered) data write path for zoned
filesystems that COWs the region and writes it at once.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Naohiro Aota 544d24f9de btrfs: zoned: enable zone append writing for direct IO
As with buffered IO, enable zone append writing for direct IO when it is
used on a zoned block device.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Naohiro Aota d8e3fb106f btrfs: zoned: use ZONE_APPEND write for zoned mode
Enable zone append writing for zoned mode. When using zone append, a
bio is issued to the start of a target zone and the device decides to
place it inside the zone. Upon completion the device reports the actual
written position back to the host.

Three parts are necessary to enable zone append mode. First, modify the
bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust
bi_sector to point to the beginning of the zone.

Second, record the returned physical address (and disk/partno) to the
ordered extent in end_bio_extent_writepage() after the bio has been
completed. We cannot resolve the physical address to the logical address
because we can neither take locks nor allocate a buffer in this end_bio
context. So, we need to record the physical address to resolve it later
in btrfs_finish_ordered_io().

And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.
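
The three steps, condensed into a sketch (exact field names vary between
kernel versions; ordered->physical and ordered->disk are the recorded
values described above):

  /* 1) Submission: issue the bio at the zone start with zone append. */
  bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC;
  bio->bi_iter.bi_sector = zone_start_sector;

  /* 2) Completion: the block layer reports the actually written sector
   *    back in bi_sector; only record it, no locking allowed here. */
  ordered->physical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
  ordered->disk = bio->bi_disk;

  /* 3) btrfs_finish_ordered_io(): resolve physical to logical with
   *    btrfs_rmap_block() and rewrite extent map and csums if the
   *    device placed the data somewhere other than the allocation. */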

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Johannes Thumshirn 24533f6a9a btrfs: save irq flags when looking up an ordered extent
A following patch will add another caller of
btrfs_lookup_ordered_extent(), but from a bio's endio context.

btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally
disables interrupts. Change this to spin_lock_irqsave() so interrupts
aren't disabled and re-enabled unconditionally.
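
The change itself is mechanical; a minimal before/after sketch:

  /* Before: unconditionally disables and re-enables interrupts. */
  spin_lock_irq(&tree->lock);
  ...
  spin_unlock_irq(&tree->lock);

  /* After: saves and restores the interrupt state, so it is safe even
   * when called with interrupts already disabled (bio endio context). */
  unsigned long flags;

  spin_lock_irqsave(&tree->lock, flags);
  ...
  spin_unlock_irqrestore(&tree->lock, flags);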

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Johannes Thumshirn 08f455593f btrfs: zoned: cache if block group is on a sequential zone
On a zoned filesystem, cache if a block group is on a sequential write
only zone.

On sequential write only zones, we can use REQ_OP_ZONE_APPEND for
writing data. Therefore provide btrfs_use_zone_append() to figure out
whether the IO is targeting a sequential write only zone, in which case
REQ_OP_ZONE_APPEND can be used for the data write.
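
A sketch of the helper, assuming the cached flag is a bool ("seq_zone")
on the block group:

  /* Sketch: data IO may use zone append only on a sequential write only
   * zone of a zoned filesystem. */
  bool btrfs_use_zone_append(struct btrfs_inode *inode, u64 start)
  {
          struct btrfs_fs_info *fs_info = inode->root->fs_info;
          struct btrfs_block_group *cache;
          bool ret;

          if (!btrfs_is_zoned(fs_info))
                  return false;
          if (!is_data_inode(&inode->vfs_inode))
                  return false;

          cache = btrfs_lookup_block_group(fs_info, start);
          if (!cache)
                  return false;

          ret = cache->seq_zone;    /* cached "sequential write only" flag */
          btrfs_put_block_group(cache);
          return ret;
  }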

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota 138082f366 btrfs: extend btrfs_rmap_block for specifying a device
btrfs_rmap_block currently reverse-maps the physical addresses on all
devices to the corresponding logical addresses.

Extend the function to match against a specified device. The old
behavior of querying all devices is kept intact by passing NULL as the
target device.

A block_device instead of a btrfs_device is passed into btrfs_rmap_block,
as this function is intended to reverse-map the result of a bio, which
only has a block_device.

Also export the function for later use.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Johannes Thumshirn cacb2cea46 btrfs: zoned: check if bio spans across an ordered extent
To ensure that an ordered extent maps to a contiguous region on disk, we
need to maintain a "one bio == one ordered extent" rule.

Ensure that a bio under construction does not span more than one ordered
extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota d22002fd37 btrfs: zoned: split ordered extent when bio is sent
For a zone append write, the device decides the location the data is being
written to. Therefore we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent
maps to a contiguous region on disk, we need to maintain a "one bio ==
one ordered extent" rule.

Implement splitting of an ordered extent and extent map on bio submission
to adhere to the rule.

extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the
corresponding ordered extent so that the ordered extent's region fits into
one bio and the corresponding device limits.

Several sanity checks need to be done in extract_ordered_extent(), e.g.:

- We cannot split an ordered extent that is already end_bio'd, because
  we cannot divide ordered->bytes_left between the split parts
- We do not expect a compressed ordered extent
- We should not have a checksum list yet, because we omit splitting the
  list. Since the function is called before btrfs_wq_submit_bio() or
  btrfs_csum_one_bio(), this is always ensured.

We also need to split an extent map by creating a new one. If not,
unpin_extent_cache() complains about the difference between the start of
the extent map and the file's logical offset.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota cfe94440d1 btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing
Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual
devices.

Let btrfs_end_bio() and btrfs_op be aware of it, by mapping
REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of
bio_op().
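
A sketch of the resulting helper (constant names as used in the text):

  /* Sketch: classify a bio op for mapping; zone append counts as write. */
  static inline enum btrfs_map_op btrfs_op(struct bio *bio)
  {
          switch (bio_op(bio)) {
          case REQ_OP_DISCARD:
                  return BTRFS_MAP_DISCARD;
          case REQ_OP_WRITE:
          case REQ_OP_ZONE_APPEND:
                  return BTRFS_MAP_WRITE;
          default:
                  WARN_ON_ONCE(1);
                  fallthrough;
          case REQ_OP_READ:
                  return BTRFS_MAP_READ;
          }
  }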

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota e1326f0339 btrfs: zoned: use bio_add_zone_append_page
A zoned device has its own hardware restrictions, e.g.
max_zone_append_size when using REQ_OP_ZONE_APPEND. To follow these
restrictions, use bio_add_zone_append_page() instead of bio_add_page().
We need the target device to use bio_add_zone_append_page(), so this
commit reads the chunk information to cache the target device in
btrfs_io_bio(bio)->device.

Caching only the target device is sufficient here, as zoned filesystems
only support the single profile at the moment. Once more profiles are
supported, btrfs_io_bio can hold an extent_map to be able to check the
restrictions of all devices the btrfs_bio will be mapped to.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota 953651eb30 btrfs: factor out helper adding a page to bio
Factor out adding a page to a bio from submit_extent_page().  The page
is added only when the bio_flags are the same, the page is contiguous,
and it fits in the same stripe as the pages already in the bio.

The condition checks are reordered to allow an early return and avoid
the possibly heavy btrfs_bio_fits_in_stripe() call.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota dcba6e48b5 btrfs: zoned: reset zones of unused block groups
We must reset the zones of a deleted unused block group to rewind the
zones' write pointers to the zones' start.

To do this, we can use the DISCARD_SYNC code to do the reset when the
filesystem is running on zoned devices.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota 011b41bffa btrfs: zoned: advance allocation pointer after tree log node
Since the allocation info of a tree log node is not recorded in the
extent tree, calculate_alloc_pointer() cannot detect such a node, so the
calculated pointer can end up on top of a tree log node.

Replaying the log calls btrfs_remove_free_space() for each node in the
log tree.

So, advance the pointer past the node so we do not allocate over it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota d3575156f6 btrfs: zoned: redirty released extent buffers
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Such nodes are cleaned so that pages in the
node are not uselessly written out. On zoned volumes, however, this
optimization blocks the following IOs, as cancelling the write out of
the freed blocks breaks the sequential write order expected by the
device.

Introduce a list of clean and unwritten extent buffers that have been
released in a transaction. Redirty the buffers so that
btree_write_cache_pages() can send proper bios to the devices.

Besides redirtying, clear the entire content of the extent buffer so as
not to confuse raw block scanners, e.g. 'btrfs check'. Since clearing
the content makes csum_dirty_buffer() complain about a bytenr mismatch,
skip the check and checksum using the newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota 2eda57089e btrfs: zoned: implement sequential extent allocation
Implement a sequential extent allocator for zoned filesystems. This
allocator only needs to check if there is enough space in the block group
after the allocation pointer to satisfy the extent allocation request.
Therefore the allocator never manages bitmaps or clusters. Also, add
assertions to the corresponding functions.

As zone append writing is used, it would be unnecessary to track the
allocation offset, as the allocator only needs to check available space.
But by tracking and returning the offset as an allocated region, we can
skip modification of ordered extents and checksum information when there
is no IO reordering.
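
In essence the allocator is a pointer bump; a condensed sketch (error
handling and the bitmap/cluster assertions omitted):

  /* Sketch: allocate num_bytes at the block group's allocation pointer. */
  spin_lock(&block_group->lock);
  avail = block_group->length - block_group->alloc_offset;
  if (avail < num_bytes) {
          /* Not enough space left past the pointer in this block group. */
          spin_unlock(&block_group->lock);
          return -ENOSPC;
  }
  ins->objectid = block_group->start + block_group->alloc_offset;
  ins->offset = num_bytes;
  block_group->alloc_offset += num_bytes;
  spin_unlock(&block_group->lock);
  return 0;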

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota 169e0da91a btrfs: zoned: track unusable bytes for zones
In a zoned filesystem a once written then freed region is not usable
until the underlying zone has been reset. So we need to distinguish such
unusable space from usable free space.

Therefore we need to introduce the "zone_unusable" field to the block
group structure, and "bytes_zone_unusable" to the space_info structure
to track the unusable space.

Pinned bytes are always reclaimed to the unusable space. But when an
allocated region is returned before being used, e.g. when the block
group becomes read-only between allocation time and reservation time, we
can safely return the region to the block group. For this situation,
introduce btrfs_add_free_space_unused(), sketched below. It behaves the
same as btrfs_add_free_space() on regular filesystems; on zoned
filesystems, it rewinds the allocation offset.
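
A minimal sketch of the helper, assuming the returned region sits at the
tip of the zone (only then can the offset simply be rewound):

  /* Sketch: return an allocated-but-unused region; on zoned filesystems
   * the region was never written, so rewinding the pointer is safe. */
  int btrfs_add_free_space_unused(struct btrfs_block_group *bg,
                                  u64 bytenr, u64 size)
  {
          if (btrfs_is_zoned(bg->fs_info)) {
                  spin_lock(&bg->lock);
                  bg->alloc_offset -= size;
                  spin_unlock(&bg->lock);
                  return 0;
          }
          /* Regular filesystems: plain free space accounting. */
          return btrfs_add_free_space(bg, bytenr, size);
  }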

Because the read-only bytes counter tracks free but unusable bytes while
the block group is read-only, we need to migrate the zone_unusable bytes
to read-only bytes when a block group is marked read-only.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota a94794d50d btrfs: zoned: calculate allocation offset for conventional zones
Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset for sequential allocation if a block
group contains a conventional zone.

Instead, we can use the end of the highest-addressed extent in the block
group as the allocation offset.

For a new block group, we cannot calculate the allocation offset by
consulting the extent tree, because doing so can deadlock by taking an
extent buffer lock after the chunk mutex, which is already held in
btrfs_make_block_group(). Since it is a new block group anyway, we can
simply set the allocation offset to 0.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota 08e11a3db0 btrfs: zoned: load zone's allocation offset
A zoned filesystem must allocate blocks at the zones' write pointer. The
device's write pointer position can be mapped to a logical address within
a block group. To facilitate this, add an "alloc_offset" to the
block-group to track the logical addresses of the write pointer.

This logical address is populated in btrfs_load_block_group_zone_info()
from the write pointers of corresponding zones.

For now, zoned filesystems only support the single profile. Supporting
non-single profiles with zone append writing is not trivial. For
example, in the DUP profile, we send a zone append IO to two zones on a
device. The device replies with the written LBAs for the IOs. If the
offsets of the returned addresses from the beginnings of the zones
differ, then it results in different logical addresses.

We need a fine-grained logical-to-physical mapping to support such
separated physical addresses. Since that would require an additional
metadata type, disable non-single profiles for now.

This commit supports the case where all the zones in a block group are
sequential. The next patch will handle the case of having a conventional
zone.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota 381a696eb5 btrfs: zoned: verify device extent is aligned to zone
Add a check in verify_one_dev_extent() to ensure that a device extent on
a zoned block device is aligned to the respective zone boundary.

If it isn't, mark the filesystem as unclean.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota 1cd6121f2a btrfs: zoned: implement zoned chunk allocator
Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.

To implement the allocator, we need to extend the following functions for
a zoned filesystem.

- init_alloc_chunk_ctl
- dev_extent_search_start
- dev_extent_hole_check
- decide_stripe_size

init_alloc_chunk_ctl_zoned() is mostly the same as the regular one. It
always sets the stripe_size to the zone size and aligns the parameters
to the zone size (see the sketch below).

dev_extent_search_start() only aligns the start offset to zone
boundaries. We don't care about the first 1MB as on a regular
filesystem, because we reserve the first two zones for superblock
logging anyway.

dev_extent_hole_check_zoned() checks if the zones in a given hole are
either conventional or empty sequential zones. It also skips zones
reserved for superblock logging.

With the change to the hole, the new hole may now contain pending
extents, so in this case loop again to check that.

Finally, decide_stripe_size_zoned() shrinks the number of devices
instead of the stripe size, because we need to honor stripe_size ==
zone_size.
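
A sketch of the zone-size alignment (structure and field names assumed
from the list above):

  /* Sketch: pin the stripe size to the zone size and align the rest. */
  static void init_alloc_chunk_ctl_zoned(struct btrfs_fs_devices *fs_devices,
                                         struct alloc_chunk_ctl *ctl)
  {
          u64 zone_size = fs_devices->fs_info->zone_size;

          ctl->max_stripe_size = zone_size;
          ctl->stripe_size = zone_size;    /* stripe_size == zone_size */
          ctl->max_chunk_size = round_down(ctl->max_chunk_size, zone_size);
          ctl->dev_extent_min = zone_size * ctl->dev_stripes;
  }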

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Johannes Thumshirn 3c9daa09cc btrfs: zoned: allow zoned filesystems on non-zoned block devices
Allow running a zoned filesystem on non-zoned devices. This is done by
"slicing up" the block device into fixed-size chunks and faking a
conventional zone on each of them. The emulated zone size is determined
from the size of a device extent.

This is mainly aimed at testing of zoned filesystems, i.e. the zoned
chunk allocator, on regular block devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:21 +01:00
Naohiro Aota 1cb3dc3f79 btrfs: zoned: disallow fitrim on zoned filesystems
The implementation of fitrim depends on the space cache, which is not
used and is disabled for the zoned extent allocator. So the current code
does not work on a zoned filesystem.

In the future, we can implement fitrim for zoned filesystems by enabling
space cache (but, only for fitrim) or scanning the extent tree at fitrim
time.  For now, disallow fitrim on zoned filesystems.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:20 +01:00
Johannes Thumshirn b53429bad3 btrfs: zoned: do not load fs_info::zoned from incompat flag
Don't set the zoned flag in fs_info as soon as we're encountering the
incompat filesystem flag for a zoned filesystem on mount. The zoned flag
in fs_info is in a union together with the zone_size, so setting it too
early will result in setting an incorrect zone_size as well.

Once the correct zone_size is read from the device, we can rely on the
zoned flag in fs_info as well to determine if the filesystem is zoned.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:20 +01:00
Johannes Thumshirn 4afd2fe835 btrfs: release path before calling to btrfs_load_block_group_zone_info
Since we have no write pointer in conventional zones, we cannot
determine the allocation offset from it. Instead, we set the allocation
offset after the highest addressed extent. This is done by reading the
extent tree in btrfs_load_block_group_zone_info().

However, this function is called from btrfs_read_block_groups(), so the
read lock for the tree node could be recursively taken.

To avoid this unsafe locking scenario, release the path before reading
the extent tree to get the allocation offset.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:20 +01:00
Naohiro Aota d6639b35da btrfs: zoned: use regular super block location on zone emulation
A zoned filesystem currently has a superblock at the beginning of the
superblock logging zones if the zones are conventional. This difference
in superblock position causes a chicken-and-egg problem for filesystems
with emulated zones. Since the device is a regular (non-zoned) device,
we cannot know if the filesystem is regular or zoned while reading the
superblock; but to load the superblock, we need to know whether the
zones are emulated or not.

To solve the problem, place the superblocks on regular devices at the
same locations as on a regular filesystem. This is possible because all
the superblock locations are guaranteed to be within an (emulated)
conventional zone on regular devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:19 +01:00
Naohiro Aota 7365104236 btrfs: zoned: defer loading zone info after opening trees
This is a preparation patch to implement zone emulation on a regular
device.

To emulate a zoned filesystem on a regular (non-zoned) device, we need
to decide on an emulated zone size. Instead of making it a compile-time
static value, we'll make it configurable at mkfs time. Since we have the
one zone == one device extent restriction, we can determine the emulated
zone size from the size of a device extent. We can extend
btrfs_get_dev_zone_info() to present a regular device filled with
conventional zones once the zone size is decided.

The current call site of btrfs_get_dev_zone_info() during the mount
process comes before the filesystem trees are loaded, so we don't know
the size of a device extent at that point. Thus we can't slice a regular
device into conventional zones there.

This patch introduces btrfs_get_dev_zone_info_all_devices() to load the
zone info for all the devices, and calls it in open_ctree() after
loading the trees.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:16 +01:00
Naohiro Aota c3b0e880bb iomap: support REQ_OP_ZONE_APPEND
A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) so that it is not split.
bio_iov_iter_get_pages() builds such a restricted bio using
__bio_iov_append_get_pages() if bio_op(bio) == REQ_OP_ZONE_APPEND.

To utilize it, we need to set the bio_op before calling
bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
that an iomap user can set the flag to indicate they want
REQ_OP_ZONE_APPEND and a restricted bio.
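
A sketch of both sides of the flag (the filesystem-side condition is
illustrative):

  /* Filesystem side: ask iomap for zone append bios on this mapping. */
  if (write && btrfs_use_zone_append(BTRFS_I(inode), em->block_start))
          iomap->flags |= IOMAP_F_ZONE_APPEND;

  /* iomap side: choose the bio op before bio_iov_iter_get_pages() so
   * the bio is built within the zone append size limits. */
  if (iomap->flags & IOMAP_F_ZONE_APPEND)
          bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
  else
          bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;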

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 00:52:19 +01:00
Eric Whitney 3258386aba ext4: reset retry counter when ext4_alloc_file_blocks() makes progress
Change the retry policy in ext4_alloc_file_blocks() to allow for a full
retry cycle whenever a portion of an allocation request has been
fulfilled.  A large allocation request often results in multiple calls
to ext4_map_blocks(), each of which is potentially subject to a
temporary ENOSPC condition and retry cycle.  The current code only
allows for a single retry cycle.

This patch does not address a known bug or reported complaint.
However, it should make block allocation for fallocate and zero range
more robust.

In addition, simplify the conditional controlling the allocation while
loop, where testing len alone is sufficient.  Remove the assignment to
ret2 in the error path after the call to ext4_map_blocks() since its
value isn't subsequently used.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20210113221403.18258-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-08 18:03:56 -05:00
Filipe Manana 72c9925f87 btrfs: fix extent buffer leak on failure to copy root
At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
returning without unlocking and releasing our reference on the extent
buffer named "cow" we previously allocated with btrfs_alloc_tree_block().

So fix that by unlocking the extent buffer and dropping our reference on
it before returning.

Fixes: be20aa9dba ("Btrfs: Add mount option to turn off data cow")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:04 +01:00
Qu Wenruo 2c4d8cb737 btrfs: explain page locking and readahead in read_extent_buffer_pages()
In read_extent_buffer_pages(), if we fail to lock the page atomically,
we just exit with return value 0.

This is counter-intuitive, as normally if we can't lock what we need, we
would return something like EAGAIN.

But that return hides under the (wait == WAIT_NONE) branch, which only
gets triggered for readahead.

And for readahead, if we fail to lock the page, it means the extent
buffer is either being read by another thread, or has been read and is
under modification. Either way the eb will be or has already been
cached, thus readahead has no need to wait for it.

Add a comment on this counter-intuitive behavior.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:04 +01:00
Qu Wenruo 0bb3eb3ee8 btrfs: allow read-only mount of 4K sector size fs on 64K page system
This adds the basic RO mount capability for 4K sector size on a 64K page
system.

Currently we only plan to support 4K and 64K page systems.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 92082d4097 btrfs: integrate page status update for data read path into begin/end_page_read
In the btrfs data page read path, page status updates are handled in two
different locations:

  btrfs_do_read_page()
  {
	while (cur <= end) {
		/* No need to read from disk */
		if (HOLE/PREALLOC/INLINE){
			memset();
			set_extent_uptodate();
			continue;
		}
		/* Read from disk */
		ret = submit_extent_page(end_bio_extent_readpage);
	}
  }

  end_bio_extent_readpage()
  {
	endio_readpage_uptodate_page_status();
  }

This is fine for the sectorsize == PAGE_SIZE case, as in the above loop
we should only hit one branch and then exit.

But for subpage, there is more work to be done in page status update:

- Page unlock condition
  Unlike regular page size == sectorsize case, we can no longer just
  unlock a page.
  Only the last reader of the page can unlock the page.
  This means, we can unlock the page either in the while() loop, or in
  the endio function.

- Page uptodate condition
  Since we have multiple sectors to read for a page, we can only mark
  the full page uptodate if all sectors are uptodate.

To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:

- begin_page_read()
  For the regular case, it does nothing.
  For the subpage case, it updates the reader counters so that later
  end_page_read() can know who is the last one to unlock the page.

- end_page_read()
  This is just endio_readpage_uptodate_page_status() renamed.
  The original name is a little too long and too specific for an endio
  function.

  The new addition is the condition for the page unlock.
  Now for subpage data, we unlock the page if we're the last reader.

This not only provides the basis for subpage data read, but also hides
the special handling of page reads from the main read loop.

Also, since we're changing how the page lock is handled, there are two
existing error paths where we need to manually unlock the page before
calling begin_page_read().
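
A sketch of the pair (btrfs_subpage_start_reader()/end_reader() are the
reader-count helpers this series introduces; signatures simplified):

  static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
  {
          ASSERT(PageLocked(page));
          if (fs_info->sectorsize == PAGE_SIZE)
                  return;
          /* Subpage: record how many readers must finish before unlock. */
          btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
  }

  static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
  {
          struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);

          if (uptodate)
                  btrfs_page_set_uptodate(fs_info, page, start, len);

          if (fs_info->sectorsize == PAGE_SIZE)
                  unlock_page(page);
          else
                  /* Only the last reader of the page unlocks it. */
                  btrfs_subpage_end_reader(fs_info, page, start, len);
  }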

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 32443de338 btrfs: introduce btrfs_subpage for data inodes
To support subpage sector size, data also needs extra info to track
which sectors in a page are uptodate/dirty/...

This patch will make pages for data inodes get btrfs_subpage structure
attached, and detached when the page is freed.

This patch also slightly changes the timing when
set_page_extent_mapped() is called to make sure:

- We have page->mapping set
  page->mapping->host is used to grab btrfs_fs_info, thus we can only
  call this function after page is mapped to an inode.

  One call site attaches pages to inode manually, thus we have to modify
  the timing of set_page_extent_mapped() a bit.

- As soon as possible, before other operations
  Since memory allocation can fail, we have to do extra error handling.
  Calling set_page_extent_mapped() as soon as possible can simplify the
  error handling for several call sites.

The idea is pretty much the same as iomap_page, but with more bitmaps
for btrfs specific cases.

Currently the plan is to switch to iomap if iomap can provide
sector-aligned write back (only write back dirty sectors, not the full
page; data balance requires this feature).

So we will stick to btrfs specific bitmap for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 371cdc0700 btrfs: introduce subpage metadata validation check
For the subpage metadata validation check, there are some differences:

- Read must finish in one bvec
  Since we're only reading one subpage range in one page, it should
  never be split into two bios nor two bvecs.

- How to grab the existing eb
  Instead of grabbing the eb using page->private, we have to search the
  radix tree, as we don't have any direct pointer at hand.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 4325cb2293 btrfs: support subpage in endio_readpage_update_page_status()
To handle subpage status update, add the following:

- Use btrfs_page_*() subpage-aware helpers to update page status
  Now we can handle both cases well.

- No page unlock for subpage metadata
  Since subpage metadata doesn't utilize page locking at all, skip it.
  For subpage data locking, it's handled in later commits.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 4012daf769 btrfs: introduce read_extent_buffer_subpage()
Introduce a helper, read_extent_buffer_subpage(), to do the subpage
extent buffer read.

The differences between the regular and subpage routines are:

- No page locking
  Here we completely rely on extent locking.
  Page locking can reduce the concurrency greatly, as if we lock one
  page to read one extent buffer, all the other extent buffers in the
  same page will have to wait.

- Extent uptodate condition
  Despite the existing PageUptodate() and EXTENT_BUFFER_UPTODATE checks,
  we also need to check btrfs_subpage::uptodate_bitmap.

- No page iteration
  Just one page, no need to loop; this greatly simplifies the subpage
  routine.

This patch only implements the bio submit part, no endio support yet.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo d1e86e3fc3 btrfs: support subpage in try_release_extent_buffer()
Unlike the original try_release_extent_buffer(),
try_release_subpage_extent_buffer() will iterate through all the ebs in
the page, and try to release each.

We can release the full page only after there's no private attached,
which means all ebs of that page have been released as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 92d83e9436 btrfs: support subpage in btrfs_clone_extent_buffer
For btrfs_clone_extent_buffer(), it's mostly the same code as
__alloc_dummy_extent_buffer(), except it has an extra page copy.

So to make it subpage compatible, we only need to:

- Call set_extent_buffer_uptodate() instead of SetPageUptodate()
  This will set correct uptodate bit for subpage and regular sector size
  cases.

Since we're calling set_extent_buffer_uptodate() which will also set
EXTENT_BUFFER_UPTODATE bit, we don't need to manually set that bit
either.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 251f2acc71 btrfs: support subpage in set/clear_extent_buffer_uptodate()
To support subpage in set_extent_buffer_uptodate and
clear_extent_buffer_uptodate we only need to use the subpage-aware
helpers to update the page bits.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 03a816b32b btrfs: introduce helpers for subpage error status
Introduce the following functions to handle subpage error status:

- btrfs_subpage_set_error()
- btrfs_subpage_clear_error()
- btrfs_subpage_test_error()
  These helpers can only be called when the page has subpage attached
  and the range is ensured to be inside the page.

- btrfs_page_set_error()
- btrfs_page_clear_error()
- btrfs_page_test_error()
  These helpers can handle both regular sector size and subpage without
  problem.
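
A sketch of one subpage helper and its page-level wrapper (the bitmap
field name is assumed):

  /* Sketch: mark a sector range of the page as having an IO error. */
  void btrfs_subpage_set_error(const struct btrfs_fs_info *fs_info,
                               struct page *page, u64 start, u32 len)
  {
          struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
          const int start_bit = offset_in_page(start) >> fs_info->sectorsize_bits;
          const int nbits = len >> fs_info->sectorsize_bits;
          unsigned long flags;

          spin_lock_irqsave(&subpage->lock, flags);
          bitmap_set(subpage->error_bitmap, start_bit, nbits);
          SetPageError(page);
          spin_unlock_irqrestore(&subpage->lock, flags);
  }

  /* Sketch: the page-level wrapper handles the regular case too. */
  void btrfs_page_set_error(const struct btrfs_fs_info *fs_info,
                            struct page *page, u64 start, u32 len)
  {
          if (fs_info->sectorsize == PAGE_SIZE)
                  SetPageError(page);
          else
                  btrfs_subpage_set_error(fs_info, page, start, len);
  }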

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo a1d767c11c btrfs: introduce helpers for subpage uptodate status
Introduce the following functions to handle subpage uptodate status:

- btrfs_subpage_set_uptodate()
- btrfs_subpage_clear_uptodate()
- btrfs_subpage_test_uptodate()
  These helpers can only be called when the page has subpage attached
  and the range is ensured to be inside the page.

- btrfs_page_set_uptodate()
- btrfs_page_clear_uptodate()
- btrfs_page_test_uptodate()
  These helpers can handle both regular sector size and subpage, although
  the caller should still ensure that the range is inside the page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 09bc1f0fb8 btrfs: attach private to dummy extent buffer pages
There are locations where we allocate dummy extent buffers for temporary
usage, like in tree_mod_log_rewind() or get_old_root().

These dummy extent buffers will be handled by the same eb accessors, and
if they don't have page::private, the subpage eb accessors could fail.

To address such problems, make __alloc_dummy_extent_buffer() attach
page private for dummy extent buffers too.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 8ff8466d29 btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.

Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.

For subpage case, handle detaching page private.

For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.

For mapped ebs, we have to ensure there are no ebs left in the page
range before we delete the private, as page->private is shared between
all ebs in the same page.

But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:

  Extent buffer A is the new extent buffer which will be allocated,
  while extent buffer B is the last existing extent buffer of the page.

  		T1 (eb A) 	 |		T2 (eb B)
  -------------------------------+------------------------------
  alloc_extent_buffer()		 | btrfs_release_extent_buffer_pages()
  |- p = find_or_create_page()   | |
  |- attach_extent_buffer_page() | |
  |				 | |- detach_extent_buffer_page()
  |				 |    |- if (!page_range_has_eb())
  |				 |    |  No new eb in the page range yet
  |				 |    |  As new eb A hasn't yet been
  |				 |    |  inserted into radix tree.
  |				 |    |- btrfs_detach_subpage()
  |				 |       |- detach_page_private();
  |- radix_tree_insert()	 |

  Then we have a metadata eb whose page has no private bit.

To avoid such a race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.

In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock.  Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.

The section is marked by:

- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 819822107d btrfs: make grab_extent_buffer_from_page() handle subpage case
For the subpage case, grab_extent_buffer() can't really get an extent
buffer just from btrfs_subpage.

We have the radix tree lock protecting us from inserting the same eb
into the tree. Thus we don't really need the extra hassle, just let
alloc_extent_buffer() handle the existing eb in the radix tree.

Now if two ebs are being allocated at the same time, one will fail with
-EEXIST when inserting into the radix tree.

So for grab_extent_buffer(), just always return NULL for the subpage
case.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Qu Wenruo 760f991f14 btrfs: make attach_extent_buffer_page() handle subpage case
For the subpage case, we need to allocate additional memory for each
metadata page.

So we need to:

- Allow attach_extent_buffer_page() to return int to indicate allocation
  failure

- Allow manual pre-allocation of subpage memory for alloc_extent_buffer()
  As we don't want to use GFP_ATOMIC under a spinlock, we introduce
  btrfs_alloc_subpage() and btrfs_free_subpage() functions for this
  purpose.
  (The simple wrapper for btrfs_free_subpage() is for a later conversion
   to kmem_cache; already tested internally without problems.)

- Preallocate btrfs_subpage structure for alloc_extent_buffer()
  We don't want to call memory allocation with spinlock held, so
  do preallocation before we acquire mapping->private_lock.

- Handle subpage and regular case differently in
  attach_extent_buffer_page()
  For regular case, no change, just do the usual thing.
  For subpage case, allocate new memory or use the preallocated memory.

For future subpage metadata, we will make use of radix tree to grab
extent buffer.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Qu Wenruo cac06d843f btrfs: introduce the skeleton of btrfs_subpage structure
For sectorsize < page size support, we need a structure to record extra
status info for each sector of a page.

Introduce the skeleton structure; all subpage-related code will go into
subpage.[ch].

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Qu Wenruo 62c053fbb2 btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support
For the incoming subpage support, UNMAPPED extent buffers will have
different behavior in btrfs_release_extent_buffer().

This means we need to set UNMAPPED bit early before calling
btrfs_release_extent_buffer().

Currently there is only one caller that relies on
btrfs_release_extent_buffer() in its error path while setting the
UNMAPPED bit late:
- btrfs_clone_extent_buffer()

Make it subpage compatible by setting the UNMAPPED bit early; since
we're here, also move setting the UPTODATE bit earlier.

There is another caller, __alloc_dummy_extent_buffer(), that sets the
UNMAPPED bit late, but that function cleans up the allocated pages
manually, thus no modification is needed.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Qu Wenruo 6869b0a8be btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in
__process_pages_contig(), to let the function know to clear page dirty
bit and then set page writeback.

However, the page writeback and dirty bits are mutually exclusive (at
least for the sector size == PAGE_SIZE case), which means these two
always have to be updated together.

This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to
PAGE_START_WRITEBACK.
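
The resulting handling in __process_pages_contig() becomes a single
branch; a minimal sketch:

  /* Sketch: one flag performs both transitions, always together. */
  if (page_ops & PAGE_START_WRITEBACK) {
          clear_page_dirty_for_io(page);
          set_page_writeback(page);
  }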

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Filipe Manana d0c2f4fa55 btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Often an fsync needs to fall back to a transaction commit for several
reasons (to ensure consistency after a power failure, a new block group
was allocated, or a temporary error such as ENOMEM or ENOSPC happened).

In that case the log is marked as needing a full commit, and any
concurrent tasks attempting to log inodes or commit the log will also
fall back to the transaction commit. When this happens they all wait for
the task that first started the transaction commit to finish it. However
they wait until the full transaction commit happens, which is not
needed: they only need to wait for the superblocks to be persisted, not
for the unpinning of all the extents pinned during the transaction's
lifetime. Even for short-lived transactions the pinned extents can
number a few thousand and take a significant amount of time to unpin;
for dbench workloads I have observed up to 4~5 milliseconds spent
unpinning extents in the worst cases, with the number of pinned extents
between 2 and 3 thousand.

So allow fsync tasks to skip waiting for the unpinning of extents when
they call btrfs_commit_transaction() and they were not the task that
started the transaction commit (that one has to do it, the alternative
would be to offload the transaction commit to another task so that it
could avoid waiting for the extent unpinning or offload the extent
unpinning to another task).

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

After applying the entire patchset, dbench shows improvements in both
throughput and latency. The script used to measure this is the
following:

  $ cat dbench-test.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 300 64

  umount $MNT

The test was run on a physical machine with 12 cores (Intel corei7), 64G
of ram, using a NVMe device and a non-debug kernel configuration (Debian's
default configuration).

Before applying patchset, 32 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9627107     0.153    61.938
 Close        7072076     0.001     3.175
 Rename        407633     1.222    44.439
 Unlink       1943895     0.658    44.440
 Deltree          256    17.339   110.891
 Mkdir            128     0.003     0.009
 Qpathinfo    8725406     0.064    17.850
 Qfileinfo    1529516     0.001     2.188
 Qfsinfo      1599884     0.002     1.457
 Sfileinfo     784200     0.005     3.562
 Find         3373513     0.411    30.312
 WriteX       4802132     0.053    29.054
 ReadX       15089959     0.002     5.801
 LockX          31344     0.002     0.425
 UnlockX        31344     0.001     0.173
 Flush         674724     5.952   341.830

Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms

After applying patchset, 32 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9931568     0.111    25.597
 Close        7295730     0.001     2.171
 Rename        420549     0.982    49.714
 Unlink       2005366     0.497    39.015
 Deltree          256    11.149    89.242
 Mkdir            128     0.002     0.014
 Qpathinfo    9001863     0.049    20.761
 Qfileinfo    1577730     0.001     2.546
 Qfsinfo      1650508     0.002     3.531
 Sfileinfo     809031     0.005     5.846
 Find         3480259     0.309    23.977
 WriteX       4952505     0.043    41.283
 ReadX       15568127     0.002     5.476
 LockX          32338     0.002     0.978
 UnlockX        32338     0.001     2.032
 Flush         696017     7.485   228.835

Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms

 --> +4.1% throughput, -39.6% max latency

Before applying patchset, 64 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    8956748     0.342   108.312
 Close        6579660     0.001     3.823
 Rename        379209     2.396    81.897
 Unlink       1808625     1.108   131.148
 Deltree          256    25.632   172.176
 Mkdir            128     0.003     0.018
 Qpathinfo    8117615     0.131    55.916
 Qfileinfo    1423495     0.001     2.635
 Qfsinfo      1488496     0.002     5.412
 Sfileinfo     729472     0.007     8.643
 Find         3138598     0.855    78.321
 WriteX       4470783     0.102    79.442
 ReadX       14038139     0.002     7.578
 LockX          29158     0.002     0.844
 UnlockX        29158     0.001     0.567
 Flush         627746    14.168   506.151

Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms

After applying patchset, 64 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9069003     0.303    43.193
 Close        6662328     0.001     3.888
 Rename        383976     2.194    46.418
 Unlink       1831080     1.022    43.873
 Deltree          256    24.037   155.763
 Mkdir            128     0.002     0.005
 Qpathinfo    8219173     0.137    30.233
 Qfileinfo    1441203     0.001     3.204
 Qfsinfo      1507092     0.002     4.055
 Sfileinfo     738775     0.006     5.431
 Find         3177874     0.936    38.170
 WriteX       4526152     0.084    39.518
 ReadX       14213562     0.002    24.760
 LockX          29522     0.002     1.221
 UnlockX        29522     0.001     0.694
 Flush         635652    14.358   422.039

Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms

 --> +6.8% throughput, -18.1% max latency

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Filipe Manana 64d6b281ba btrfs: remove unnecessary check_parent_dirs_for_sync()
Whenever we fsync an inode that is a directory, or a regular file that
was either created in the current transaction or has last_unlink_trans
set to the generation of the current transaction, we check if any of its
ancestor inodes (and the inode itself, if it is a directory) cannot be
logged and needs a fallback to a full transaction commit. If so, we
return with a value of 1 in order to fall back to a transaction commit.

However we often do not need to fallback to a transaction commit because:

1) The ancestor inode is not an immediate parent, and therefore there is
   no explicit request to log it, and it is needed neither to guarantee
   the consistency of the inode originally asked to be logged (fsynced)
   nor that of its immediate parent;

2) The ancestor inode was already logged before, in which case any link,
   unlink or rename operation updates the log as needed.

So for these two cases we can avoid an unnecessary transaction commit.
Therefore remove check_parent_dirs_for_sync() and add a check at the top
of btrfs_log_inode() to make us fall back immediately to a transaction
commit when we are logging a directory inode that cannot be logged and
needs a full transaction commit. All we need to protect is the case
where, after renaming a file, someone fsyncs only the old directory,
which would result in losing the renamed file after a log replay.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Filipe Manana 0e44cb3f94 btrfs: skip logging inodes already logged when logging new entries
When logging new directory entries of a directory, we log the inodes of
new dentries and the inodes of dentries pointing to directories that
may have been created in past transactions. For the case of directories
we log in full mode, which can be particularly expensive for large
directories.

We do use btrfs_inode_in_log() to skip already logged inodes, however
for that helper to return true it requires that the log transaction used
to log the inode be already committed.
than one task using the same log transaction we can end up logging an
inode multiple times, which is a waste of time and not necessary since
the log will be committed by one of the tasks and the others will wait for
the log transaction to be committed before returning to user space.

So simply replace the use of btrfs_inode_in_log() with the new helper
function need_log_inode(), introduced in a previous commit.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Filipe Manana 3e6a86a193 btrfs: skip logging directories already logged when logging all parents
Sometimes when we fsync an inode we need to do a full log of all its
ancestors (due to unlink, link or rename operations), which can be an
expensive operation, especially if the directories are large.

However if we find an ancestor directory inode that is already logged in
the current transaction, and has no inserted/updated/deleted xattrs since
it was last logged, we can skip logging the directory again. We are safe
to skip that since we know that for logged directories, any link, unlink
or rename operations that implicate the directory will update the log as
necessary.

So use the helper need_log_inode(), introduced in a previous commit, to
detect already logged directories that can be skipped.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Filipe Manana ab12313a9f btrfs: avoid logging new ancestor inodes when logging new inode
When we fsync a new file, created in the current transaction, we check
all its ancestor inodes and always log them if they were created in the
current transaction - even if we have already logged them before, which
is a waste of time.

So avoid logging new ancestor inodes if they were already logged before
and have no xattrs added/updated/removed since they were last logged.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Filipe Manana e593e54ed1 btrfs: stop setting nbytes when filling inode item for logging
When we fill an inode item for logging we are setting its nbytes field
with the value returned by inode_get_bytes() (a VFS API), however we do
not need it because it is not used during log replay. In fact, for fast
fsyncs, when we call inode_get_bytes() we may even get an outdated value
for nbytes because the nbytes field of the inode is only updated when
ordered extents complete, and a fast fsync only waits for writeback to
complete; it does not wait for ordered extent completion.

So just remove the setup of nbytes and add an explicit comment mentioning
why we do not set it. This also avoids adding contention on the inode's
i_lock (VFS) with concurrent stat() calls, since that spinlock is used by
inode_get_bytes() which is also called by our stat callback
(btrfs_getattr()).
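
The change in the logging path then amounts to roughly the following
(a sketch; the setter is the token-based helper used when filling the
inode item):

  -	btrfs_set_token_inode_nbytes(&token, item, inode_get_bytes(inode));
  +	/*
  +	 * nbytes is not used by log replay, and inode_get_bytes() would
  +	 * take the inode's i_lock, contending with concurrent stat()
  +	 * calls; so do not set it at all.
  +	 */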

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Filipe Manana ddffcf6fb5 btrfs: remove unnecessary directory inode item update when deleting dir entry
When we remove a directory entry, as part of an unlink operation, if the
directory was logged before we must remove the directory index items from
the log. We are also updating the inode item of the directory to update
its i_size, but that is not necessary because during log replay we do not
need it and we correctly adjust the i_size in the inode item of the
subvolume as we process directory index items and replay deletes.

This is not needed since commit d555438b6e ("Btrfs: drop dir i_size
when adding new names on replay"), where we explicitly ignore the i_size
of directory inode items on log replay. Before that we used it but it
was buggy as mentioned in that commit's change log (i_size got a larger
value than it should have).

So stop updating the i_size of the directory inode item in the log, as
that is a waste of time, adds more lock contention on the log tree and
often results in COWing more extent buffers for the log tree.

This code path is triggered often during dbench workloads for example.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Michal Rostecki 4203431319 btrfs: let callers of btrfs_get_io_geometry pass the em
Before this change, the btrfs_get_io_geometry() function was calling
btrfs_get_chunk_map() to get the extent mapping, necessary for
calculating the I/O geometry. It was using that extent mapping only
internally and freeing the pointer after its execution.

That resulted in btrfs_get_chunk_map() effectively being called twice by
the __btrfs_map_block() function: it was calling btrfs_get_io_geometry()
first and then calling btrfs_get_chunk_map() directly to get the extent
mapping used by the rest of the function.

Change that to passing the extent mapping to the btrfs_get_io_geometry()
function as an argument.

This could improve performance in some cases.  For very large
filesystems, i.e. several thousands of allocated chunks, not only does
this avoid searching the rbtree twice, saving time, it may also help
reduce contention on the lock that protects the tree - think of
writeback starting for multiple inodes, other tasks allocating or
removing chunks, and anything else that requires access to the rbtree.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Michal Rostecki <mrostecki@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add Filipe's analysis ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Qu Wenruo 951c80f83d btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidatepage
Commit dbfdb6d1b3 ("Btrfs: Search for all ordered extents that could
span across a page") made btrfs_invalidatepage() search all ordered
extents.

The offending code looks like this:

  again:
	  start = page_start;
	  ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
	  if (ordered) {
		  end = min(page_end,
			    ordered->file_offset + ordered->num_bytes - 1);

		  /* Do the cleanup */

		  start = end + 1;
		  if (start < page_end)
			  goto again;
	  }

The behavior is indeed necessary for the incoming subpage support, but
when it iterates through all the ordered extents, it also resets the
search range @start.

This means, for the following case, we can double account an ordered
extent, causing its bytes_left to underflow:

	Page offset
	0		16K		32K
	|<--- OE 1  --->|<--- OE 2 ---->|

The first iteration will find ordered extent (OE) 1, which doesn't
cover the full page, so after the cleanup code we need to retry. But
the again label resets start to page_start, so we find OE 1 again,
which causes double accounting on OE 1 and makes its bytes_left
underflow.

This problem can only happen in the subpage case; for the regular
sectorsize == PAGE_SIZE case we will always find an OE that ends at or
after the page end, so there is no way to trigger the problem.

Move the again label after start = page_start.  There will be more
comprehensive rework to convert the open coded loop to a proper while
loop for subpage support.
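
With the label moved, the flow becomes the following (same snippet as
above; start is no longer reset on each retry):

	  start = page_start;
  again:
	  ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
	  if (ordered) {
		  end = min(page_end,
			    ordered->file_offset + ordered->num_bytes - 1);

		  /* Do the cleanup */

		  start = end + 1;
		  if (start < page_end)
			  goto again;
	  }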

Fixes: dbfdb6d1b3 ("Btrfs: Search for all ordered extents that could span across a page")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Abaci Team a4559e6f6f btrfs: simplify condition in __btrfs_run_delayed_items
Fix the following coccicheck warnings:

./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is
equivalent to !A || B.
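
For illustration, the flagged pattern and its simplification look like
this (a made-up condition, not the exact line from delayed-inode.c):

  - if (!async || (async && count >= limit))
  + if (!async || count >= limit)

When async is false the first disjunct is already true, and when async
is true the expression reduces to the second operand, so the two forms
are equivalent.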

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Suggested-by: Jiapeng Zhong <oswb@linux.alibaba.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Abaci Team <abaci-bugfix@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Filipe Manana 2965194b77 btrfs: remove wrong comment for can_nocow_extent()
The comment for can_nocow_extent() says that the function will flush
ordered extents, however that never happens and was never true before the
comment was added in commit e4ecaf90bc ("btrfs: add comments for
btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for
the function btrfs_check_can_nocow(), which after that commit was renamed
to check_can_nocow(). So just remove that part of the comment.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik e5ad49e215 btrfs: add a trace class for dumping the current ENOSPC state
Often when I'm debugging ENOSPC related issues I have to resort to
printing the entire ENOSPC state with trace_printk() in different spots.
This gets pretty annoying, so add a trace class that dumps this state for us.
Then add a trace point at the end of preemptive flushing so you can see
the state of the space_info when we decide to exit preemptive flushing.
This helped me figure out we weren't kicking in the preemptive flushing
soon enough.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik 4b02b00fe5 btrfs: adjust the flush trace point to include the source
Since we have normal ticketed flushing and preemptive flushing, adjust
the tracepoint so that we know the source of the flushing action to make
it easier to debug problems.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik 88a777a6e5 btrfs: implement space clamping for preemptive flushing
Starting preemptive flushing at 50% of available free space is a good
start, but some workloads are particularly abusive and can quickly
overwhelm the preemptive flushing code and drive us into using tickets.

Handle this by clamping down on our threshold for starting and
continuing to run preemptive flushing.  This is particularly important
for our overcommit case, as we can really drive the file system into
overages and then it's more difficult to pull it back as we start to
actually fill up the file system.

The clamping is essentially 2^CLAMP, but we start at 1 so whatever we
calculate for overcommit is the baseline.
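
A minimal sketch of that threshold, assuming illustrative variable
names (the real code derives the free and used amounts from the
space_info counters):

  /*
   * clamp starts at 1, so the overcommit calculation is the baseline;
   * each increment halves the threshold again (essentially dividing
   * by 2^clamp).
   */
  u64 thresh = free_space >> clamp;
  bool keep_flushing = used_space > thresh;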

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik 2e294c6049 btrfs: simplify the logic in need_preemptive_flushing
A lot of this was added all in one go with no explanation, and is a bit
unwieldy and confusing.  Simplify the logic to start preemptive flushing
if we've reserved more than half of our available free space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik 9f42d37748 btrfs: rework btrfs_calc_reclaim_metadata_size
Currently btrfs_calc_reclaim_metadata_size does two things: it returns
the space currently required for flushing by the tickets, and if there
are no tickets it calculates a value for the preemptive flushing.

However for the normal ticketed flushing we really only care about the
space required for tickets.  We will accidentally come in and flush one
time, but as soon as we see there are no tickets we bail out of our
flushing.

Fix this by making btrfs_calc_reclaim_metadata_size really only tell us
what is required for flushing if we have people waiting on space.  Then
move the preemptive flushing logic into need_preemptive_reclaim().  We
ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim()
because if we are in this path then we made our reservation and there
are no pending tickets currently, so we do not need to check it, simply
do the fuzzy logic to check if we're getting low on space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik f205edf773 btrfs: check reclaim_size in need_preemptive_reclaim
If we're flushing space for tickets then we have
space_info->reclaim_size set and we do not need to do background
reclaim.
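
In code form this is essentially an early bail at the top of the helper
(a sketch; reclaim_size is the space_info member named above):

  /* Tickets are already driving reclaim, no need to flush preemptively. */
  if (space_info->reclaim_size)
          return false;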

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik ae7913ba52 btrfs: rename need_do_async_reclaim
All of our normal flushing is asynchronous reclaim, so this helper is
poorly named.  This is more checking if we need to preemptively flush
space, so rename it to need_preemptive_reclaim.

Also switch it to bool and make it plain static as followup patches will
move more code here.
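
After the rename the declaration would look roughly like:

  static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
                                      struct btrfs_space_info *space_info);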

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik 576fa34830 btrfs: improve preemptive background space flushing
Currently, if we ever have to flush space because we do not have enough,
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.

However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.

In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.

When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing.  However this meant that we essentially
never preemptively flushed.  This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.

The behavior that happened pre that patch was btrfs_end_transaction()
would see that we were low on space, and it would commit the
transaction.  This was bad because in this particular case you could end
up with thousands and thousands of transactions being committed during
the 5 minute reproducer.  With the patch to remove this behavior we got
much more sane transaction commits, but we ended up slower because we
would write for a while, flush, write for a while, flush again.

To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketed flushing in that
it doesn't have tickets to base its decisions on.  Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing.  Here we will attempt to be slightly
intelligent about the things that we flush, attempting to balance
between whichever pool is taking up the most space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik f00c42dd4c btrfs: introduce a FORCE_COMMIT_TRANS flush operation
Solely for preemptive flushing, we want to be able to force the
transaction commit without any of the ambiguity of
may_commit_transaction().  This is because may_commit_transaction()
checks tickets and such, and in preemptive flushing we already know
it'll be helpful, so use this to keep the code nice and clean and
straightforward.
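
Inside flush_space() the new state then amounts to a plain
unconditional commit, roughly (a sketch with abbreviated error
handling; trans, root and ret are assumed to be in scope):

  case FORCE_COMMIT_TRANS:
          trans = btrfs_join_transaction(root);
          if (IS_ERR(trans)) {
                  ret = PTR_ERR(trans);
                  break;
          }
          ret = btrfs_commit_transaction(trans);
          break;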

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik 5deb17e18e btrfs: track ordered bytes instead of just dio ordered bytes
We track dio_bytes because the shrink delalloc code needs to know if we
have more DIO in flight than we have normal buffered IO.  The reason for
this is because we can't "flush" DIO, we have to just wait on the
ordered extents to finish.

However this is true of all ordered extents.  If we have more ordered
space outstanding than dirty pages we should be waiting on ordered
extents.  We already are ok on this front technically, because we always
do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in
the preemptive flushing code as well, so change this to count all
ordered bytes instead of just DIO ordered bytes.
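
Conceptually the counter just widens in scope; a sketch, assuming a
percpu counter on fs_info named ordered_bytes:

  /* Account every ordered extent at creation... */
  percpu_counter_add_batch(&fs_info->ordered_bytes, num_bytes,
                           fs_info->delalloc_batch);

  /* ...and drop the bytes again when the ordered extent completes. */
  percpu_counter_add_batch(&fs_info->ordered_bytes, -num_bytes,
                           fs_info->delalloc_batch);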

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik ac1ea10e75 btrfs: add a trace point for reserve tickets
While debugging an ENOSPC related performance problem I needed to see the
time difference between start and end of a reserve ticket, so add a
trace point to report when we handle a reserve ticket.

I opted to spit out start_ns itself without calculating the difference
because there could be a gap between enabling the tracepoint and setting
start_ns.  Doing it this way allows us to filter on 0 start_ns so we
don't get bogus entries, and we can easily calculate the time difference
with bpftrace or something else.
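
The call site would then look something like this sketch (argument
order and names are assumptions):

  trace_btrfs_reserve_ticket(fs_info, space_info->flags, orig_bytes,
                             ticket.start_ns, flush, ticket.error);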

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik 91e79a83ff btrfs: make flush_space take an enum btrfs_flush_state instead of int
I got an automated message from somebody who runs clang against our
kernels, because I used the wrong enum type for what I passed into
flush_space, caught by -Wenum-conversion.  Change the argument to be
explicitly the enum we're expecting, to make everything consistent.
Maybe eventually gcc will catch errors like this.
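
The before/after prototypes, roughly:

  /* Before: */
  static void flush_space(struct btrfs_fs_info *fs_info,
                          struct btrfs_space_info *space_info, u64 num_bytes,
                          int state);

  /* After: */
  static void flush_space(struct btrfs_fs_info *fs_info,
                          struct btrfs_space_info *space_info, u64 num_bytes,
                          enum btrfs_flush_state state);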

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Roman Anasal 8898038309 btrfs: send: use struct send_ctx *sctx for btrfs_compare_trees and changed_cb
btrfs_compare_trees and changed_cb use a void *ctx parameter instead of
struct send_ctx *sctx but when used in changed_cb it is immediately
cast to `struct send_ctx *sctx = ctx;`.

changed_cb is only ever called from btrfs_compare_trees and full_send_tree:
- full_send_tree already passes a struct send_ctx *sctx
- btrfs_compare_trees is only called by send_subvol with a struct send_ctx *sctx
- void *ctx in btrfs_compare_trees is only used to be passed to changed_cb

So casting to/from void *ctx seems unnecessary and directly using
struct send_ctx *sctx instead provides better type-safety.
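
The resulting prototype change for the callback, roughly:

  /* Before: */
  static int changed_cb(struct btrfs_path *left_path,
                        struct btrfs_path *right_path,
                        struct btrfs_key *key,
                        enum btrfs_compare_tree_result result,
                        void *ctx);

  /* After: */
  static int changed_cb(struct btrfs_path *left_path,
                        struct btrfs_path *right_path,
                        struct btrfs_key *key,
                        enum btrfs_compare_tree_result result,
                        struct send_ctx *sctx);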

The original reason for using void *ctx in the first place seems to have
been dropped with 1b51d6fce4 ("btrfs: send: remove indirect callback
parameter for changed_cb").

Signed-off-by: Roman Anasal <roman.anasal@bdsu.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Josef Bacik 488bc2a2d2 btrfs: run delayed refs less often in commit_cowonly_roots
We love running delayed refs in commit_cowonly_roots, but it is a bit
excessive.  I was seeing cases of running 3 or 4 refs a few times in a
row during this time.  Instead simply:

- update all of the roots first
- then run delayed refs
- then handle the empty block groups case
- and then if we have any more dirty roots do the whole thing again

This allows us to be much more efficient with our delayed ref running,
as we can batch a few more operations at once.
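
In rough pseudo-C the reordered flow is (the helper names here are
hypothetical stand-ins for the steps above):

  do {
          update_all_dirty_cowonly_roots(trans);            /* step 1 */
          btrfs_run_delayed_refs(trans, (unsigned long)-1); /* step 2 */
          handle_empty_block_groups(trans);                 /* step 3 */
  } while (more_dirty_cowonly_roots(trans));                /* step 4 */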

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Josef Bacik dac348e925 btrfs: stop running all delayed refs during snapshot
This was added in commit 361048f586 ("Btrfs: fix full backref problem
when inserting shared block reference") to address a problem where we
hit the following BUG_ON() in alloc_reserved_tree_block

        if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
                BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));

However this BUG_ON() is bogus, and was removed by previous commit:

  btrfs: remove bogus BUG_ON in alloc_reserved_tree_block

We no longer need to run delayed refs because of this, and can remove
this flushing here.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Josef Bacik b7774425e0 btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
The fix 361048f586 ("Btrfs: fix full backref problem when inserting
shared block reference") added a delayed ref flushing at subvolume
creation time in order to avoid hitting this particular BUG_ON().

Before this fix, we were tripping the BUG_ON() by

1. Modify snapshot A, which creates blocks with a normal reference for
   snapshot A, as A is the owner of these blocks.  We now have delayed
   refs for these blocks.
2. Create a snapshot of A named B, which pushes references for the
   children blocks of the root node for the new root B, thus creating
   more delayed refs for newly allocated blocks.
3. A is modified, and because the metadata blocks can now be shared, it
   must push FULL_BACKREF references to the children of any block that A
   COWs down its path to its target key.
4. Delayed refs are run.  Because these are newly allocated blocks, we
   have ->must_insert_reserved set on the delayed ref head, we
   call into alloc_reserved_tree_block() to add the extent item, and
   then add our ref.  At the time of this fix, we were ordering
   FULL_BACKREF delayed ref operations first, so we'd go to add this
   reference and then BUG_ON() because we didn't have the FULL_BACKREF
   flag set.

The patch fixed this problem by making sure we ran the delayed refs
before we had the chance to modify A.  This meant that any *new* blocks
would have had their extent items created _before_ we would ever
actually COW down and generate FULL_BACKREF entries.  Thus the problem
went away.

However this BUG_ON() is actually completely bogus.  The existence of a
full backref doesn't necessarily mean that FULL_BACKREF must be set on
that block, it must only be set on the actual parent itself.  Consider
the example provided above.  If we COW down one path from A, any nodes
are going to have a FULL_BACKREF ref pushed down to _all_ of their
children, but not all of the children are going to have FULL_BACKREF
set.  It is completely valid to have an extent item with normal and full
backrefs without FULL_BACKREF actually set on the block itself.

As a final note, I have been testing with the patch (applied after this
one)

  btrfs: stop running all delayed refs during snapshot

which removed this flushing.  My test was a torture test which did a lot
of operations while snapshotting and deleting snapshots as well as
relocation, and I never tripped this BUG_ON().  This is actually because
at the time of 361048f586, we ordered SHARED keys _before_ normal
references, and thus they would get run first.  However currently they
are ordered _after_ normal references, so we'd do the initial creation
without having a shared reference, and thus not hit this BUG_ON(), which
explains why I didn't start hitting this problem during my testing with
my other patch applied.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Josef Bacik 2a4d84c11a btrfs: move delayed ref flushing for qgroup into qgroup helper
The commit d672633545 ("btrfs: qgroup: Make snapshot accounting work
with new extent-oriented qgroup.") added a flush of the delayed refs
during snapshot creation in order to get the qgroup accounting properly.
However this code has changed and been moved to its own helper that is
skipped if qgroups are turned off.  Move the flushing to the helper, as
we do not need it when qgroups are turned off.

Also add a comment explaining why it exists, and why it doesn't actually
save us.  This will be helpful later when we try to fix qgroup
accounting properly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Josef Bacik ad368f3394 btrfs: only run delayed refs once before committing
We try to pre-flush the delayed refs when committing, because we want to
do as little work as possible in the critical section of the transaction
commit.

However doing this twice can lead to very long transaction commit delays
as other threads are allowed to continue to generate more delayed refs,
which potentially delays the commit by multiple minutes in very extreme
cases.

So simply stick to one pre-flush, and then continue the rest of the
transaction commit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik 61a56a992f btrfs: delayed refs pre-flushing should only run the heads we have
Previously our delayed ref running used the total number of items as the
items to run.  However we changed that to number of heads to run with
the delayed_refs_rsv, as generally we want to run all of the operations
for one bytenr.

But with btrfs_run_delayed_refs(trans, 0) we set our count to 2x the
number of items that we have.  This is generally fine, but if we have
some operation generating loads of delayed refs while we're doing this
pre-flushing in the transaction commit, we'll just spin forever doing
delayed refs.

Fix this to simply pick the number of delayed refs we currently have,
that way we do not end up doing a lot of extra work that's being
generated in other threads.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik e19eb11f4f btrfs: only let one thread pre-flush delayed refs in commit
I've been running a stress test that runs 20 workers in their own
subvolume, which are running an fsstress instance with 4 threads per
worker, which is 80 total fsstress threads.  In addition to this I'm
running balance in the background as well as creating and deleting
snapshots.  This test takes around 12 hours to run normally, going
slower and slower as the test goes on.

The reason for this is because fsstress is running fsync sometimes, and
because we're messing with block groups we often fall through to
btrfs_commit_transaction, so we will often have 20-30 threads all calling
btrfs_commit_transaction at the same time.

These all get stuck contending on the extent tree while they try to run
delayed refs during the initial part of the commit.

This is suboptimal: the extent tree is a single point of contention, so
we really only want one thread acting on that tree at once to reduce
lock contention.

Fix this by making the flushing mechanism a bit operation, so that we
can use test_and_set_bit() to make sure only one task does this initial
flush.
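
A sketch of the gating (the flag and field names are illustrative):

  /*
   * Only the first task to set the bit runs the expensive pre-flush;
   * everyone else proceeds straight into the commit machinery.
   */
  if (!test_and_set_bit(BTRFS_DELAYED_REFS_FLUSHING,
                        &cur_trans->delayed_refs.flags)) {
          ret = btrfs_run_delayed_refs(trans, 0);
          if (ret)
                  return ret;
  }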

Once we're into the transaction commit we only have one thread doing
delayed ref running, it's just this initial pre-flush that is
problematic.  With this patch my stress test takes around 90 minutes to
run, instead of 12 hours.

The memory barrier is not necessary for the flushing bit as it's
ordered, unlike plain int. The transaction state accessed in
btrfs_should_end_transaction could be affected by that too as it's not
always used under the transaction lock. Per Nikolay's analysis in [1]
it's not necessary:

  In should_end_transaction it's read without holding any locks. (U)

  It's modified in btrfs_cleanup_transaction without holding the
  fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.

  set in cleanup_transaction under fs_info->trans_lock (L)
  set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
  set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
  set in btrfs_commit_trans to COMMIT_UNBLOCK under
  fs_info->trans_lock.(L)

  set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
  point the transaction is finished and fs_info->running_trans is NULL (U
  but irrelevant).

  So by the looks of it we can have a concurrent READ race with a WRITE,
  due to reads not taking a lock. In this case what we want to ensure is
  we either see new or old state. I consulted with Will Deacon and he said
  that in such a case we'd want to annotate the accesses to ->state with
  (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I
  don't think this could happen but I imagine at some point KCSAN would
  flag such an access as racy (which it is).

[1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.com

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add comments regarding memory barrier ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik ddfd08cb04 btrfs: do not block on deleted bgs mutex in the cleaner
While running some stress tests I started getting hung task messages.
This is because the delete unused block groups code has to take the
delete_unused_bgs_mutex to do its work, which is taken by balance to
make sure we don't delete block groups while we're balancing.

The problem is that balance can take a while, and so we were getting
hung task warnings.  We don't need to block and run these things, and
the cleaner is needed to do other work, so trylock on this mutex and
just bail if we can't acquire it right away.
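
The gist in code form (a sketch of the entry check in the unused block
group deletion path):

  /*
   * Do not sleep behind balance; skip this pass and let a later run of
   * the cleaner retry.
   */
  if (!mutex_trylock(&fs_info->delete_unused_bgs_mutex))
          return;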

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik 867ed321f9 btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root
While testing my error handling patches, I added an error injection site
at btrfs_inc_extent_ref, to validate the error handling I added was
doing the correct thing.  However I hit a pretty ugly corruption while
doing this check, with the following error injection stack trace:

btrfs_inc_extent_ref
  btrfs_copy_root
    create_reloc_root
      btrfs_init_reloc_root
	btrfs_record_root_in_trans
	  btrfs_start_transaction
	    btrfs_update_inode
	      btrfs_update_time
		touch_atime
		  file_accessed
		    btrfs_file_mmap

This is because we do not catch the error from btrfs_inc_extent_ref,
which in practice would be ENOMEM.  That means we lose the extent
references for a root that has already been allocated and inserted,
which is the problem.  Fix this by aborting the transaction if we fail
to do the reference modification.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik eddda68d97 btrfs: add asserts for deleting backref cache nodes
A weird KASAN problem that Zygo reported could have been easily caught
if we checked for basic things in our backref freeing code.  We have two
methods of freeing a backref node

- btrfs_backref_free_node: this is essentially just kfree().
- btrfs_backref_drop_node: this actually unlinks the node and cleans up
  everything and then calls btrfs_backref_free_node().

We should mostly be using btrfs_backref_drop_node(), to make sure the
node is properly unlinked from the backref cache, and only use
btrfs_backref_free_node() when we know the node isn't actually linked to
the backref cache.  We made a mistake here and thus got the KASAN splat.

Make this style of issue easier to find by adding some ASSERT()'s to
btrfs_backref_free_node() and adjusting our deletion code to properly
init the list, so that we can rely on list_empty() checks.
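
A simplified sketch of the strengthened free helper (reference counting
and cache bookkeeping omitted):

  static void btrfs_backref_free_node(struct btrfs_backref_cache *cache,
                                      struct btrfs_backref_node *node)
  {
          if (node) {
                  /* Catch frees of nodes still linked into the cache. */
                  ASSERT(list_empty(&node->list));
                  ASSERT(list_empty(&node->lower));
                  kfree(node);
          }
  }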

  BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
  Read of size 8 at addr ffff888112402950 by task btrfs/28836

  CPU: 0 PID: 28836 Comm: btrfs Tainted: G        W         5.10.0-e35f27394290-for-next+ #23
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  Call Trace:
   dump_stack+0xbc/0xf9
   ? btrfs_backref_cleanup_node+0x18a/0x420
   print_address_description.constprop.8+0x21/0x210
   ? record_print_text.cold.34+0x11/0x11
   ? btrfs_backref_cleanup_node+0x18a/0x420
   ? btrfs_backref_cleanup_node+0x18a/0x420
   kasan_report.cold.10+0x20/0x37
   ? btrfs_backref_cleanup_node+0x18a/0x420
   __asan_load8+0x69/0x90
   btrfs_backref_cleanup_node+0x18a/0x420
   btrfs_backref_release_cache+0x83/0x1b0
   relocate_block_group+0x394/0x780
   ? merge_reloc_roots+0x4a0/0x4a0
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   ? check_flags.part.50+0x6c/0x1e0
   ? btrfs_relocate_chunk+0x120/0x120
   ? kmem_cache_alloc_trace+0xa06/0xcb0
   ? _copy_from_user+0x83/0xc0
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   ? __kasan_check_read+0x11/0x20
   ? check_chain_key+0x1f4/0x2f0
   ? __asan_loadN+0xf/0x20
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? kvm_sched_clock_read+0x18/0x30
   ? check_chain_key+0x1f4/0x2f0
   ? lock_downgrade+0x3f0/0x3f0
   ? handle_mm_fault+0xad6/0x2150
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags+0x26/0x30
   ? lock_is_held_type+0xc3/0xf0
   ? syscall_enter_from_user_mode+0x1b/0x60
   ? do_syscall_64+0x13/0x80
   ? rcu_read_lock_sched_held+0xa1/0xd0
   ? __kasan_check_read+0x11/0x20
   ? __fget_light+0xae/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f4c4bdfe427
  RSP: 002b:00007fff33ee6df8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fff33ee6e98 RCX: 00007f4c4bdfe427
  RDX: 00007fff33ee6e98 RSI: 00000000c4009420 RDI: 0000000000000003
  RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
  R10: fffffffffffff59d R11: 0000000000000202 R12: 0000000000000001
  R13: 0000000000000000 R14: 00007fff33ee8a34 R15: 0000000000000001

  Allocated by task 28836:
   kasan_save_stack+0x21/0x50
   __kasan_kmalloc.constprop.18+0xbe/0xd0
   kasan_kmalloc+0x9/0x10
   kmem_cache_alloc_trace+0x410/0xcb0
   btrfs_backref_alloc_node+0x46/0xf0
   btrfs_backref_add_tree_node+0x60d/0x11d0
   build_backref_tree+0xc5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 28836:
   kasan_save_stack+0x21/0x50
   kasan_set_track+0x20/0x30
   kasan_set_free_info+0x1f/0x30
   __kasan_slab_free+0xf3/0x140
   kasan_slab_free+0xe/0x10
   kfree+0xde/0x200
   btrfs_backref_error_cleanup+0x452/0x530
   build_backref_tree+0x1a5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff888112402900
   which belongs to the cache kmalloc-128 of size 128
  The buggy address is located 80 bytes inside of
   128-byte region [ffff888112402900, ffff888112402980)
  The buggy address belongs to the page:
  page:0000000028b1cd08 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888131c810c0 pfn:0x112402
  flags: 0x17ffe0000000200(slab)
  raw: 017ffe0000000200 ffffea000424f308 ffffea0007d572c8 ffff888100040440
  raw: ffff888131c810c0 ffff888112402000 0000000100000009 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888112402800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff888112402880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  >ffff888112402900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                   ^
   ffff888112402980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   ffff888112402a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Link: https://lore.kernel.org/linux-btrfs/20201208194607.GI31381@hungrycats.org/
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik f78743fbda btrfs: do not warn if we can't find the reloc root when looking up backref
The backref code is looking for a reloc_root that corresponds to the
given fs root.  However any number of things could have gone wrong while
initializing that reloc_root, like ENOMEM while trying to allocate the
root itself, or EIO while trying to write the root item.  This would
result in no corresponding reloc_root being in the reloc root cache, and
thus would return NULL when we do the find_reloc_root() call.

Because of this we do not want to WARN_ON().  This presumably was meant
to catch developer errors, cases where we messed up adding the reloc
root.  However we can easily hit this case with error injection, and
thus should not do a WARN_ON().

CC: stable@vger.kernel.org # 5.10+
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik 938fcbfb0c btrfs: splice remaining dirty_bg's onto the transaction dirty bg list
While doing error injection testing with my relocation patches I hit the
following assert:

  assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3357!
  invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 0 PID: 24351 Comm: umount Tainted: G        W         5.10.0-rc3+ #193
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  RIP: 0010:assertfail.constprop.0+0x18/0x1a
  RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282
  RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000
  RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050
  RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000
  R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100
  R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100
  FS:  00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0
  Call Trace:
   btrfs_free_block_groups.cold+0x55/0x55
   close_ctree+0x2c5/0x306
   ? fsnotify_destroy_marks+0x14/0x100
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20
   deactivate_locked_super+0x36/0xa0
   cleanup_mnt+0x12d/0x190
   task_work_run+0x5c/0xa0
   exit_to_user_mode_prepare+0x1b1/0x1d0
   syscall_exit_to_user_mode+0x54/0x280
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happened because I injected an error in btrfs_cow_block() while
running the dirty block groups.  When we run the dirty block groups, we
splice the list onto a local list to process.  However if an error
occurs, we only clean up the transaction's dirty block group list, not any
pending block groups we have on our locally spliced list.

In fact if we fail to allocate a path in this function we'll also fail
to clean up the splice list.

Fix this by splicing the list back onto the transaction's dirty block
group list so that the block groups are cleaned up.  Then add an 'out'
label and have the error conditions jump to out so that the errors are
handled properly.  This also has the side-effect of fixing a problem
where we would clear 'ret' on error because we unconditionally ran
btrfs_run_delayed_refs().

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Josef Bacik c78a10aebb btrfs: fix reloc root leak with 0 ref reloc roots on recovery
When recovering a relocation, if we run into a reloc root that has 0
refs we simply add it to the reloc_control->reloc_roots list, and then
clean it up later.  The problem with this is __del_reloc_root() doesn't
do anything if the root isn't in the radix tree, which in this case it
won't be because we never call __add_reloc_root() on the reloc_root.

This exit condition simply isn't correct really.  During normal
operation we can remove ourselves from the rb tree and then we're meant
to clean up later at merge_reloc_roots() time, and this happens
correctly.  During recovery we're depending on free_reloc_roots() to
drop our references, but we're short-circuiting.

Fix this by continuing to check if we're on the list and dropping
ourselves from the reloc_control root list and dropping our reference
appropriately.  Change the corresponding BUG_ON() to an ASSERT() that
does the correct thing if we aren't in the rb tree.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Nigel Christian 2e626e5673 btrfs: remove repeated word in struct member comment
The comment for the processed extent end of range has an unnecessary
"in"; remove it.

Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Josef Bacik 81e75ac74e btrfs: account for new extents being deleted in total_bytes_pinned
My recent patch set "A variety of lock contention fixes", found here

https://lore.kernel.org/linux-btrfs/cover.1608319304.git.josef@toxicpanda.com/
(Tracked in https://github.com/btrfs/linux/issues/86)

which reduces lock contention on the extent root by running delayed refs
less often, resulted in a regression in generic/371.  This test
fallocate()'s the fs until it's full, deletes all the files, and then
tries to fallocate() until full again.

Before these patches we would run all of the delayed refs during
flushing, and then would commit the transaction because we had plenty of
pinned space to recover in order to allocate.  However my patches made
it so we weren't running the delayed refs as aggressively, which meant
that we appeared to have less pinned space when we were deciding to
commit the transaction.

We use the space_info->total_bytes_pinned to approximate how much space
we have pinned.  It's approximate because if we remove a reference to an
extent we may free it, but there may be more references to it than we
know of at that point; we account it as pinned at creation time, and
then it's properly accounted for when the delayed ref runs.

The way we account for pinned space is if the
delayed_ref_head->total_ref_mod is < 0, because that is clearly a
freeing option.  However there is another case, and that is where
->total_ref_mod == 0 && ->must_insert_reserved == 1.

When we allocate a new extent, we have ->total_ref_mod == 1 and we have
->must_insert_reserved == 1.  This is used to indicate that it is a
brand new extent and will need to have its extent entry added before we
modify any references on the delayed ref head.  But if we subsequently
remove that extent reference, our ->total_ref_mod will be 0, and that
space will be pinned and freed.  Accounting for this case properly
allows for generic/371 to pass with my delayed refs patches applied.
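
In code terms the accounting condition becomes roughly the following
(a sketch; the counter and batch constant follow the names used in the
space-info handling, and num_bytes is assumed to be in scope):

  /*
   * The head frees space either when its refs net out negative, or
   * when a brand new extent had all of its references dropped again.
   */
  if (head->total_ref_mod < 0 ||
      (head->total_ref_mod == 0 && head->must_insert_reserved))
          percpu_counter_add_batch(&space_info->total_bytes_pinned,
                                   num_bytes,
                                   BTRFS_TOTAL_BYTES_PINNED_BATCH);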

It's important to note that this problem exists without the referenced
patches, it just was uncovered by them.

CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Josef Bacik 2187374f35 btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself
Currently we pass things around to figure out if we maybe freeing data
based on the state of the delayed refs head.  This makes the accounting
sort of confusing and hard to follow, as it's distinctly separate from
the delayed ref heads stuff, but also depends on it entirely.

Fix this by explicitly adjusting the space_info->total_bytes_pinned in
the delayed refs code.  We now have two places where we modify this
counter, once where we create the delayed and destroy the delayed refs,
and once when we pin and unpin the extents.  This means there is a
slight overlap between delayed refs and the pin/unpin mechanisms, but
this is simply used by the ENOSPC infrastructure to determine if we need
to commit the transaction, so there's no adverse effect from this; we
might simply commit thinking it will give us enough space when it might
not.

CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Nikolay Borisov e9aa7c285d btrfs: enable W=1 checks for btrfs
Now that the btrfs codebase is clean of almost all W=1 warnings, let's
enable those checks unconditionally for the entire fs/btrfs/ and its
subdirectories to catch potential errors during development.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Nikolay Borisov 8c31a3dbaa btrfs: zoned: remove unused variable in btrfs_sb_log_location_bdev
This fixes the warning:

fs/btrfs/zoned.c:491:6: warning: variable ‘zone_size’ set but not used [-Wunused-but-set-variable]
  491 |  u64 zone_size;

which got introduced in 12659251ca ("btrfs: implement log-structured
superblock for ZONED mode"). We'll enable the warning by default and
want a clean build until the relevant zoned patches land.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov 3bed2da1b0 btrfs: fix parameter description for functions in extent_io.c
This makes the file W=1 clean and fixes the following warnings:

fs/btrfs/extent_io.c:414: warning: Function parameter or member 'tree' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'offset' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'next_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'prev_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'p_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'parent_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'tree' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start_ret' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'end_ret' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'bits' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'tree' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start_ret' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'end_ret' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'bits' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:4187: warning: Function parameter or member 'epd' not described in 'extent_write_cache_pages'
fs/btrfs/extent_io.c:4187: warning: Excess function parameter 'data' description in 'extent_write_cache_pages'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov d98b188ea4 btrfs: fix parameter description in space-info.c
With these fixes space-info.c is clean of W=1 warnings; the
following ones are fixed:

fs/btrfs/space-info.c:575: warning: Function parameter or member 'fs_info' not described in 'may_commit_transaction'
fs/btrfs/space-info.c:575: warning: Function parameter or member 'space_info' not described in 'may_commit_transaction'
fs/btrfs/space-info.c:1231: warning: Function parameter or member 'fs_info' not described in 'handle_reserve_ticket'
fs/btrfs/space-info.c:1231: warning: Function parameter or member 'space_info' not described in 'handle_reserve_ticket'
fs/btrfs/space-info.c:1231: warning: Function parameter or member 'ticket' not described in 'handle_reserve_ticket'
fs/btrfs/space-info.c:1231: warning: Function parameter or member 'flush' not described in 'handle_reserve_ticket'
fs/btrfs/space-info.c:1315: warning: Function parameter or member 'fs_info' not described in '__reserve_bytes'
fs/btrfs/space-info.c:1315: warning: Function parameter or member 'space_info' not described in '__reserve_bytes'
fs/btrfs/space-info.c:1315: warning: Function parameter or member 'orig_bytes' not described in '__reserve_bytes'
fs/btrfs/space-info.c:1315: warning: Function parameter or member 'flush' not described in '__reserve_bytes'
fs/btrfs/space-info.c:1427: warning: Function parameter or member 'root' not described in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1427: warning: Function parameter or member 'block_rsv' not described in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1427: warning: Function parameter or member 'orig_bytes' not described in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1427: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1462: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_data_bytes'
fs/btrfs/space-info.c:1462: warning: Function parameter or member 'bytes' not described in 'btrfs_reserve_data_bytes'
fs/btrfs/space-info.c:1462: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_data_bytes'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov b762d1d08d btrfs: fix parameter description of btrfs_inode_rsv_release/btrfs_delalloc_release_space
Fixes the following warnings:

fs/btrfs/delalloc-space.c:205: warning: Function parameter or member 'inode' not described in 'btrfs_inode_rsv_release'
fs/btrfs/delalloc-space.c:205: warning: Function parameter or member 'qgroup_free' not described in 'btrfs_inode_rsv_release'
fs/btrfs/delalloc-space.c:472: warning: Function parameter or member 'reserved' not described in 'btrfs_delalloc_release_space'
fs/btrfs/delalloc-space.c:472: warning: Function parameter or member 'qgroup_free' not described in 'btrfs_delalloc_release_space'
fs/btrfs/delalloc-space.c:472: warning: Excess function parameter 'release_bytes' description in 'btrfs_delalloc_release_space'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov 6e353e3b3c btrfs: document btrfs_check_shared parameters
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov 2639631d34 btrfs: fix description format of fs_info of btrfs_wait_on_delayed_iputs
Fixes fs/btrfs/inode.c:3101: warning: Function parameter or member 'fs_info' not described in 'btrfs_wait_on_delayed_iputs'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov 9ee9b97990 btrfs: document fs_info in btrfs_rmap_block
Fixes fs/btrfs/block-group.c:1570: warning: Function parameter or member 'fs_info' not described in 'btrfs_rmap_block'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov 9241969547 btrfs: document now parameter of peek_discard_list
Fixes fs/btrfs/discard.c:203: warning: Function parameter or member 'now' not described in 'peek_discard_list'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov f092cf3cfd btrfs: improve parameter description for __btrfs_write_out_cache
Fixes the following W=1 warnings:
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'root' not described in '__btrfs_write_out_cache'
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'inode' not described in '__btrfs_write_out_cache'
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'ctl' not described in '__btrfs_write_out_cache'
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'block_group' not described in '__btrfs_write_out_cache'
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'io_ctl' not described in '__btrfs_write_out_cache'
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'trans' not described in '__btrfs_write_out_cache'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov 696eb22b67 btrfs: fix parameter description in delayed-ref.c functions
This fixes the following warnings:

fs/btrfs/delayed-ref.c:80: warning: Function parameter or member 'fs_info' not described in 'btrfs_delayed_refs_rsv_release'
fs/btrfs/delayed-ref.c:80: warning: Function parameter or member 'nr' not described in 'btrfs_delayed_refs_rsv_release'
fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'fs_info' not described in 'btrfs_migrate_to_delayed_refs_rsv'
fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'src' not described in 'btrfs_migrate_to_delayed_refs_rsv'
fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'num_bytes' not described in 'btrfs_migrate_to_delayed_refs_rsv'
fs/btrfs/delayed-ref.c:174: warning: Function parameter or member 'fs_info' not described in 'btrfs_delayed_refs_rsv_refill'
fs/btrfs/delayed-ref.c:174: warning: Function parameter or member 'flush' not described in 'btrfs_delayed_refs_rsv_refill'

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov ca4207ae13 btrfs: fix function description formats in file-item.c
This fixes the following W=1 warnings:

fs/btrfs/file-item.c:27: warning: Cannot understand  * @inode:  the inode we want to update the disk_i_size for
 on line 27 - I thought it was a doc line
fs/btrfs/file-item.c:65: warning: Cannot understand  * @inode - the inode we're modifying
 on line 65 - I thought it was a doc line
fs/btrfs/file-item.c:91: warning: Cannot understand  * @inode - the inode we're modifying
 on line 91 - I thought it was a doc line

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov 9ad37bb3ff btrfs: fix parameter description of btrfs_add_extent_mapping
This fixes the following compiler warnings:

fs/btrfs/extent_map.c:601: warning: Function parameter or member 'fs_info' not described in 'btrfs_add_extent_mapping'
fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_tree' not described in 'btrfs_add_extent_mapping'
fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_in' not described in 'btrfs_add_extent_mapping'
fs/btrfs/extent_map.c:601: warning: Function parameter or member 'start' not described in 'btrfs_add_extent_mapping'
fs/btrfs/extent_map.c:601: warning: Function parameter or member 'len' not described in 'btrfs_add_extent_mapping'

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov 401bd2dd12 btrfs: document modified parameter of add_extent_mapping
Fixes fs/btrfs/extent_map.c:399: warning: Function parameter or member
'modified' not described in 'add_extent_mapping'

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Qu Wenruo 3c198fe064 btrfs: rework the order of btrfs_ordered_extent::flags
[BUG]
There is a long-standing bug in the last parameter of
btrfs_add_ordered_extent(), dating back to commit 771ed689d2 ("Btrfs:
Optimize compressed writeback and reads") from 2008.

In that ancient commit btrfs_add_ordered_extent() expects the @type
parameter to be one of the following:

- BTRFS_ORDERED_REGULAR
- BTRFS_ORDERED_NOCOW
- BTRFS_ORDERED_PREALLOC
- BTRFS_ORDERED_COMPRESSED

But we pass 0 in cow_file_range(), which means BTRFS_ORDERED_IO_DONE.

Ironically, the extra check in __btrfs_add_ordered_extent() won't set
the bit if we see (type == IO_DONE || type == IO_COMPLETE), which avoids
any obvious bug.

But this still leads to regular COW ordered extent having no bit to
indicate its type in various trace events, rendering REGULAR bit
useless.

[FIX]
Change the following aspects to avoid such problems:

- Reorder btrfs_ordered_extent::flags
  Now the type bits go first (REGULAR/NOCOW/PREALLOC/COMPRESSED), then
  the DIRECT bit, and finally the extra status bits like
  IO_DONE/COMPLETE/IOERR.

- Add extra ASSERT() for btrfs_add_ordered_extent_*()

- Remove @type parameter for btrfs_add_ordered_extent_compress()
  As the only valid @type here is BTRFS_ORDERED_COMPRESSED.

- Remove the unnecessary special check for IO_DONE/COMPLETE in
  __btrfs_add_ordered_extent()
  It was only there to make the code work; with the extra ASSERT(),
  only a limited set of values can be passed in (see the sketch below).
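
A sketch of the resulting layout and check (enumerator order as
described above; the trailing status bits are abbreviated):

  enum {
          /* Type bits, set once when the ordered extent is created. */
          BTRFS_ORDERED_REGULAR,
          BTRFS_ORDERED_NOCOW,
          BTRFS_ORDERED_PREALLOC,
          BTRFS_ORDERED_COMPRESSED,
          /* Then the DIRECT bit. */
          BTRFS_ORDERED_DIRECT,
          /* Finally the extra status bits. */
          BTRFS_ORDERED_IO_DONE,
          BTRFS_ORDERED_COMPLETE,
          BTRFS_ORDERED_IOERR,
  };

  /* The new ASSERT() only admits the four type bits. */
  ASSERT(type == BTRFS_ORDERED_REGULAR ||
         type == BTRFS_ORDERED_NOCOW ||
         type == BTRFS_ORDERED_PREALLOC ||
         type == BTRFS_ORDERED_COMPRESSED);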

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Yang Li fe3b7bb085 btrfs: remove redundant NULL check before kvfree
Fix the below warning reported by coccicheck:
./fs/btrfs/raid56.c:237:2-8: WARNING: NULL check before some freeing
functions is not needed.
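
The transformation, as a sketch (the pointer name is illustrative):

  /* Before: redundant NULL check. */
  if (pointers)
          kvfree(pointers);

  /* After: kvfree(NULL) is a no-op, so call it unconditionally. */
  kvfree(pointers);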

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Yang Li <abaci-bugfix@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Josef Bacik 7e2a870a59 btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node
Zygo reported the following panic when testing my error handling patches
for relocation:

  kernel BUG at fs/btrfs/backref.c:2545!
  invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 8472 Comm: btrfs Tainted: G        W 14
  Hardware name: QEMU Standard PC (i440FX + PIIX,

  Call Trace:
   btrfs_backref_error_cleanup+0x4df/0x530
   build_backref_tree+0x1a5/0x700
   ? _raw_spin_unlock+0x22/0x30
   ? release_extent_buffer+0x225/0x280
   ? free_extent_buffer.part.52+0xd7/0x140
   relocate_tree_blocks+0x2a6/0xb60
   ? kasan_unpoison_shadow+0x35/0x50
   ? do_relocation+0xc10/0xc10
   ? kasan_kmalloc+0x9/0x10
   ? kmem_cache_alloc_trace+0x6a3/0xcb0
   ? free_extent_buffer.part.52+0xd7/0x140
   ? rb_insert_color+0x342/0x360
   ? add_tree_block.isra.36+0x236/0x2b0
   relocate_block_group+0x2eb/0x780
   ? merge_reloc_roots+0x470/0x470
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x18f0
   ? pvclock_clocksource_read+0xeb/0x190
   ? btrfs_relocate_chunk+0x120/0x120
   ? lock_contended+0x620/0x6e0
   ? do_raw_spin_lock+0x1e0/0x1e0
   ? do_raw_spin_unlock+0xa8/0x140
   btrfs_ioctl_balance+0x1f9/0x460
   btrfs_ioctl+0x24c8/0x4380
   ? __kasan_check_read+0x11/0x20
   ? check_chain_key+0x1f4/0x2f0
   ? __asan_loadN+0xf/0x20
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? kvm_sched_clock_read+0x18/0x30
   ? check_chain_key+0x1f4/0x2f0
   ? lock_downgrade+0x3f0/0x3f0
   ? handle_mm_fault+0xad6/0x2150
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags+0x26/0x30
   ? lock_is_held_type+0xc3/0xf0
   ? syscall_enter_from_user_mode+0x1b/0x60
   ? do_syscall_64+0x13/0x80
   ? rcu_read_lock_sched_held+0xa1/0xd0
   ? __kasan_check_read+0x11/0x20
   ? __fget_light+0xae/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This occurs because of this check

  if (RB_EMPTY_NODE(&upper->rb_node))
	  BUG_ON(!list_empty(&node->upper));

As we are dropping the backref node, if we discover that the upper node
in the edge we just cleaned up isn't linked into the cache, we assume
that we are done with this node, hence the BUG_ON().

However this is an erroneous assumption, as we will look up all the
references for a node first, and then process the pending edges.  All of
the 'upper' nodes in our pending edges won't be in the cache's rb_tree
yet, because they haven't been processed.  We could very well have many
edges still left to cleanup on this node.

The fact is we simply do not need this check; we can just process all of
the edges only for this node, because below this check we do the
following:

  if (list_empty(&upper->lower)) {
	  list_add_tail(&upper->lower, &cache->leaves);
	  upper->lowest = 1;
  }

If the upper node truly isn't used yet, then we add it to the
cache->leaves list to be cleaned up later.  If it is still used then the
last child node that has it linked into its node will add it to the
leaves list and then it will be cleaned up.

Fix this problem by dropping this logic altogether.  With this fix I no
longer see the panic when testing with error injection in the backref
code.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Josef Bacik f7ba2d3751 btrfs: keep track of the root owner for relocation reads
While testing the error paths in relocation, I hit the following lockdep
splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.10.0-rc3+ #206 Not tainted
  ------------------------------------------------------
  btrfs-balance/1571 is trying to acquire lock:
  ffff8cdbcc8f77d0 (&head_ref->mutex){+.+.}-{3:3}, at: btrfs_lookup_extent_info+0x156/0x3b0

  but task is already holding lock:
  ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-tree-00){++++}-{3:3}:
	 down_write_nested+0x43/0x80
	 __btrfs_tree_lock+0x27/0x100
	 btrfs_search_slot+0x248/0x890
	 relocate_tree_blocks+0x490/0x650
	 relocate_block_group+0x1ba/0x5d0
	 kretprobe_trampoline+0x0/0x50

  -> #1 (btrfs-csum-01){++++}-{3:3}:
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x27/0x100
	 btrfs_read_lock_root_node+0x31/0x40
	 btrfs_search_slot+0x5ab/0x890
	 btrfs_del_csums+0x10b/0x3c0
	 __btrfs_free_extent+0x49d/0x8e0
	 __btrfs_run_delayed_refs+0x283/0x11f0
	 btrfs_run_delayed_refs+0x86/0x220
	 btrfs_start_dirty_block_groups+0x2ba/0x520
	 kretprobe_trampoline+0x0/0x50

  -> #0 (&head_ref->mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1167/0x2150
	 lock_acquire+0x116/0x3e0
	 __mutex_lock+0x7e/0x7b0
	 btrfs_lookup_extent_info+0x156/0x3b0
	 walk_down_proc+0x1c3/0x280
	 walk_down_tree+0x64/0xe0
	 btrfs_drop_subtree+0x182/0x260
	 do_relocation+0x52e/0x660
	 relocate_tree_blocks+0x2ae/0x650
	 relocate_block_group+0x1ba/0x5d0
	 kretprobe_trampoline+0x0/0x50

  other info that might help us debug this:

  Chain exists of:
    &head_ref->mutex --> btrfs-csum-01 --> btrfs-tree-00

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-tree-00);
				 lock(btrfs-csum-01);
				 lock(btrfs-tree-00);
    lock(&head_ref->mutex);

   *** DEADLOCK ***

  5 locks held by btrfs-balance/1571:
   #0: ffff8cdb89749ff8 (&fs_info->delete_unused_bgs_mutex){+.+.}-{3:3}, at: btrfs_balance+0x563/0xf40
   #1: ffff8cdb89748838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x156/0x300
   #2: ffff8cdbc2c16650 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x413/0x5c0
   #3: ffff8cdbc135f538 (btrfs-treloc-01){+.+.}-{3:3}, at: __btrfs_tree_lock+0x27/0x100
   #4: ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100

  stack backtrace:
  CPU: 1 PID: 1571 Comm: btrfs-balance Not tainted 5.10.0-rc3+ #206
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   ? trace_call_bpf+0x139/0x260
   __lock_acquire+0x1167/0x2150
   lock_acquire+0x116/0x3e0
   ? btrfs_lookup_extent_info+0x156/0x3b0
   __mutex_lock+0x7e/0x7b0
   ? btrfs_lookup_extent_info+0x156/0x3b0
   ? btrfs_lookup_extent_info+0x156/0x3b0
   ? release_extent_buffer+0x124/0x170
   ? _raw_spin_unlock+0x1f/0x30
   ? release_extent_buffer+0x124/0x170
   btrfs_lookup_extent_info+0x156/0x3b0
   walk_down_proc+0x1c3/0x280
   walk_down_tree+0x64/0xe0
   btrfs_drop_subtree+0x182/0x260
   do_relocation+0x52e/0x660
   relocate_tree_blocks+0x2ae/0x650
   ? add_tree_block+0x149/0x1b0
   relocate_block_group+0x1ba/0x5d0
   elfcorehdr_read+0x40/0x40
   ? elfcorehdr_read+0x40/0x40
   ? btrfs_balance+0x796/0xf40
   ? __kthread_parkme+0x66/0x90
   ? btrfs_balance+0xf40/0xf40
   ? balance_kthread+0x37/0x50
   ? kthread+0x137/0x150
   ? __kthread_bind_mask+0x60/0x60
   ? ret_from_fork+0x1f/0x30

As you can see this is bogus: we never take another tree's lock under
the csum lock.  This happens because sometimes we have to read tree
blocks from disk without knowing which root they belong to during
relocation.  We defaulted to an owner of 0, which translates to an fs
tree.  This is fine as all fs trees have the same class, but obviously
isn't fine if the block belongs to a COW only tree.

Thankfully COW only trees only ever have their owning root as a
reference to them, and since we already look up the extent information
during
relocation, go ahead and check and see if this block might belong to a
COW only tree, and if so save the owner in the tree_block struct.  This
allows us to read_tree_block with the proper owner, which gets rid of
this lockdep splat.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Qu Wenruo c0f0a9e716 btrfs: introduce helper to grab an existing extent buffer from a page
This patch will extract the code to grab an extent buffer from a page
into a helper, grab_extent_buffer_from_page().

This reduces the indentation by one level, and provides the groundwork
for later expansion for subpage support.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Qu Wenruo c0fab48095 btrfs: update comment for btrfs_dirty_pages
The original comment is from the initial merge, which has several
problems:

- No holes check any more
- No inline decision is made

Update the out-of-date comment with a more accurate one.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Qu Wenruo 6bc5636a67 btrfs: refactor __extent_writepage_io() to improve readability
The refactoring involves the following modifications:

- iosize alignment
  In fact we don't really need to manually do alignment at all.
  All extent maps should already be aligned, thus basic ASSERT() check
  would be enough.

- redundant variables
  We have extra variable like blocksize/pg_offset/end.
  They are all unnecessary.

  @blocksize can be replaced by sectorsize directly, and it's only
  used to verify that the em start/size is aligned.

  @pg_offset can be easily calculated using @cur and page_offset(page).

  @end is just assigned from @page_end and never modified; use
  "start + PAGE_SIZE - 1" directly and remove @page_end.

- remove some BUG_ON()s
  The BUG_ON()s are for extent maps, which the tree-checker already
  validates via the on-disk extent data items, along with runtime
  checks.  ASSERT() should be enough (see the sketch below).
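
A sketch of the replacement checks (names follow the description above):

  /* Extent maps are expected to be sectorsize aligned already. */
  ASSERT(IS_ALIGNED(em->start, fs_info->sectorsize));
  ASSERT(IS_ALIGNED(em->len, fs_info->sectorsize));

  /* @pg_offset is derived on the fly instead of being carried around. */
  pg_offset = cur - page_offset(page);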

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Qu Wenruo 0c64c33c60 btrfs: rename parameter offset to disk_bytenr in submit_extent_page
The parameter offset is confusing: it's supposed to be the disk bytenr
of metadata/data.  Rename it to disk_bytenr and update the comment.

Also rename each offset passed to submit_extent_page() as @disk_bytenr
so they're consistent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Qu Wenruo 58f74b2203 btrfs: refactor btrfs_dec_test_* functions for ordered extents
The refactoring involves the following modifications:

- Return bool instead of int

- Parameter update for @cached of btrfs_dec_test_first_ordered_pending()
  For btrfs_dec_test_first_ordered_pending(), @cached is only used to
  return the finished ordered extent.
  Rename it to @finished_ret.

- Comment updates

  * Change one stale comment
    Which still refers to btrfs_dec_test_ordered_pending(), but the
    context is calling btrfs_dec_test_first_ordered_pending().
  * Follow the common comment style for both functions
    Add more detailed descriptions for parameters and the return value
  * Move the reason why test_and_set_bit() is used into the call sites

- Change how the return value is calculated
  The most confusing part of the return value handling is:

    if (...)
	ret = 1;
    ...
    return ret == 0;

  This means, when we set ret to 1, the function returns 0.
  Change the local variable name to @finished, and directly return its
  value (see the sketch below).
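
A minimal sketch of the new shape (the condition is a placeholder for
the real completion check):

  bool finished = false;

  if (all_bytes_accounted) /* placeholder condition */
          finished = true;

  return finished;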

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Qu Wenruo 523929f1ca btrfs: make btrfs_dio_private::bytes u32
btrfs_dio_private::bytes is only assigned from bio::bi_iter::bi_size,
which never exceeds U32_MAX.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Nikolay Borisov d7830b7155 btrfs: remove always true condition in btrfs_start_delalloc_roots
Following the rework in e076ab2a2c ("btrfs: shrink delalloc pages
instead of full inodes") the nr variable is no longer passed by
reference to start_delalloc_inodes, hence it cannot change. Additionally
we are always guaranteed that it is a positive number, hence it's
redundant to have it as a condition in the loop. Simply remove that
usage.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Nikolay Borisov 9db4dc241e btrfs: make btrfs_start_delalloc_root's nr argument a long
It's currently u64, which gets instantly translated either to LONG_MAX
(if U64_MAX is passed) or cast to an unsigned long (which is in fact
wrong, because writeback_control::nr_to_write is a signed long type).

Just convert the function's argument to be long, which obviates the
need to manually convert the u64 value to a long. Adjust all call sites
which pass U64_MAX to pass LONG_MAX. Finally, ensure that in
shrink_delalloc the u64 is converted to a long without overflowing into
a negative number.
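
One safe way to do the final conversion, as a sketch (variable names are
illustrative; min_t() is the kernel's typed min helper):

  /* Clamp the u64 count so the cast to long cannot go negative. */
  long nr_to_write = min_t(u64, items, LONG_MAX);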

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Filipe Manana 9c4a062a94 btrfs: send: remove stale code when checking for shared extents
After commit 040ee6120c ("Btrfs: send, improve clone range") we no
longer use the data_offset field of struct backref_ctx, as after that we
do all the necessary checks for the data offset of file extent items at
clone_range(). Since there are no more users of data_offset from that
structure, remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Nikolay Borisov 7056bf69e5 btrfs: consolidate btrfs_previous_item ret val handling in btrfs_shrink_device
Instead of having three 'if' statements to handle a non-zero return
value, consolidate them into one 'if (ret)'. That way the code is more
obvious:

 - Always drop delete_unused_bgs_mutex if ret is non-zero
 - If ret is negative -> goto done
 - If it's 1 -> reset ret to 0, release the path and finish the loop
   (see the sketch below).
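
A sketch of the consolidated handling (arguments and labels shown are
illustrative, not the verbatim hunk):

  ret = btrfs_previous_item(root, path, 0, BTRFS_DEV_EXTENT_KEY);
  if (ret) {
          mutex_unlock(&fs_info->delete_unused_bgs_mutex);
          if (ret < 0)
                  goto done;
          /* ret == 1: no previous item, finish the loop. */
          ret = 0;
          btrfs_release_path(path);
          break;
  }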

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Josef Bacik 1478143ac8 btrfs: ref-verify: make sure owner is set for all refs
I noticed that shared ref entries in ref-verify didn't have the proper
owner set, which caused me to think there was something seriously wrong.
However the problem is that if we have a parent we simply weren't filling out
the owner part of the reference, even though we have it.

Fix this by making sure we set all the proper fields when we modify a
reference; this way we'll have the proper owner if a problem happens and
we don't waste time thinking we're updating the wrong level.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Josef Bacik 0d73a11c62 btrfs: ref-verify: pass down tree block level when building refs
I noticed that sometimes I would have the wrong level printed out with
ref-verify while testing some error injection related problems.  This is
because we only get the level from the main extent item, but our
references could go off the current leaf into another, and at that point
we lose our level.

Fix this by keeping track of the last tree block level that we found,
the same way we keep track of our bytenr and num_bytes, in case we
happen to wander into another leaf while still processing the references
for a bytenr.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Josef Bacik 1fec12a560 btrfs: noinline btrfs_should_cancel_balance
I was attempting to reproduce a problem that Zygo hit, but my error
injection wasn't firing for a few of the common calls to
btrfs_should_cancel_balance.  This is because the compiler decided to
inline it at these spots.  Keep this from happening by explicitly
marking the function as noinline so that error injection will always
work.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Josef Bacik f75e2b79b5 btrfs: allow error injection for btrfs_search_slot and btrfs_cow_block
The following patches are going to address error handling in relocation,
in order to test those patches I need to be able to inject errors in
btrfs_search_slot and btrfs_cow_block, as we call both of these pretty
often in different cases during relocation.
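
A sketch of how such tagging is done with the kernel's
ALLOW_ERROR_INJECTION() macro (from <linux/error-injection.h>):

  ALLOW_ERROR_INJECTION(btrfs_search_slot, ERRNO);
  ALLOW_ERROR_INJECTION(btrfs_cow_block, ERRNO);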

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Nikolay Borisov 69948022c9 btrfs: remove new_dirid argument from btrfs_create_subvol_root
It's no longer used. While at it also remove new_dirid in create_subvol
as it's used in a single place and open code it. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Nikolay Borisov 23125104d8 btrfs: make btrfs_root::free_objectid hold the next available objectid
Adjust the way free_objectid is being initialized: it now stores
BTRFS_FIRST_FREE_OBJECTID rather than the, somewhat arbitrary,
BTRFS_FIRST_FREE_OBJECTID - 1. This change also has the added benefit
that now it becomes unnecessary to explicitly initialize free_objectid
for a newly created fs root.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Nikolay Borisov 6b8fad576a btrfs: rename btrfs_root::highest_objectid to free_objectid
This reflects the true purpose of the member as it's being used solely
in context where a new objectid is being allocated. Future changes will
also change the way it's being used to closely follow this semantics.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Nikolay Borisov 543068a217 btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectid
This better reflects the semantics of the function, i.e. no search is
performed whatsoever.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Nikolay Borisov 453e487386 btrfs: rename btrfs_find_highest_objectid to btrfs_init_root_free_objectid
This function is used to initialize the in-memory
btrfs_root::highest_objectid member, which is used to get an available
objectid. Rename it to better reflect its semantics.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Nikolay Borisov 149716570b btrfs: cleanup local variables in btrfs_file_write_iter
First replace all inode instances with a pointer to btrfs_inode. This
removes multiple invocations of the BTRFS_I macro. Subsequently remove
two local variables, as each is used only once, and simply refer to
their values directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Zhihao Cheng 3cc64e7ebf btrfs: clarify error returns values in __load_free_space_cache
The return value in __load_free_space_cache is not properly set after
(unlikely) memory allocation failures, and 0 is returned instead.
This is not a problem for the caller load_free_space_cache, because only
the value 1 is considered as 'cache loaded', but for clarity it's better
to set the errors accordingly (see the sketch below).
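
A sketch of the kind of fix (the allocation shown is illustrative, not
the exact hunk):

  e = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
  if (!e) {
          ret = -ENOMEM; /* previously left at 0 */
          goto free_cache;
  }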

Fixes: a67509c300 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Josef Bacik 4f4317c13a btrfs: fix error handling in commit_fs_roots
While doing error injection I would sometimes get a corrupt file system.
This is because I was injecting errors at btrfs_search_slot, but would
only do it one time per stack.  This uncovered a problem in
commit_fs_roots, where if we get an error we would just break.  However
we're in a nested loop, the first loop being a loop to find all the
dirty fs roots, and then subsequent root updates would succeed clearing
the error value.

This isn't likely to happen in real scenarios, however we could
potentially get a random ENOMEM once and then not again, and we'd end up
with a corrupted file system.  Fix this by moving the error checking
around a bit to the main loop, as this is the only place where something
will fail, and return the error as soon as it occurs.

With this patch my reproducer no longer corrupts the file system.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Jaegeuk Kim d50dfc0c7d f2fs: don't grab superblock freeze for flush/ckpt thread
These are controlled by f2fs_freeze().

This fixes xfstests/generic/068, which gets stuck at

 task:f2fs_ckpt-252:3 state:D stack:    0 pid: 5761 ppid:     2 flags:0x00004000
 Call Trace:
  __schedule+0x44c/0x8a0
  schedule+0x4f/0xc0
  percpu_rwsem_wait+0xd8/0x140
  ? percpu_down_write+0xf0/0xf0
  __percpu_down_read+0x56/0x70
  issue_checkpoint_thread+0x12c/0x160 [f2fs]
  ? wait_woken+0x80/0x80
  kthread+0x114/0x150
  ? __checkpoint_and_complete_reqs+0x110/0x110 [f2fs]
  ? kthread_park+0x90/0x90
  ret_from_fork+0x22/0x30

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-08 13:42:21 -08:00
Pavel Begunkov 0e9ddb39b7 io_uring: cleanup up cancel SQPOLL reqs across exec
For SQPOLL rings tctx_inflight() always returns zero, so it might skip
doing a full cancellation. That's fine because we jam all sqpoll
submissions in any case and do go through file cancellation for them,
but it's not nice.

Do the intended full cancellation by mimicking __io_uring_task_cancel()
waiting, but impersonating the SQPOLL task.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-08 08:27:25 -07:00
Yang Li 093e0687c5 jfs: turn diLog(), dataLog() and txLog() into void functions
These functions always return '0' and no callers use the return value,
so make them void functions.

This eliminates the following coccicheck warning:
./fs/jfs/jfs_txnmgr.c:1365:5-7: Unneeded variable: "rc". Return "0" on
line 1414
./fs/jfs/jfs_txnmgr.c:1422:5-7: Unneeded variable: "rc". Return "0" on
line 1527

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2021-02-08 08:46:05 -06:00
Pavel Tikhomirov cc4a3f885e fcntl: make F_GETOWN(EX) return 0 on dead owner task
Currently there is no way to differentiate a file with a live owner
from a file whose owner is dead but whose pid has been reused. That's
why CRIU can't actually know whether it needs to restore the file owner
or not, because if it restores the owner but the actual owner was dead,
this can deliver unexpected signals to the "false" owner (which reused
the pid).

Let's change the API so that F_GETOWN(EX) returns 0 in case the actual
owner is already dead. This comports with the POSIX spec, which
states that a PID of 0 indicates that no signal will be sent.
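
From userspace, the new behaviour can be observed with a minimal check
like the sketch below (error handling omitted):

  #include <fcntl.h>
  #include <stdio.h>

  int main(void)
  {
          /* Returns the owner pid (or negative pgrp), 0 if none/dead. */
          int owner = fcntl(0, F_GETOWN);

          if (owner == 0)
                  printf("no owner or owner died: no signal will be sent\n");
          else
                  printf("owner: %d\n", owner);
          return 0;
  }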

Cc: Jeff Layton <jlayton@kernel.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-02-08 07:36:13 -05:00
Andreas Gruenbacher 866eef48d8 gfs2: Add trusted xattr support
Add support for an additional filesystem version (sb_fs_format = 1802).
When a filesystem with the new version is mounted, the filesystem
supports "trusted.*" xattrs.

In addition, version 1802 filesystems implement a form of forward
compatibility for xattrs: when xattrs with an unknown prefix (ea_type)
are found on a version 1802 filesystem, those attributes are not shown
by listxattr, and they are not accessible by getxattr, setxattr, or
removexattr.

This mechanism might turn out to be what we need in the future, but if
not, we can always bump the filesystem version and break compatibility
instead.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Price <anprice@redhat.com>
2021-02-08 13:01:24 +01:00
Andrew Price 47b7ec1daa gfs2: Enable rgrplvb for sb_fs_format 1802
Turn on rgrplvb by default for sb_fs_format > 1801.

Mount options still have to be able to override this, so a new args
field to differentiate between 'off' and 'not specified' is added, and
the new default is applied only when it's not specified.

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-08 13:00:36 +01:00
Eric Biggers 07c9900131 fs-verity: support reading signature with ioctl
Add support for FS_VERITY_METADATA_TYPE_SIGNATURE to
FS_IOC_READ_VERITY_METADATA.  This allows a userspace server program to
retrieve the built-in signature (if present) of a verity file for
serving to a client which implements fs-verity compatible verification.
See the patch which introduced FS_IOC_READ_VERITY_METADATA for more
details.

The ability for userspace to read the built-in signatures is also useful
because it allows a system that is using the in-kernel signature
verification to migrate to userspace signature verification.

This has been tested using a new xfstest which calls this ioctl via a
new subcommand for the 'fsverity' program from fsverity-utils.

Link: https://lore.kernel.org/r/20210115181819.34732-7-ebiggers@kernel.org
Reviewed-by: Victor Hsieh <victorhsieh@google.com>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-07 14:51:19 -08:00
Eric Biggers 947191ac8c fs-verity: support reading descriptor with ioctl
Add support for FS_VERITY_METADATA_TYPE_DESCRIPTOR to
FS_IOC_READ_VERITY_METADATA.  This allows a userspace server program to
retrieve the fs-verity descriptor of a file for serving to a client
which implements fs-verity compatible verification.  See the patch which
introduced FS_IOC_READ_VERITY_METADATA for more details.

"fs-verity descriptor" here means only the part that userspace cares
about because it is hashed to produce the file digest.  It doesn't
include the signature which ext4 and f2fs append to the
fsverity_descriptor struct when storing it on-disk, since that way of
storing the signature is an implementation detail.  The next patch adds
a separate metadata_type value for retrieving the signature separately.

This has been tested using a new xfstest which calls this ioctl via a
new subcommand for the 'fsverity' program from fsverity-utils.

Link: https://lore.kernel.org/r/20210115181819.34732-6-ebiggers@kernel.org
Reviewed-by: Victor Hsieh <victorhsieh@google.com>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-07 14:51:17 -08:00
Eric Biggers 622699cfe6 fs-verity: support reading Merkle tree with ioctl
Add support for FS_VERITY_METADATA_TYPE_MERKLE_TREE to
FS_IOC_READ_VERITY_METADATA.  This allows a userspace server program to
retrieve the Merkle tree of a verity file for serving to a client which
implements fs-verity compatible verification.  See the patch which
introduced FS_IOC_READ_VERITY_METADATA for more details.

This has been tested using a new xfstest which calls this ioctl via a
new subcommand for the 'fsverity' program from fsverity-utils.

Link: https://lore.kernel.org/r/20210115181819.34732-5-ebiggers@kernel.org
Reviewed-by: Victor Hsieh <victorhsieh@google.com>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-07 14:51:14 -08:00
Eric Biggers e17fe6579d fs-verity: add FS_IOC_READ_VERITY_METADATA ioctl
Add an ioctl FS_IOC_READ_VERITY_METADATA which will allow reading verity
metadata from a file that has fs-verity enabled, including:

- The Merkle tree
- The fsverity_descriptor (not including the signature if present)
- The built-in signature, if present

This ioctl has similar semantics to pread().  It is passed the type of
metadata to read (one of the above three), and a buffer, offset, and
size.  It returns the number of bytes read or an error.
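
A minimal sketch of calling it from userspace (struct and constants as
defined in <linux/fsverity.h>; error handling omitted):

  #include <linux/fsverity.h>
  #include <stdint.h>
  #include <sys/ioctl.h>

  static long read_verity_metadata(int fd, uint64_t type, void *buf,
                                   uint64_t offset, uint64_t length)
  {
          struct fsverity_read_metadata_arg arg = {
                  .metadata_type = type, /* e.g. FS_VERITY_METADATA_TYPE_MERKLE_TREE */
                  .offset = offset,
                  .length = length,
                  .buf_ptr = (uintptr_t)buf,
          };

          /* Like pread(): returns the number of bytes read, or -1 on error. */
          return ioctl(fd, FS_IOC_READ_VERITY_METADATA, &arg);
  }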

Separate patches will add support for each of the above metadata types.
This patch just adds the ioctl itself.

This ioctl doesn't make any assumption about where the metadata is
stored on-disk.  It does assume the metadata is in a stable format, but
that's basically already the case:

- The Merkle tree and fsverity_descriptor are defined by how fs-verity
  file digests are computed; see the "File digest computation" section
  of Documentation/filesystems/fsverity.rst.  Technically, the way in
  which the levels of the tree are ordered relative to each other wasn't
  previously specified, but it's logical to put the root level first.

- The built-in signature is the value passed to FS_IOC_ENABLE_VERITY.

This ioctl is useful because it allows writing a server program that
takes a verity file and serves it to a client program, such that the
client can do its own fs-verity compatible verification of the file.
This only makes sense if the client doesn't trust the server and if the
server needs to provide the storage for the client.

More concretely, there is interest in using this ability in Android to
export APK files (which are protected by fs-verity) to "protected VMs".
This would use Protected KVM (https://lwn.net/Articles/836693), which
provides an isolated execution environment without having to trust the
traditional "host".  A "guest" VM can boot from a signed image and
perform specific tasks in a minimum trusted environment using files that
have fs-verity enabled on the host, without trusting the host or
requiring that the guest has its own trusted storage.

Technically, it would be possible to duplicate the metadata and store it
in separate files for serving.  However, that would be less efficient
and would require extra care in userspace to maintain file consistency.

In addition to the above, the ability to read the built-in signatures is
useful because it allows a system that is using the in-kernel signature
verification to migrate to userspace signature verification.

Link: https://lore.kernel.org/r/20210115181819.34732-4-ebiggers@kernel.org
Reviewed-by: Victor Hsieh <victorhsieh@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-07 14:51:11 -08:00
Eric Biggers fab634c4de fs-verity: don't pass whole descriptor to fsverity_verify_signature()
Now that fsverity_get_descriptor() validates the sig_size field,
fsverity_verify_signature() doesn't need to do it.

Just change the prototype of fsverity_verify_signature() to take the
signature directly rather than take a fsverity_descriptor.

Link: https://lore.kernel.org/r/20210115181819.34732-3-ebiggers@kernel.org
Reviewed-by: Victor Hsieh <victorhsieh@google.com>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Amy Parker <enbyamy@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-07 14:51:09 -08:00
Eric Biggers c2c8261151 fs-verity: factor out fsverity_get_descriptor()
The FS_IOC_READ_VERITY_METADATA ioctl will need to return the fs-verity
descriptor (and signature) to userspace.

There are a few ways we could implement this:

- Save a copy of the descriptor (and signature) in the fsverity_info
  struct that hangs off of the in-memory inode.  However, this would
  waste memory since most of the time it wouldn't be needed.

- Regenerate the descriptor from the merkle_tree_params in the
  fsverity_info.  However, this wouldn't work for the signature, nor for
  the salt which the merkle_tree_params only contains indirectly as part
  of the 'hashstate'.  It would also be error-prone.

- Just get them from the filesystem again.  The disadvantage is that in
  general we can't trust that they haven't been maliciously changed
  since the file has opened.  However, the use cases for
  FS_IOC_READ_VERITY_METADATA don't require that it verifies the chain
  of trust.  So this is okay as long as we do some basic validation.

In preparation for implementing the third option, factor out a helper
function fsverity_get_descriptor() which gets the descriptor (and
appended signature) from the filesystem and does some basic validation.

As part of this, start checking the sig_size field for overflow.
Currently fsverity_verify_signature() does this.  But the new ioctl will
need this too, so do it earlier.

Link: https://lore.kernel.org/r/20210115181819.34732-2-ebiggers@kernel.org
Reviewed-by: Victor Hsieh <victorhsieh@google.com>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-07 14:51:03 -08:00
Linus Torvalds 825b5991a4 3 small smb3 fixes for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmAeJkgACgkQiiy9cAdy
 T1EXAwv+IMIPxilkjjArn/36IG9pFBwMHtsQUojGf4dhUesL5DQzSraaRu2aYDPB
 wdNnyuLHb7UUA6khramQnAr+eQ3O1nrCzHHgGK6RQk1tDlMqTZiR51cPLF67AVqA
 Jg2Q+KlSzdPUea1G8iv61pnD6y7bAufEklNXk1Xbq9mPd81y3gfGi5bM6tKwy1a/
 pYhHsko/2n1C6NO5d24yrKjXj2rRobY8XJ0UOHax+D5VxKfZ+5ub0sulq8UEg4ki
 8BztAwkYwwU87QzkKTD8imDfAzAKyvKIQM/idrkyt1ZVkf6HdwM+EKZWbEY0G9Mv
 u8y+E7cjT17jTtkthm2bfaubWepWkc1STxmFiEp3Xy7+HDRc1UUGY/wgHObzaDLo
 P3V/G/XGCn8AJuLkpsx1iO5Cee2CHEtMCaDbCBgAnrSBxcLxqtXKOCrGXMARSRaP
 ylUl94Ek9QGlG7YAGklcsNO896vWMNfstRUtFZBD68SV44Pam333ZBfOJ5fv0Xuy
 svG0qcr8
 =5FIb
 -----END PGP SIGNATURE-----

Merge tag '5.11-rc6-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three small smb3 fixes for stable"

* tag '5.11-rc6-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: report error instead of invalid when revalidating a dentry fails
  smb3: fix crediting for compounding when only one request in flight
  smb3: Fix out-of-bounds bug in SMB2_negotiate()
2021-02-06 15:26:28 -08:00
Linus Torvalds 860b45dae9 io_uring-5.11-2021-02-05
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAd0KoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkPvD/kBm/uxstfomiryxDUeALadUZTIxkIsP8Zx
 6IijgJXvynDJutz8gjA7aynK8j4YyrktiuS4C3ctxU+cyt3/M2ZFnnQpx88gvfK5
 5OANmB1iMwUyu4GkZHsmWPo4aqv6mvE+QKKYMu8m++6/ZA4/458jx0AsjP1XSKth
 VYeRLElPTH+JcoxSgn9DwEJiGGViN26rpiy3NG2fNt/dXNFgwD8BevjXAYdQNscs
 Xrox2p2TLoMnCVWoDXg3XmMwZphigibjyWWgEEZp3LGrHU49HNUIL7GXv5No1nAO
 okxmVL3zropEgCXqxeJ5eGG+Ve1JCuvMxgl34dVN3qoN6AhfU7BXbGFeoKYSQIIW
 pgF2Qv0+KGEnRD7HOSLdygnl2gLP9ID+Xx214rKlnRE3bFkg5lwZxg4Pfos1Sn5N
 PGLqfvhZ8/Qb5BObW4qMobz3yG5ozrHJ8+EeccgNJuOGQtw3yHxp5NAotzTp97mA
 5RCw6f9HVlTcgRnDOdYskeUfb4N1i4Ps1/0RCHGWlxOpFsVkClWeDp1DTA+/gW5l
 +7vREo3vpDfNW68PgWwp5y2RyfocOgRS6pRX0gDhtsLx6MJl1YKGbU0qbamdjofm
 bOygR+Ce4rYiG+kFHkkJcWG9rjcomy2BXCHXoylx65FimYmrFuQzdxpRO2MrWpzJ
 4zQcegXM1A==
 =sGoQ
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.11-2021-02-05' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two small fixes that should go into 5.11:

   - task_work resource drop fix (Pavel)

   - identity COW fix (Xiaoguang)"

* tag 'io_uring-5.11-2021-02-05' of git://git.kernel.dk/linux-block:
  io_uring: drop mm/files between task_work_submit
  io_uring: don't modify identity's files unless identity is cowed
2021-02-06 14:37:24 -08:00
Aurelien Aptel 21b200d091 cifs: report error instead of invalid when revalidating a dentry fails
Assuming
- //HOST/a is mounted on /mnt
- //HOST/b is mounted on /mnt/b

On a slow connection, running 'df' and killing it while it's
processing /mnt/b can make cifs_get_inode_info() return -ERESTARTSYS.

This triggers the following chain of events:
=> the dentry revalidation fails
=> dentry is put and released
=> superblock associated with the dentry is put
=> /mnt/b is unmounted

This patch makes cifs_d_revalidate() return the error instead of 0
(invalid) when cifs_revalidate_dentry() fails, except for ENOENT (file
deleted) and ESTALE (file recreated).
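
A sketch of the resulting logic in cifs_d_revalidate() (not the verbatim
hunk; d_revalidate returns 1 for valid, 0 for invalid, or an error):

  rc = cifs_revalidate_dentry(direntry);
  if (rc) {
          /* Deleted (ENOENT) or recreated (ESTALE): really invalid. */
          if (rc == -ENOENT || rc == -ESTALE)
                  return 0;
          /* Anything else (e.g. -ERESTARTSYS): report the error. */
          return rc;
  }
  return 1; /* dentry is valid */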

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Suggested-by: Shyam Prasad N <nspmangalore@gmail.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
CC: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-05 13:17:48 -06:00
Bob Peterson 78178ca844 gfs2: Don't skip dlm unlock if glock has an lvb
Patch fb6791d100 was designed to allow gfs2 to unmount more quickly by
skipping the step where it tells dlm to unlock glocks in EX with lvbs.
This was done because when gfs2 unmounts a file system, it destroys the
dlm lockspace shortly after it destroys the glocks so it doesn't need to
unlock them all: the unlock is implied when the lockspace is destroyed
by dlm.

However, that patch introduced a use-after-free in dlm: as part of its
normal dlm_recoverd process, it can call ls_recovery to recover dead
locks. In so doing, it can call recover_rsbs which calls recover_lvb for
any mastered rsbs. Func recover_lvb runs through the list of lkbs queued
to the given rsb (if the glock is cached but unlocked, it will still be
queued to the lkb, but in NL--Unlocked--mode) and if it has an lvb,
copies it to the rsb, thus trying to preserve the lkb. However, when
gfs2 skips the dlm unlock step, it frees the glock and its lvb, which
means dlm's function recover_lvb references the now freed lvb pointer,
copying the freed lvb memory to the rsb.

This patch changes the check in gdlm_put_lock so that it calls
dlm_unlock for all glocks that contain an lvb pointer.

Fixes: fb6791d100 ("GFS2: skip dlm_unlock calls in unmount")
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-05 20:08:47 +01:00
Muchun Song 585fc0d287 mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page
If a new hugetlb page is allocated during fallocate it will not be
marked as active (set_page_huge_active) which will result in a later
isolate_huge_page failure when the page migration code would like to
move that page.  Such a failure would be unexpected and wrong.

Only export set_page_huge_active and leave clear_page_huge_active
static, because there are no external users.

Link: https://lkml.kernel.org/r/20210115124942.46403-3-songmuchun@bytedance.com
Fixes: 70c3547e36 (hugetlbfs: add hugetlbfs_fallocate())
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-05 11:03:47 -08:00
Andreas Gruenbacher 834ec3e1ee gfs2: Lock imbalance on error path in gfs2_recover_one
In gfs2_recover_one, fix a sd_log_flush_lock imbalance when a recovery
pass fails.

Fixes: c9ebc4b737 ("gfs2: allow journal replay to hold sd_log_flush_lock")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-05 18:18:57 +01:00
Pavel Begunkov 257e84a537 io_uring: refactor sendmsg/recvmsg iov managing
Current iov handling with recvmsg/sendmsg may be confusing. First make
a rule for msg->iov: either it points to an allocated iov that has to be
kfree()'d later, or it's NULL and we use fast_iov. That's much better
than the current 3-state version (where it can also point to fast_iov).
And rename it to free_iov for uniformity with read/write.
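
The rule, as a minimal sketch:

  /* free_iov == NULL means the inline fast_iov array is in use. */
  kfree(kmsg->free_iov); /* kfree(NULL) is a no-op */
  kmsg->free_iov = NULL;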

Also, the fixing up of msg.msg_iter.iov after copying struct
io_async_msghdr has been happening in io_recvmsg()/io_sendmsg(). Move
it into io_setup_async_msg(), which is the right place.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: add comment on NULL check before kfree()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-05 07:46:22 -07:00
Pavel Begunkov 5476dfed29 io_uring: clean iov usage for recvmsg buf select
Don't pretend we don't know that REQ_F_BUFFER_SELECT for recvmsg always
uses fast_iov -- clean up the confusing intermixing of kmsg->iov and
kmsg->fast_iov for buffer select.

Also don't init iter with garbage in __io_recvmsg_copy_hdr() only for it
to be set shortly after in io_recvmsg().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-05 07:45:41 -07:00
Pavel Begunkov 2a7808024b io_uring: set msg_name on msg fixup
io_setup_async_msg() should fully prepare io_async_msghdr, let it also
handle assigning msg_name and don't hand code it in [send,recv]msg().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-05 07:45:41 -07:00
Pavel Shilovsky 91792bb808 smb3: fix crediting for compounding when only one request in flight
Currently we try to guess if a compound request is going to
succeed waiting for credits or not based on the number of
requests in flight. This approach doesn't work correctly
all the time because there may be only one request in
flight which is going to bring multiple credits satisfying
the compound request.

Change the behavior to fail a request only if there are no requests
in flight at all and proceed waiting for credits otherwise.

Cc: <stable@vger.kernel.org> # 5.1+
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-05 08:12:00 -06:00
Pavel Begunkov aec18a57ed io_uring: drop mm/files between task_work_submit
Since the SQPOLL task can be shared, and so task_work entries can be a
mix of them, we need to drop mm and files before trying to issue the
next request.

Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 12:42:58 -07:00
Linus Torvalds 4cb2c00c43 overlayfs fixes for 5.11-rc7
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYBuyTQAKCRDh3BK/laaZ
 PBBhAPwLy3ksQLhY7in4I8aKrSyWRpaCSAeLQUitxnX3eQiQnAD/S1EEIapwradV
 y4ou1PBRsGnhwNgArXODVCcTgqDJqw8=
 =GjU4
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:

 - Fix capability conversion and minor overlayfs bugs that are related
   to the unprivileged overlay mounts introduced in this cycle.

 - Fix two recent (v5.10) and one old (v4.10) bug.

 - Clean up security xattr copy-up (related to a SELinux regression).

* tag 'ovl-fixes-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: implement volatile-specific fsync error behaviour
  ovl: skip getxattr of security labels
  ovl: fix dentry leak in ovl_get_redirect
  ovl: avoid deadlock on directory ioctl
  cap: fix conversions on getxattr
  ovl: perform vfs_getxattr() with mounter creds
  ovl: add warning on user_ns mismatch
2021-02-04 10:01:17 -08:00
Darrick J. Wong 45068063ef xfs: fix incorrect root dquot corruption error when switching group/project quota types
While writing up a regression test for broken behavior when a chprojid
request fails, I noticed that we were logging corruption notices about
the root dquot of the group/project quota file at mount time when
testing V4 filesystems.

In commit afeda6000b, I was trying to improve ondisk dquot validation
by making sure that when we load an ondisk dquot into memory on behalf
of an incore dquot, the dquot id and type matches.  Unfortunately, I
forgot that V4 filesystems only have two quota files, and can switch
that file between group and project quota types at mount time.  When we
perform that switch, we'll try to load the default quota limits from the
root dquot prior to running quotacheck and log a corruption error when
the types don't match.

This is inconsequential because quotacheck will reset the second quota
file as part of doing the switch, but we shouldn't leave scary messages
in the kernel log.

Fixes: afeda6000b ("xfs: validate ondisk/incore dquot flags")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-02-04 09:10:38 -08:00
Pavel Begunkov 5280f7e530 io_uring/io-wq: return 2-step work swap scheme
Saving one lock/unlock for io-wq is not super important, but it adds
some ugliness to the code. More importantly, atomic decrements that
don't reach zero won't give the right ordering/barriers on some archs,
so io_steal_work() may pretty easily get subtly and completely broken.

Return back 2-step io-wq work exchange and clean it up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov ea64ec02b3 io_uring: deduplicate file table slot calculation
Extract a helper io_fixed_file_slot() returning a place in our fixed
files table, so we don't hand-code it three times in the code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov 847595de17 io_uring: io_import_iovec return type cleanup
io_import_iovec() doesn't return the IO size anymore, only an error
code. Make that more apparent by returning int instead of ssize_t, and
clean up the leftovers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov 75c668cdd6 io_uring: treat NONBLOCK and RWF_NOWAIT similarly
Make decision making of whether we need to retry read/write similar for
O_NONBLOCK and RWF_NOWAIT. Set REQ_F_NOWAIT when either is specified and
use it for all relevant checks. Also fix resubmitting NOWAIT requests
via io_rw_reissue().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov b23df91bff io_uring: highlight read-retry loop
We already have an implicit do-while for read-retries, but with a goto
at the end. Convert it to an actual do-while; this highlights the loop,
making it a bit more understandable, and is cleaner in general.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov 5ea5dd4584 io_uring: inline io_read()'s iovec freeing
io_read() does not have the simplest control flow, with a lot of jumps,
and it's hard to read. One of those is the out_free: label, which frees
the iovec. However, from the middle of io_read() the iovec is NULL'ed,
and so kfree(iovec) is a no-op. That leaves us with two places where we
can inline it and further clean up the code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov 7335e3bf9d io_uring: don't forget to adjust io_size
We have an invariant in io_read(): how much we're trying to read is
spilled into both the iter and the io_size variable. The latter controls
the decision making about whether to do read-retries. However, io_size
is modified only after the first read attempt, so if we happen to go for
a third retry in a single call to io_read(), we will get an io_size
greater than what is in the iterator, which may lead to various side
effects up to live-locking.

Modify io_size each time.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov 6bf985dc50 io_uring: let io_setup_async_rw take care of iovec
Now we hand ownership of the iovec to io_setup_async_rw(), so it
either sets the request's context right or frees the iovec on error
itself. This makes our life a bit easier at call sites.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov 1a2cc0ce8d io_uring: further simplify do_read error parsing
First, instead of checking iov_iter_count(iter) for 0 to find out that
all needed bytes were read, just compare the return code against
io_size. It's more reliable and arguably cleaner.
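
As a sketch of the difference (names as in the description; not the
verbatim hunk):

  /* Before: done when the iterator was fully consumed. */
  if (!iov_iter_count(iter))
          goto done;

  /* After: done when we read exactly the number of bytes requested. */
  if (ret == io_size)
          goto done;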

Also, place the half-read case into an else branch and delete an extra
label.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov 6713e7a614 io_uring: refactor io_read for unsupported nowait
The !io_file_supports_async() case of io_read() is hard to read: it
jumps somewhere into the middle of the function just to do the async
setup and fail on a similar check. Call io_setup_async_rw() directly
for this case, it's much easier to follow.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov eeb60b9ab4 io_uring: refactor io_cqring_wait
It's easy to make a mistake in io_cqring_wait() because for all
break/continue clauses we need to watch that prepare/finish_wait are
used correctly. Extract all of that into a new helper,
io_cqring_wait_schedule(), and transform the loop into a simple series
of function calls: prepare(); check_and_schedule(); finish();
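
A minimal sketch of the resulting loop shape (the helper name is taken
from this patch; the loop body details are illustrative, not the actual
hunk):

  do {
          prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
                                    TASK_INTERRUPTIBLE);
          /* schedules unless a wake-up condition is already met */
          ret = io_cqring_wait_schedule(ctx, &iowq, &timeout);
  } while (ret > 0);
  finish_wait(&ctx->wait, &iowq.wq);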

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov c1d5a22468 io_uring: refactor scheduling in io_cqring_wait
schedule_timeout() with timeout=MAX_SCHEDULE_TIMEOUT is guaranteed to
work just like schedule(), so instead of hand-coding the choice based on
the arguments, always use the timeout version and simplify the code.
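
In other words (illustrative; has_timeout/timeout stand in for the real
arguments):

  /* hand-coded: pick a variant based on the arguments */
  if (!has_timeout)
          schedule();
  else
          left = schedule_timeout(timeout);

  /* simplified: MAX_SCHEDULE_TIMEOUT makes one form cover both */
  left = schedule_timeout(has_timeout ? timeout : MAX_SCHEDULE_TIMEOUT);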

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov 9936c7c2bc io_uring: deduplicate core cancellations sequence
Files and task cancellations go over the same steps trying to cancel
requests in io-wq, poll, etc. Deduplicate them with a helper.

Note: the new io_uring_try_cancel_requests() is the former
__io_uring_cancel_task_requests() with files passed as an argument and
with flushing of overflowed requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Xiaoguang Wang d7e10d4769 io_uring: don't modify identity's files unless identity is cowed
Abaci Robot reported the following panic:
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 800000010ef3f067 P4D 800000010ef3f067 PUD 10d9df067 PMD 0
Oops: 0002 [#1] SMP PTI
CPU: 0 PID: 1869 Comm: io_wqe_worker-0 Not tainted 5.11.0-rc3+ #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:put_files_struct+0x1b/0x120
Code: 24 18 c7 00 f4 ff ff ff e9 4d fd ff ff 66 90 0f 1f 44 00 00 41 57 41 56 49 89 fe 41 55 41 54 55 53 48 83 ec 08 e8 b5 6b db ff  41 ff 0e 74 13 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f e9 9c
RSP: 0000:ffffc90002147d48 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88810d9a5300 RCX: 0000000000000000
RDX: ffff88810d87c280 RSI: ffffffff8144ba6b RDI: 0000000000000000
RBP: 0000000000000080 R08: 0000000000000001 R09: ffffffff81431500
R10: ffff8881001be000 R11: 0000000000000000 R12: ffff88810ac2f800
R13: ffff88810af38a00 R14: 0000000000000000 R15: ffff8881057130c0
FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010dbaa002 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __io_clean_op+0x10c/0x2a0
 io_dismantle_req+0x3c7/0x600
 __io_free_req+0x34/0x280
 io_put_req+0x63/0xb0
 io_worker_handle_work+0x60e/0x830
 ? io_wqe_worker+0x135/0x520
 io_wqe_worker+0x158/0x520
 ? __kthread_parkme+0x96/0xc0
 ? io_worker_handle_work+0x830/0x830
 kthread+0x134/0x180
 ? kthread_create_worker_on_cpu+0x90/0x90
 ret_from_fork+0x1f/0x30
Modules linked in:
CR2: 0000000000000000
---[ end trace c358ca86af95b1e7 ]---

The following case can trigger the above panic: two threads operate on
different io_uring ctxs but share the same sqthread identity; when one
thread exits, io_uring_cancel_task_requests() clears
task->io_uring->identity->files to NULL in sqpoll mode, and then the
other ctx that uses the same identity will panic.

Indeed, we don't need to clear task->io_uring->identity->files here;
io_grab_identity() should handle identity->files changes well: if
task->io_uring->identity->files is not equal to current->files,
io_cow_identity() should handle the change.

Cc: stable@vger.kernel.org # 5.5+
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 07:43:21 -07:00
Theodore Ts'o b5776e7524 ext4: fix potential htree index checksum corruption
In the case where we need to do an interior node split, and
immediately afterwards, we are unable to allocate a new directory leaf
block due to ENOSPC, the directory index checksums will not be filled
in correctly (and indeed, will not be correctly journalled).

This looks like a bug that was introduced when we added largedir
support.  The original code doesn't make any sense (and should have
been caught in code review), but it was hidden because most of the
time, the index node checksum will be set by do_split().  But if
do_split bails out due to ENOSPC, then ext4_handle_dirty_dx_node()
won't get called, and so the directory index checksum field will not
get set, leading to:

EXT4-fs error (device sdb): dx_probe:858: inode #6635543: block 4022: comm nfsd: Directory index failed checksum

Google-Bug-Id: 176345532
Fixes: e08ac99fa2 ("ext4: add largedir feature")
Cc: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-04 00:05:20 -05:00
Daeho Jeong e659206617 f2fs: add ckpt_thread_ioprio sysfs node
Added "ckpt_thread_ioprio" sysfs node to give a way to change checkpoint
merge daemon's io priority. Its default value is "be,3", which means
"BE" I/O class and I/O priority "3". We can select the class between "rt"
and "be", and set the I/O priority within valid range of it.
"," delimiter is necessary in between I/O class and priority number.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-03 13:03:58 -08:00
Daeho Jeong 261eeb9c15 f2fs: introduce checkpoint_merge mount option
We've added new mount options, "checkpoint_merge" and
"nocheckpoint_merge". The former creates a kernel daemon and makes it
merge concurrent checkpoint requests as much as possible, to eliminate
redundant checkpoint issuing. Plus, we can eliminate the sluggishness
caused by a slow checkpoint operation when the checkpoint is done in the
context of a process in a cgroup with a low I/O budget and CPU shares.
To make this work better, we set the default I/O priority of the kernel
daemon to "3", one priority higher than other kernel threads. The
verification result below demonstrates this.
The basic idea comes from https://opensource.samsung.com.

[Verification]
Android Pixel Device(ARM64, 7GB RAM, 256GB UFS)
Create two I/O cgroups (fg w/ weight 100, bg w/ weight 20)
Set "strict_guarantees" to "1" in BFQ tunables

In "fg" cgroup,
- thread A => trigger 1000 checkpoint operations
  "for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1;
   done"
- thread B => generating async. I/O
  "fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1
       --filename=test_img --name=test"

In "bg" cgroup,
- thread C => trigger repeated checkpoint operations
  "echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file;
   fsync test_dir2; done"

We've measured thread A's execution time.

[ w/o patch ]
Elapsed Time: Avg. 68 seconds
[ w/  patch ]
Elapsed Time: Avg. 48 seconds

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-03 13:03:06 -08:00
Christoph Hellwig f69e8091c4 xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
The mp variable in xfs_file_compat_ioctl is only used when
BROKEN_X86_ALIGNMENT is defined.  Remove it and just open code the
dereference in a few places.
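
Schematically (the call site shown is hypothetical):

  /* before: a local that is unused unless BROKEN_X86_ALIGNMENT is defined */
  struct xfs_mount *mp = ip->i_mount;
  error = xfs_compat_some_ioctl(mp, arg);

  /* after: open-code the dereference where it is needed */
  error = xfs_compat_some_ioctl(ip->i_mount, arg);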

Link: https://lore.kernel.org/r/20210203173009.462205-1-christian.brauner@ubuntu.com
Fixes: f736d93d76 ("xfs: support idmapped mounts")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-02-03 20:31:14 +01:00
BingJing Chang 3a9a3aa805 udf: handle large user and group ID
If the uid or gid mount option is larger than INT_MAX, udf_fill_super
will return -EINVAL.

The problem can be encountered by a domain user or reproduced via:
mount -o loop,uid=2147483648 something-in-udf-format.iso /mnt

This can be fixed in the same way as commit 233a01fa9c ("fuse: handle
large user and group ID").

Link: https://lore.kernel.org/r/20210129045502.10546-1-bingjingc@synology.com
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-02-03 19:05:54 +01:00
BingJing Chang a0b3cb71a1 isofs: handle large user and group ID
If the uid or gid mount option is larger than INT_MAX, isofs_fill_super
will return -EINVAL.

The problem can be encountered by a domain user or reproduced via:
mount -o loop,uid=2147483648 ubuntu-16.04.6-server-amd64.iso /mnt

This can be fixed in the same way as commit 233a01fa9c ("fuse: handle
large user and group ID").

Link: https://lore.kernel.org/r/20210129045315.10375-1-bingjingc@synology.com
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-02-03 19:05:52 +01:00
Andreas Gruenbacher 76fce65489 gfs2: Move function gfs2_ail_empty_tr
Move this function further up in log.c so that we can use it in the next patch.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-03 18:37:25 +01:00
Andreas Gruenbacher 5cb738b5fb gfs2: Get rid of current_tail()
Keep the current value of the updated log tail in the super block as
sb_log_flush_tail instead of computing it on the fly.  This avoids
unnecessary sd_ail_lock taking and cleans up the code.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-03 18:37:25 +01:00
Andreas Gruenbacher 297de3180d gfs2: Use a tighter bound in gfs2_trans_begin
Use a tighter bound for the number of blocks required by transactions in
gfs2_trans_begin: in the worst case, we'll have mixed data and metadata,
so we'll need a log descriptor for each type.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-03 18:37:25 +01:00
Andreas Gruenbacher 5ae8fff8d0 gfs2: Clean up gfs2_log_reserve
Wake up log waiters in gfs2_log_release when log space has actually become
available.  This is a much better place for the wakeup than gfs2_logd.

Check if enough log space is immediately available before anything else.  If
there isn't, use io_wait_event to wait instead of open-coding it.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-03 18:37:24 +01:00
Andreas Gruenbacher 4a3d049db4 gfs2: Don't wait for journal flush in clean_journal
Commit 588bff95c9 added gfs2_write_log_header() and started using it in
clean_journal(), with an additional call to log_flush_wait() at the end of
gfs2_write_log_header() which is unnecessary for clean_journal().  Move
that call out of gfs2_write_log_header() to restore the previous behavior.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-03 18:37:24 +01:00
Andreas Gruenbacher c1eba1b0bc gfs2: Move lock flush locking to gfs2_trans_{begin,end}
Move the read locking of sd_log_flush_lock from gfs2_log_reserve to
gfs2_trans_begin, and its unlocking from gfs2_log_release to
gfs2_trans_end.  Use gfs2_log_release in two places in which it was open
coded before.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-03 18:37:24 +01:00
Andreas Gruenbacher f3708fb59f gfs2: Get rid of sd_reserving_log
This counter and the associated wait queue are only used so that
gfs2_make_fs_ro can efficiently wait for all pending log space
allocations to fail after setting the filesystem to read-only.  This
comes at the cost of waking up that wait queue very frequently.

Instead, when gfs2_log_reserve fails because the filesystem has become
read-only, wake up sd_log_waitq.  In gfs2_make_fs_ro, set the file
system read-only and then wait until all the log space has been
released.  Give up and report the problem after a while.  With that,
sd_reserving_log and sd_reserving_log_wait can be removed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-03 18:37:24 +01:00
Andreas Gruenbacher c968f5788b gfs2: Clean up on-stack transactions
Replace the TR_ALLOCED flag by its inverse, TR_ONSTACK: that way, the flag only
needs to be set in the exceptional case of on-stack transactions.  Split off
__gfs2_trans_begin from gfs2_trans_begin and use it to replace the open-coded
version in gfs2_ail_empty_gl.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-03 18:37:10 +01:00
Gao Xiang 07aabd9c4a xfs: get rid of xfs_growfs_{data,log}_t
Such usage isn't encouraged by the kernel coding style. Leave the
definitions themselves alone in case there are userspace users.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-03 09:18:50 -08:00
Gao Xiang ce5e1062e2 xfs: rename `new' to `delta' in xfs_growfs_data_private()
It actually means the delta block count of growfs. Rename it to make
that clear. Also introduce nb_div to avoid reusing `delta`.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-03 09:18:50 -08:00
Zorro Lang bc41fa5321 libxfs: expose inobtcount in xfs geometry
Now that xfs supports the inode btree block counters feature, expose
this feature flag in the xfs geometry, so that userspace can check
whether inobtcount is enabled.
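
A sketch of the idea (the geometry flag is what such a patch would add;
the feature predicate already exists in xfs):

  /* report the on-disk feature through the fs geometry ioctl */
  if (xfs_sb_version_hasinobtcounts(sbp))
          geo->flags |= XFS_FSOP_GEOM_FLAGS_INOBTCNT;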

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-03 09:18:50 -08:00
Darrick J. Wong 0fa4a10a2f xfs: don't bounce the iolock between free_{eof,cow}blocks
Since xfs_inode_free_eofblocks and xfs_inode_free_cowblocks are now
internal static functions, we can save ourselves a cycling of the iolock
by passing the lock state out to xfs_blockgc_scan_inode and letting it
do all the unlocking.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:50 -08:00
Darrick J. Wong 47bd6d3457 xfs: expose the blockgc workqueue knobs publicly
Expose the workqueue sysfs knobs for the speculative preallocation gc
workers on all kernels, and update the sysadmin information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:50 -08:00
Darrick J. Wong 894ecacf0f xfs: parallelize block preallocation garbage collection
Split the block preallocation garbage collection work into per-AG work
items so that we can take advantage of parallelization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:50 -08:00
Darrick J. Wong c9a6526fe7 xfs: rename block gc start and stop functions
Shorten the names of the two functions that start and stop block
preallocation garbage collection and move them up to the other blockgc
functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:50 -08:00
Darrick J. Wong 419567534e xfs: only walk the incore inode tree once per blockgc scan
Perform background block preallocation gc scans more efficiently by
walking the incore inode tree once.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 9669f51de5 xfs: consolidate the eofblocks and cowblocks workers
Remove the separate cowblocks work items and knob so that we can control
and run everything from a single blockgc work queue.  Note that the
speculative_prealloc_lifetime sysfs knob retains its historical name
even though the functions move to prefix xfs_blockgc_*.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong ce2d3bbe06 xfs: consolidate incore inode radix tree posteof/cowblocks tags
The clearing of posteof blocks and cowblocks serves the same purpose:
removing speculative block preallocations from inactive files.  We don't
need to burn two radix tree tags on this, so combine them into one.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 865ac8e253 xfs: remove trivial eof/cowblocks functions
Get rid of these trivial helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong b943c0cd56 xfs: hide xfs_icache_free_cowblocks
Change the one remaining caller of xfs_icache_free_cowblocks to use our
new combined blockgc scan function instead, since we will soon be
combining the two scans.  This introduces a slight behavior change,
since a readonly remount now clears out post-EOF preallocations and not
just CoW staging extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 0461a320e3 xfs: hide xfs_icache_free_eofblocks
Change the one remaining caller of xfs_icache_free_eofblocks to use our
new combined blockgc scan function instead, since we will soon be
combining the two scans.  This introduces a slight behavior change,
since the XFS_IOC_FREE_EOFBLOCKS now clears out speculative CoW
reservations in addition to post-eof blocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong f929656983 xfs: relocate the eofb/cowb workqueue functions
Move the xfs_{eof,cow}blocks_worker and xfs_queue_{eof,cow}blocks
functions further down in the file so that the cleanups in the next
patches won't have to pre-declare static functions.  No functional
changes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 05a302a170 xfs: set WQ_SYSFS on all workqueues in debug mode
When CONFIG_XFS_DEBUG=y, set WQ_SYSFS on all workqueues that we create
so that we (developers) have a means to monitor cpu affinity and whatnot
for background workers.  In the next patchset we'll expose knobs for
more of the workqueues publicly and document it, but not now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong f83d436aef xfs: increase the default parallelism levels of pwork clients
Increase the parallelism level for pwork clients to the workqueue
defaults so that we can take advantage of computers with a lot of CPUs
and a lot of hardware.  On fast systems this will speed up quotacheck by
a large factor, and the following posteof/cowblocks cleanup series will
use the functionality presented in this patch to run garbage collection
as quickly as possible.

We do this by switching the pwork workqueue to unbounded, since the
current user (quotacheck) runs lengthy scans for each work item and we
don't care about dispatching the work on a warm cpu cache or anything
like that.  Also set WQ_SYSFS so that we can monitor where the wq is
running.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong a1a7d05a05 xfs: flush speculative space allocations when we run out of space
If a fs modification (creation, file write, reflink, etc.) is unable to
reserve enough space to handle the modification, try clearing whatever
space the filesystem might have been hanging onto in the hopes of
speeding up the filesystem.  The flushing behavior will become
particularly important when we add deferred inode inactivation because
that will increase the amount of space that isn't actively tied to user
data.
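
A minimal sketch of the retry idea (xfs_blockgc_free_space() is the scan
entry point this series builds towards; the surrounding reservation call
is illustrative):

  error = xfs_trans_alloc(mp, resv, blocks, rtextents, 0, &tp);
  if (error == -ENOSPC) {
          /* free speculative preallocations, then retry the reservation */
          xfs_blockgc_free_space(mp, NULL);
          error = xfs_trans_alloc(mp, resv, blocks, rtextents, 0, &tp);
  }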

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 85c5b27075 xfs: refactor xfs_icache_free_{eof,cow}blocks call sites
In anticipation of more restructuring of the eof/cowblocks gc code,
refactor calling of those two functions into a single internal helper
function, then present a new standard interface to purge speculative
block preallocations and start shifting higher level code to use that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 38899f8099 xfs: add a tracepoint for blockgc scans
Add some tracepoints so that we can observe when the speculative
preallocation garbage collector runs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 758303d144 xfs: flush eof/cowblocks if we can't reserve quota for chown
If a file user, group, or project change is unable to reserve enough
quota to handle the modification, try clearing whatever space the
filesystem might have been hanging onto in the hopes of speeding up the
filesystem.  The flushing behavior will become particularly important
when we add deferred inode inactivation because that will increase the
amount of space that isn't actively tied to user data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong c237dd7c70 xfs: flush eof/cowblocks if we can't reserve quota for inode creation
If an inode creation is unable to reserve enough quota to handle the
modification, try clearing whatever space the filesystem might have been
hanging onto in the hopes of speeding up the filesystem.  The flushing
behavior will become particularly important when we add deferred inode
inactivation because that will increase the amount of space that isn't
actively tied to user data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 766aabd599 xfs: flush eof/cowblocks if we can't reserve quota for file blocks
If a fs modification (data write, reflink, xattr set, fallocate, etc.)
is unable to reserve enough quota to handle the modification, try
clearing whatever space the filesystem might have been hanging onto in
the hopes of speeding up the filesystem.  The flushing behavior will
become particularly important when we add deferred inode inactivation
because that will increase the amount of space that isn't actively tied
to user data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 4ca7420568 xfs: try worst case space reservation upfront in xfs_reflink_remap_extent
Now that we've converted xfs_reflink_remap_extent to use the new
xfs_trans_alloc_inode API, we can focus on its slightly unusual behavior
with regard to quota reservations.

Since it's valid to remap written blocks into a hole, we must be able to
increase the quota count by the number of blocks in the mapping.
However, the incore space reservation process requires us to supply an
asymptotic guess before we can gain exclusive access to resources.  We'd
like to reserve all the quota we need up front, but we also don't want
to fail a written -> allocated remap operation unnecessarily.

The solution is to make the remap_extents function call the transaction
allocation function twice.  The first time we ask to reserve enough
space and quota to handle the absolute worst case situation, but if that
fails, we can fall back to the old strategy: ask for the bare minimum
space reservation upfront and increase the quota reservation later if we
need to.

Later in this patchset we change the transaction and quota code to try
to reclaim space if we cannot reserve free space or quota.
Restructuring the remap_extent function in this manner means that if the
fallback increase fails, we can pass that back to the caller knowing
that the transaction allocation already tried freeing space.
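
Schematically, with reserve() standing in for the transaction allocation
call (hypothetical helper, for illustration only):

  /* pass 1: worst case -- all of the space and quota up front */
  error = reserve(worst_case_blocks, worst_case_quota);
  if (error == -EDQUOT || error == -ENOSPC)
          /* pass 2: bare minimum now, bump the quota reservation later */
          error = reserve(min_blocks, 0);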

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 111068f80e xfs: pass flags and return gc errors from xfs_blockgc_free_quota
Change the signature of xfs_blockgc_free_quota in preparation for the
next few patches.  Callers can now pass EOF_FLAGS into the function to
control scan parameters; and the function will now pass back any
corruption errors seen while scanning, though for our retry loops we'll
just try again unconditionally.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 3d4feec006 xfs: move and rename xfs_inode_free_quota_blocks to avoid conflicts
Move this function further down in the file so that later cleanups won't
have to declare static functions.  Change the name because we're about
to rework all the code that performs garbage collection of speculatively
allocated file blocks.  No functional changes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 9a537de3b0 xfs: xfs_inode_free_quota_blocks should scan project quota
Buffered writers who have run out of quota reservation call
xfs_inode_free_quota_blocks to try to free any space reservations that
might reduce the quota usage.  Unfortunately, the buffered write path
treats "out of project quota" the same as "out of overall space" so this
function has never supported scanning for space that might ease an "out
of project quota" condition.

We're about to start using this function for cases where we actually
/can/ tell if we're out of project quota, so add in this functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong f41a0716f4 xfs: don't stall cowblocks scan if we can't take locks
Don't stall the cowblocks scan on a locked inode if we possibly can.
We'd much rather the background scanner keep moving.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong a636b1d1cf xfs: trigger all block gc scans when low on quota space
The functions to run an eof/cowblocks scan to try to reduce quota usage
are kind of a mess -- the logic repeatedly initializes an eofb structure
and there are logic bugs in the code that result in the cowblocks scan
never actually happening.

Replace all three functions with a single function that fills out an
eofb and runs both eof and cowblocks scans.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 2a4bdfa855 xfs: shut down the filesystem if we screw up quota reservation
If we ever screw up the quota reservations enough to trip the
assertions, something's wrong with the quota code.  Shut down the
filesystem when this happens, because this is corruption.
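
Roughly (the tripped-assertion condition is hypothetical; the shutdown
call and flag are existing xfs interfaces):

  if (WARN_ON_ONCE(quota_reservation_inconsistent))
          xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);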

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong fea7aae6ce xfs: rename code to error in xfs_ioctl_setattr
Rename the 'code' variable to 'error' to follow the naming convention of
most other functions in xfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 5c615f0feb xfs: remove xfs_qm_vop_chown_reserve
Now that the only caller of this function is xfs_trans_alloc_ichange,
just open-code the meat of _chown_reserve in that caller.  Drop the
(redundant) [ugp]id checks because xfs has a 1:1 relationship between
quota ids and incore dquots.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 7317a03df7 xfs: refactor inode ownership change transaction/inode/quota allocation idiom
For file ownership (uid, gid, prid) changes, create a new helper
xfs_trans_alloc_ichange that allocates a transaction and reserves the
appropriate amount of quota against that transction in preparation for a
change of user, group, or project id.  Replace all the open-coded idioms
with a single call to this helper so that we can contain the retry loops
in the next patchset.

This changes the locking behavior for ichange transactions slightly.
Since tr_ichange does not have a permanent reservation and cannot roll,
we pass XFS_ILOCK_EXCL to ijoin so that the inode will be unlocked
automatically at commit time.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong f2f7b9ff62 xfs: refactor inode creation transaction/inode/quota allocation idiom
For file creation, create a new helper xfs_trans_alloc_icreate that
allocates a transaction and reserves the appropriate amount of quota
against that transction.  Replace all the open-coded idioms with a
single call to this helper so that we can contain the retry loops in the
next patchset.

This changes the locking behavior for non-tempfile creation slightly, in
that we now make the quota reservation without holding the directory
ILOCK.  While the dquots chosen for inode creation are based on the
directory state at a given point in time, the directory ILOCK was
released as soon as the dquot references are picked up.  Hence it was
never necessary to hold the directory ILOCK for the quota reservation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong f273387b04 xfs: refactor reflink functions to use xfs_trans_alloc_inode
The two remaining callers of xfs_trans_reserve_quota_nblks are in the
reflink code.  These conversions aren't as uniform as the previous
conversions, so call that out in a separate patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 3de4eb106f xfs: allow reservation of rtblocks with xfs_trans_alloc_inode
Make it so that we can reserve rt blocks with the xfs_trans_alloc_inode
wrapper function, then convert a few more callsites.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 3a1af6c317 xfs: refactor common transaction/inode/quota allocation idiom
Create a new helper xfs_trans_alloc_inode that allocates a transaction,
locks and joins an inode to it, and then reserves the appropriate amount
of quota against that transction.  Then replace all the open-coded
idioms with a single call to this helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 02b7ee4eb6 xfs: reserve data and rt quota at the same time
Modify xfs_trans_reserve_quota_nblks so that we can reserve data and
realtime blocks from the dquot at the same time.  This change has the
theoretical side effect that for allocations to realtime files we will
reserve from the dquot both the number of rtblocks being allocated and
the number of bmbt blocks that might be needed to add the mapping.
However, since the mount code disables quota if it finds a realtime
device, this should not result in any behavior changes.

Now that we've moved the inode creation callers away from using the
_nblks function, we can repurpose the (now unused) ninos argument for
realtime blocks, so make that change.  This also replaces the flags
argument with a boolean parameter to force the reservation since we
don't need to distinguish between data and rt quota reservations any
more, and the only flag being passed in was FORCE_RES.
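
The reworked signature then looks roughly like this (a sketch based on
the description above):

  int xfs_trans_reserve_quota_nblks(struct xfs_trans *tp,
                                    struct xfs_inode *ip,
                                    int64_t dblocks, int64_t rblocks,
                                    bool force);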

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 7ac6eb46c9 xfs: fix up build warnings when quotas are disabled
Fix some build warnings on gcc 10.2 when quotas are disabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong ad4a747397 xfs: clean up icreate quota reservation calls
Create a proper helper so that inode creation calls can reserve quota
with a dedicated function.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 35b1101099 xfs: remove xfs_trans_unreserve_quota_nblks completely
xfs_trans_cancel will release all the quota resources that were reserved
on behalf of the transaction, so get rid of the explicit unreserve step.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 8554650003 xfs: create convenience wrappers for incore quota block reservations
Create a couple of convenience wrappers for creating and deleting quota
block reservations against future changes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong 4abe21ad67 xfs: clean up quota reservation callsites
Convert a few xfs_trans_*reserve* callsites that are open-coding other
convenience functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:49 -08:00
Darrick J. Wong b8055ed677 xfs: reduce quota reservation when doing a dax unwritten extent conversion
In commit 3b0fe47805, we reduced the free space requirement to
perform a pre-write unwritten extent conversion on an S_DAX file.  Since
we're not actually allocating any space, the logic goes, we only need
enough reservation to handle shape changes in the bmbt.

The same logic should have been applied to quota -- we're not allocating
any space, so we only need to reserve enough quota to handle the bmbt
shape changes.

Fixes: 3b0fe47805 ("xfs: Don't use reserved blocks for data blocks with DAX")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:18:48 -08:00
Darrick J. Wong 1aecf3734a xfs: fix chown leaking delalloc quota blocks when fssetxattr fails
While refactoring the quota code to create a function to allocate inode
change transactions, I noticed that xfs_qm_vop_chown_reserve does more
than just make reservations: it also *modifies* the incore counts
directly to handle the owner id change for the delalloc blocks.

I then observed that the fssetxattr code continues validating input
arguments after making the quota reservation but before dirtying the
transaction.  If the routine decides to error out, it fails to undo the
accounting switch!  This leads to incorrect quota reservation and
failure down the line.

We can fix this by making the reservation function do only that -- for
the new dquot, it reserves ondisk and delalloc blocks to the
transaction, and the old dquot hangs on to its incore reservation for
now.  Once we actually switch the dquots, we can then update the incore
reservations because we've dirtied the transaction and it's too late to
turn back now.

No fixes tag because this has been broken since the start of git.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03 09:17:47 -08:00
Andreas Gruenbacher 15e20a301a gfs2: Use sb_start_intwrite in gfs2_ail_empty_gl
Commit 2e60d7683c ("GFS2: update freeze code to use freeze/thaw_super
on all nodes") optimized away the sb_start_intwrite ... sb_end_intwrite
protection for the on-stack transactions in gfs2_ail_empty_gl with no
explanation.  I can't think of a valid reason for doing that, so revert
that change.  This simplifies the next commit.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-02-03 16:33:16 +01:00
Vinicius Tinti c6c818e50d ext4: factor out htree rep invariant check
This patch moves some debugging code, used to validate the hash tree
node when doing a binary search of an htree node, into a separate
function, which is disabled by default (since it is only used by
developers when they are modifying the htree code paths).

In addition to cleaning up the code to make it more maintainable, it
silences a Clang compiler warning when -Wunreachable-code-aggressive
is enabled.  (There is no plan to enable this warning by default, since
it has far too many false positives; nevertheless, this commit
reduces one of the many false positives by one.)

Signed-off-by: Vinicius Tinti <viniciustinti@gmail.com>
Link: https://lore.kernel.org/r/20210202162837.129631-1-viniciustinti@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-03 00:34:02 -05:00
Daejun Park 96e7c02d0b ext4: Change list_for_each* to list_for_each_entry*
In fast_commit.c, list_for_each* + list_entry can be changed to
list_for_each_entry*. This reduces the number of variables and lines.
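
An illustrative conversion (generic names, not the actual fast_commit.c
hunk):

  /* before: separate cursor variable plus list_entry() */
  struct list_head *pos;
  struct fc_item *item;                 /* hypothetical entry type */

  list_for_each(pos, &head) {
          item = list_entry(pos, struct fc_item, list);
          process(item);
  }

  /* after: the iterator yields the entry directly */
  list_for_each_entry(item, &head, list)
          process(item);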

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Link: https://lore.kernel.org/r/20210111013726epcms2p4579ae56040d7043db785bf0d0a785dc7@epcms2p4
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-03 00:05:01 -05:00
Theodore Ts'o 027f14f535 ext4: don't try to process freed blocks until mballoc is initialized
If we try to make any changes via the journal after the journal is
initialized but before the multi-block allocator is initialized, we
will end up dereferencing a NULL pointer when the journal commit
callback function calls ext4_process_freed_data().

The proximate cause of this failure was commit 2d01ddc866 ("ext4:
save error info to sb through journal if available") since file system
corruption problems detected before the call to ext4_mb_init() would
result in a journal commit before we aborted the mount of the file
system.... and we would then trigger the NULL pointer deref.

Link: https://lore.kernel.org/r/YAm8qH/0oo2ofSMR@mit.edu
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-02 23:16:01 -05:00
Zheng Yongjun 59ebc7fd74 ext4: use DEFINE_MUTEX() for mutex lock
A mutex lock can be initialized statically with DEFINE_MUTEX()
rather than by explicitly calling mutex_init().
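
For example (a generic illustration, not an actual ext4 hunk):

  /* before: runtime initialization somewhere in init code */
  static struct mutex example_lock;     /* hypothetical lock name */
  mutex_init(&example_lock);

  /* after: statically initialized at the definition, no mutex_init() */
  static DEFINE_MUTEX(example_lock);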

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Link: https://lore.kernel.org/r/20201224132244.30907-1-zhengyongjun3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-02-02 23:16:01 -05:00
Linus Torvalds a992562872 Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
trees.
 
Merge tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
  trees.

  Current release - regressions:

   - ip_tunnel: fix mtu calculation

   - mlx5: fix function calculation for page trees

  Previous releases - regressions:

   - vsock: fix the race conditions in multi-transport support

   - neighbour: prevent a dead entry from updating gc_list

   - dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add

  Previous releases - always broken:

   - bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
     cgroup getsockopt infra when user space is trying to race against
     optlen, from Loris Reiff.

   - bpf: add missing fput() in BPF inode storage map update helper

   - udp: ipv4: manipulate network header of NATed UDP GRO fraglist

   - mac80211: fix station rate table updates on assoc

   - r8169: work around RTL8125 UDP HW bug

   - igc: report speed and duplex as unknown when device is runtime
     suspended

   - rxrpc: fix deadlock around release of dst cached on udp tunnel"

* tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits)
  net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
  net: ipa: fix two format specifier errors
  net: ipa: use the right accessor in ipa_endpoint_status_skip()
  net: ipa: be explicit about endianness
  net: ipa: add a missing __iomem attribute
  net: ipa: pass correct dma_handle to dma_free_coherent()
  r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
  net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
  net: mvpp2: TCAM entry enable should be written after SRAM data
  net: lapb: Copy the skb before sending a packet
  net/mlx5e: Release skb in case of failure in tc update skb
  net/mlx5e: Update max_opened_tc also when channels are closed
  net/mlx5: Fix leak upon failure of rule creation
  net/mlx5: Fix function calculation for page trees
  docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc
  udp: ipv4: manipulate network header of NATed UDP GRO fraglist
  net: ip_tunnel: fix mtu calculation
  vsock: fix the race conditions in multi-transport support
  net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
  ibmvnic: device remove has higher precedence over reset
  ...
2021-02-02 10:26:09 -08:00
Chao Yu c8e43d55b1 f2fs: relocate inline conversion from mmap() to mkwrite()
If there is a page fault only for the read case on an inline inode, we
don't need to convert the inline inode; instead, do the conversion in
the write (mkwrite) case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-02 09:14:27 -08:00
Dehe Gu 39f71b7e40 f2fs: fix a wrong condition in __submit_bio
We should use !F2FS_IO_ALIGNED() as the condition for submitting the bio directly.

Fixes: 8223ecc456 ("f2fs: fix to add missing F2FS_IO_ALIGNED() condition")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Dehe Gu <gudehe@huawei.com>
Signed-off-by: Ge Qiu <qiuge@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-02 09:14:27 -08:00
Gustavo A. R. Silva 8d8d1dbefc smb3: Fix out-of-bounds bug in SMB2_negotiate()
While addressing some warnings generated by -Warray-bounds, I found this
bug that was introduced back in 2017:

  CC [M]  fs/cifs/smb2pdu.o
fs/cifs/smb2pdu.c: In function ‘SMB2_negotiate’:
fs/cifs/smb2pdu.c:822:16: warning: array subscript 1 is above array bounds
of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds]
  822 |   req->Dialects[1] = cpu_to_le16(SMB30_PROT_ID);
      |   ~~~~~~~~~~~~~^~~
fs/cifs/smb2pdu.c:823:16: warning: array subscript 2 is above array bounds
of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds]
  823 |   req->Dialects[2] = cpu_to_le16(SMB302_PROT_ID);
      |   ~~~~~~~~~~~~~^~~
fs/cifs/smb2pdu.c:824:16: warning: array subscript 3 is above array bounds
of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds]
  824 |   req->Dialects[3] = cpu_to_le16(SMB311_PROT_ID);
      |   ~~~~~~~~~~~~~^~~
fs/cifs/smb2pdu.c:816:16: warning: array subscript 1 is above array bounds
of ‘__le16[1]’ {aka ‘short unsigned int[1]’} [-Warray-bounds]
  816 |   req->Dialects[1] = cpu_to_le16(SMB302_PROT_ID);
      |   ~~~~~~~~~~~~~^~~

At the time, the size of array _Dialects_ was changed from 1 to 3 in struct
validate_negotiate_info_req, and then in 2019 it was changed from 3 to 4,
but those changes were never made in struct smb2_negotiate_req, which has
led to a three-and-a-half-year-old out-of-bounds bug in function
SMB2_negotiate() (fs/cifs/smb2pdu.c).

Fix this by increasing the size of array _Dialects_ in struct
smb2_negotiate_req to 4.
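
That is, schematically (only the affected member is shown):

  struct smb2_negotiate_req {
          /* ... */
          __le16 Dialects[4];   /* was [1]; room for 2.1, 3.0, 3.02, 3.1.1 */
  } __packed;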

Fixes: 9764c02fcb ("SMB3: Add support for multidialect negotiate (SMB2.1 and later)")
Fixes: d5c7076b77 ("smb3: add smb3.1.1 to default dialect list")
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-02-01 22:43:39 -06:00
Liu Song 2e0cd472a0 f2fs: remove unnecessary initialization in xattr.c
These variables will be explicitly assigned before use,
so there is no need to initialize them.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-01 14:32:51 -08:00
Yi Chen 25fb04dbce f2fs: fix to avoid inconsistent quota data
Occasionally, quota data may be corrupted detected by fsck:

Info: checkpoint state = 45 :  crc compacted_summary unmount
[QUOTA WARNING] Usage inconsistent for ID 0:actual (1543036928, 762) != expected (1543032832, 762)
[ASSERT] (fsck_chk_quota_files:1986)  --> Quota file is missing or invalid quota file content found.
[QUOTA WARNING] Usage inconsistent for ID 0:actual (1352478720, 344) != expected (1352474624, 344)
[ASSERT] (fsck_chk_quota_files:1986)  --> Quota file is missing or invalid quota file content found.

[FSCK] Unreachable nat entries                        [Ok..] [0x0]
[FSCK] SIT valid block bitmap checking                [Ok..]
[FSCK] Hard link checking for regular file            [Ok..] [0x0]
[FSCK] valid_block_count matching with CP             [Ok..] [0xdf299]
[FSCK] valid_node_count matcing with CP (de lookup)   [Ok..] [0x2b01]
[FSCK] valid_node_count matcing with CP (nat lookup)  [Ok..] [0x2b01]
[FSCK] valid_inode_count matched with CP              [Ok..] [0x2665]
[FSCK] free segment_count matched with CP             [Ok..] [0xcb04]
[FSCK] next block offset is free                      [Ok..]
[FSCK] fixing SIT types
[FSCK] other corrupted bugs                           [Fail]

The root cause is:
If we open a file with the read-only flag, disk quota info won't be
initialized for this file; however, a following mmap() will force
conversion of the inline inode via f2fs_convert_inline_inode(), which may
increase block usage for this inode without updating the quota data,
causing inconsistent disk quota info.

The issue will happen in following stack:
open(file, O_RDONLY)
mmap(file)
- f2fs_convert_inline_inode
 - f2fs_convert_inline_page
  - f2fs_reserve_block
   - f2fs_reserve_new_block
    - f2fs_reserve_new_blocks
     - f2fs_i_blocks_write
      - dquot_claim_block
inode->i_blocks increases, but dqb_curspace keeps its old size because
the dquots for the inode are NULL.

To fix this issue, let's call dquot_initialize() anyway in both
f2fs_truncate() and f2fs_convert_inline_inode() to avoid this potential
inconsistent quota data issue.
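
A sketch of the shape of the fix (only the added call is shown):

  /* at the top of f2fs_convert_inline_inode() and f2fs_truncate() */
  err = dquot_initialize(inode);
  if (err)
          return err;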

Fixes: 0abd675e97 ("f2fs: support plain user/group quota")
Signed-off-by: Daiyue Zhang <zhangdaiyue1@huawei.com>
Signed-off-by: Dehe Gu <gudehe@huawei.com>
Signed-off-by: Junchao Jiang <jiangjunchao1@huawei.com>
Signed-off-by: Ge Qiu <qiuge@huawei.com>
Signed-off-by: Yi Chen <chenyi77@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-01 14:32:51 -08:00
Jaegeuk Kim b0ff4fe746 f2fs: flush data when enabling checkpoint back
During the checkpoint=disable period, f2fs bypasses all the synchronous IOs such as
sync and fsync. So, when enabling it back, we must flush all of them in order
to keep the data persistent. Otherwise, a sudden power-cut right after enabling
checkpoint will cause data loss.

Fixes: 4354994f09 ("f2fs: checkpoint disabling")
Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-01 14:32:50 -08:00
Pavel Begunkov 57cd657b82 io_uring: simplify do_read return parsing
do_read() returning 0 bytes read (not -EAGAIN/etc.) is not an important
enough case to prioritise. Fold it into the ret < 0 check, so we get
rid of an extra if and make it a bit more readable.
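
That is, roughly (labels are illustrative):

  /* before: a separate branch prioritising the 0-bytes-read case */
  if (!ret)
          goto done;
  else if (ret < 0)
          goto err;

  /* after: fold both into one check going down the same path */
  if (ret <= 0)
          goto err;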

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Pavel Begunkov ce3d5aae33 io_uring: deduplicate adding to REQ_F_INFLIGHT
We don't know for how long REQ_F_INFLIGHT is going to stay, so it's
cleaner to extract a helper for marking requests as such.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Pavel Begunkov e86d004729 io_uring: remove work flags after cleanup
Shouldn't be a problem now, but it's better to clear
REQ_F_WORK_INITIALIZED and work->flags only after the relevant resources
are killed, so cancellation sees them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Pavel Begunkov 34e08fed2c io_uring: inline io_req_drop_files()
req->files now has the same lifetime as all other io-wq work resources,
so inline io_req_drop_files() for consistency. Moreover, since
REQ_F_INFLIGHT is no longer files-specific, the function name has become
very confusing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Pavel Begunkov ba13e23f37 io_uring: kill not used needs_file_no_error
We have no request types left using needs_file_no_error, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Pavel Begunkov 9ae1f8dd37 io_uring: fix inconsistent lock state
WARNING: inconsistent lock state

inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes:
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&fs->lock);
  <Interrupt>
    lock(&fs->lock);

 *** DEADLOCK ***

1 lock held by syz-executor217/8450:
 #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442

stack backtrace:
CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
[...]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:354 [inline]
 io_req_clean_work fs/io_uring.c:1398 [inline]
 io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
 __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046
 io_free_req fs/io_uring.c:2269 [inline]
 io_double_put_req fs/io_uring.c:2392 [inline]
 io_put_req+0xf9/0x570 fs/io_uring.c:2388
 io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497
 __run_hrtimer kernel/time/hrtimer.c:1519 [inline]
 __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583
 hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline]
 __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102
 asm_call_irq_on_stack+0xf/0x20
 </IRQ>
 __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline]
 run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline]
 sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199
 spin_unlock_irq include/linux/spinlock.h:404 [inline]
 io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525
 __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594
 io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639
 io_queue_link_head fs/io_uring.c:6650 [inline]
 io_submit_sqe fs/io_uring.c:6697 [inline]
 io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960
 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Don't free requests from under hrtimer context (softirq) as it may sleep
or take spinlocks improperly (e.g. non-irq versions).

Cc: stable@vger.kernel.org # 5.6+
Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Dan Carpenter 13770a71ed io_uring: Fix NULL dereference in error in io_sqe_files_register()
If we hit a "goto out_free;" before the "ctx->file_data" pointer has
been assigned, it leads to a NULL dereference when we call:
	free_fixed_rsrc_data(ctx->file_data);

We can fix this by moving the assignment earlier.
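
Schematically (the allocator name is hypothetical; free_fixed_rsrc_data()
is the call quoted above):

  file_data = alloc_fixed_rsrc_data(ctx);
  if (!file_data)
          return -ENOMEM;
  ctx->file_data = file_data;   /* assigned before any "goto out_free" */

  /* ... later error paths jump here ... */
  out_free:
          free_fixed_rsrc_data(ctx->file_data);   /* no longer NULL */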

Fixes: 1ad555c6ae ("io_uring: create common fixed_rsrc_data allocation routines")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 11:56:41 -07:00
Dave Chinner ed1128c2d0 xfs: reduce exclusive locking on unaligned dio
Attempt shared locking for unaligned DIO, but only if the
underlying extent is already allocated and in written state. On
failure, retry with the existing exclusive locking.
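
A rough sketch of the strategy (the extent-state check is hypothetical;
the lock calls are existing xfs interfaces):

  xfs_ilock(ip, XFS_IOLOCK_SHARED);
  if (!dio_over_written_extent(ip, offset, count)) {
          /* not safe to proceed shared: retry under the exclusive lock */
          xfs_iunlock(ip, XFS_IOLOCK_SHARED);
          xfs_ilock(ip, XFS_IOLOCK_EXCL);
  }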

Test case is fio randrw of 512 byte IOs using AIO and an iodepth of
32 IOs.

Vanilla:

  READ: bw=4560KiB/s (4670kB/s), 4560KiB/s-4560KiB/s (4670kB/s-4670kB/s), io=134MiB (140MB), run=30001-30001msec
  WRITE: bw=4567KiB/s (4676kB/s), 4567KiB/s-4567KiB/s (4676kB/s-4676kB/s), io=134MiB (140MB), run=30001-30001msec

Patched:
   READ: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1127MiB (1182MB), run=30002-30002msec
  WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1128MiB (1183MB), run=30002-30002msec

That's an improvement from ~18k IOPS to ~150k IOPS, which is
about the IOPS limit of the VM block device setup I'm testing on.

4kB block IO comparison:

   READ: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8868MiB (9299MB), run=30002-30002msec
  WRITE: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8878MiB (9309MB), run=30002-30002msec

Which is ~150k IOPS, same as what the test gets for sub-block
AIO+DIO writes with this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: rebased, split unaligned from nowait]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01 09:47:19 -08:00
Dave Chinner caa89dbc43 xfs: split the unaligned DIO write code out
The unaligned DIO write path is more convoluted than the normal path,
and we are about to make it more complex. Keep the block-aligned
fast path DIO write code trim and simple by splitting out the
unaligned DIO code from it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: rebased, fixed a few minor nits]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01 09:47:19 -08:00
Christoph Hellwig 896f72d067 xfs: improve the reflink_bounce_dio_write tracepoint
Use a more suitable event class.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-01 09:47:19 -08:00