WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Logan Gunthorpe	b13015af94	md/raid5-cache: Clear conf->log after finishing work A NULL pointer dereferlence on conf->log is seen randomly with the mdadm test 21raid5cache. Kasan reporst: BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140 Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086 Call Trace: dump_stack_lvl+0x5a/0x74 kasan_report.cold+0x5f/0x1a9 __asan_load8+0x69/0x90 r5l_reclaimable_space+0xf5/0x140 r5l_do_reclaim+0xf4/0x5e0 r5l_reclaim_thread+0x69/0x3b0 md_thread+0x1a2/0x2c0 kthread+0x177/0x1b0 ret_from_fork+0x22/0x30 This is caused by conf->log being cleared in r5l_exit_log() before stopping the reclaim thread. To fix this, clear conf->log after the reclaim_thread is unregistered and after flushing disable_writeback_work. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Logan Gunthorpe	7769085c8d	md/raid5-cache: Drop RCU usage of conf->log The only place that uses RCU to access conf->log is in r5l_log_disk_error(). This function is mostly used in the IO path and once with mddev_lock() held in raid5_change_consistency_policy(). It is known that the IO will be suspended before the log is freed and r5l_log_exit() is called with the mddev_lock() held. This should mean that conf->log can not be freed while the function is being called, so the RCU protection is not necessary. Drop the rcu_read_lock() as well as the synchronize_rcu() and rcu_assign_pointer() usage. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Logan Gunthorpe	78ede6a06f	md/raid5-cache: Take mddev_lock in r5c_journal_mode_show() The mddev->lock spinlock doesn't protect against the removal of conf->log in r5l_exit_log() so conf->log may be freed before it is used. To fix this, take the mddev_lock() insteaad of the mddev->lock spinlock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Logan Gunthorpe	c629f345b4	md/raid5: suspend the array for calls to log_exit() The raid5-cache code relies on there being no IO in flight when log_exit() is called. There are two places where this is not guaranteed so add mddev_suspend() and mddev_resume() calls to these sites. The site in raid5_change_consistency_policy() is in the error path, and another similar call site already has suspend/resume calls just below it; so it should be equally safe to make that change here. There is one remaining site in raid5_remove_disk() that we call log_exit() without suspending the array. Unfortunately, as the comment stated, we cannot call mddev_suspend from raid5d. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Logan Gunthorpe	e0fccdafc2	md/raid5-ppl: Drop unused argument from ppl_handle_flush_request() ppl_handle_flush_request() takes an struct r5log argument but doesn't use it. It has no buisiness taking this argument as it is only used by raid5-cache and has no way to derference it anyway. Remove the argument. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Logan Gunthorpe	ed0c6a5fbe	md/raid5-log: Drop extern decorators for function prototypes extern is not necessary and recommended against when defining prototype functions in headers. checkpatch.pl complains about these. So remove them. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Linus Torvalds	6991a564f5	hardening updates for v5.20-rc1 - Fix Sparse warnings with randomizd kstack (GONG, Ruiqi) - Replace uintptr_t with unsigned long in usercopy (Jason A. Donenfeld) - Fix Clang -Wforward warning in LKDTM (Justin Stitt) - Fix comment to correctly refer to STRICT_DEVMEM (Lukas Bulwahn) - Introduce dm-verity binding logic to LoadPin LSM (Matthias Kaehlcke) - Clean up warnings and overflow and KASAN tests (Kees Cook) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmLoEN4WHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJr19D/0fUvHaOui3+ePqKL1CEN2WOYxK Ed/HA0kM7VZnuakS2OoWbHYKurt9wImBkw0EuryNEP4nCBHy5OIyDOmWF7DjWntG 9agKLW5rRgbKe9STbGZpJ92WWosOcJkgkDVES1/NjWt7ujLiefzcZE85hj2Dt1aQ 6nF2LlkdGdtsa07hP5CR5bynQxAAxg1R1pLiJCgZRYn1SEFYtjcnBjUMrPUFJAi2 TNy6ijeG473Oj6V/JiIY88u41KG1fed22SymNj6aQVIjGpH7atn6/ooG076ydAyt QEibSyQP/CwkSbyiqVFOq4v4a+hKEB5j5F+iKZBrCnFWNvt8D3tizBYgm1NymNEZ VBZdg+UhcoVDwiMNzSaAGvt15Qv0INNkQm9PJoeUGSdXz0Yjf4ghOIeaQc3jm6Of tElawmPXxVwRZfNpf5tyPaZFphAPK5EAl35S5mdWinKbAO7Jpz9xqvoyZz9/kygR Kd4qyRPrl0YM8SBKFuYt5rFaYfw9wqF7ox7cMmwR+pbEHt7UDqDvkX2fBbpCyXza 5nJ9PDyvB5SqonIF57RiImXCLKXR6UMJgQvtDGsf+n4hpxL40Nga4pqWY82aQEhj SRdQlkFYhI/izIq1kMJNa8IoOONlJzV6i87D8iOW32bI3/SrykRUTvV3ohpMri7V UFkXzz8pfqfJ4k4zVA== =FRS3 -----END PGP SIGNATURE----- Merge tag 'hardening-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - Fix Sparse warnings with randomizd kstack (GONG, Ruiqi) - Replace uintptr_t with unsigned long in usercopy (Jason A. Donenfeld) - Fix Clang -Wforward warning in LKDTM (Justin Stitt) - Fix comment to correctly refer to STRICT_DEVMEM (Lukas Bulwahn) - Introduce dm-verity binding logic to LoadPin LSM (Matthias Kaehlcke) - Clean up warnings and overflow and KASAN tests (Kees Cook) * tag 'hardening-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: dm: verity-loadpin: Drop use of dm_table_get_num_targets() kasan: test: Silence GCC 12 warnings drivers: lkdtm: fix clang -Wformat warning x86: mm: refer to the intended config STRICT_DEVMEM in a comment dm: verity-loadpin: Use CONFIG_SECURITY_LOADPIN_VERITY for conditional compilation LoadPin: Enable loading from trusted dm-verity devices dm: Add verity helpers for LoadPin stack: Declare {randomize_,}kstack_offset to fix Sparse warnings lib: overflow: Do not define 64-bit tests on 32-bit MAINTAINERS: Add a general "kernel hardening" section usercopy: use unsigned long instead of uintptr_t	2022-08-02 14:38:59 -07:00
Linus Torvalds	8374cfe647	- Refactor DM core's mempool allocation so that it clearer by not being split acorss files. - Improve DM core's BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling. - Optimize DM core's more common bio splitting by eliminating the use of bio cloning with bio_split+bio_chain. Shift that cloning cost to the relatively unlikely dm_io requeue case that only occurs during error handling. Introduces dm_io_rewind() that will clone a bio that reflects the subset of the original bio that must be requeued. - Remove DM core's dm_table_get_num_targets() wrapper and audit all dm_table_get_target() callers. - Fix potential for OOM with DM writecache target by setting a default MAX_WRITEBACK_JOBS (set to 256MiB or 1/16 of total system memory, whichever is smaller). - Fix DM writecache target's stats that are reported through DM-specific table info. - Fix use-after-free crash in dm_sm_register_threshold_callback(). - Refine DM core's Persistent Reservation handling in preparation for broader work Mike Christie is doing to add compatibility with Microsoft Windows Failover Cluster. - Fix various KASAN reported bugs in the DM raid target. - Fix DM raid target crash due to md_handle_request() bio splitting that recurses to block core without properly initializing the bio's bi_dev. - Fix some code comment typos and fix some Documentation formatting. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmLoAAUACgkQxSPxCi2d A1rFUAf/RnLkzNPS1QJ1uVbiiW64zQUD2o5U8kAliOZxoXQ45U+CgL22a5VaT2R+ rI9+YSg/VX9YkdJwB6I+y7VRWjUCO3NfYOfQUX5NS8GfcPQQn2Zp+jy3t8VnjjXN 5+8Ylu5tX1+QSQiBB9JvIgiK71rSorT6H+KF6bZypL7te5kj39hDaRbmh40VOMOY QWfs8DmyJft9IHPG7Mku/rFJbdVJph3f31IoqUOoGTOKoDPDUDezhCVHi2vpFcV+ S+iCKkeXmP6eUdCnNzBreYPTcP9OtCvkZEPhFg4lh+jyhjyFq1o0B8G5cRo5jvgr GcN1GUT040sSqNXFquA4UU6RGvaxlg== =gBRJ -----END PGP SIGNATURE----- Merge tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Refactor DM core's mempool allocation so that it clearer by not being split acorss files. - Improve DM core's BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling. - Optimize DM core's more common bio splitting by eliminating the use of bio cloning with bio_split+bio_chain. Shift that cloning cost to the relatively unlikely dm_io requeue case that only occurs during error handling. Introduces dm_io_rewind() that will clone a bio that reflects the subset of the original bio that must be requeued. - Remove DM core's dm_table_get_num_targets() wrapper and audit all dm_table_get_target() callers. - Fix potential for OOM with DM writecache target by setting a default MAX_WRITEBACK_JOBS (set to 256MiB or 1/16 of total system memory, whichever is smaller). - Fix DM writecache target's stats that are reported through DM-specific table info. - Fix use-after-free crash in dm_sm_register_threshold_callback(). - Refine DM core's Persistent Reservation handling in preparation for broader work Mike Christie is doing to add compatibility with Microsoft Windows Failover Cluster. - Fix various KASAN reported bugs in the DM raid target. - Fix DM raid target crash due to md_handle_request() bio splitting that recurses to block core without properly initializing the bio's bi_dev. - Fix some code comment typos and fix some Documentation formatting. * tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits) dm: fix dm-raid crash if md_handle_request() splits bio dm raid: fix address sanitizer warning in raid_resume dm raid: fix address sanitizer warning in raid_status dm: Start pr_preempt from the same starting path dm: Fix PR release handling for non All Registrants dm: Start pr_reserve from the same starting path dm: Allow dm_call_pr to be used for path searches dm: return early from dm_pr_call() if DM device is suspended dm thin: fix use-after-free crash in dm_sm_register_threshold_callback dm writecache: count number of blocks discarded, not number of discard bios dm writecache: count number of blocks written, not number of write bios dm writecache: count number of blocks read, not number of read bios dm writecache: return void from functions dm kcopyd: use __GFP_HIGHMEM when allocating pages dm writecache: set a default MAX_WRITEBACK_JOBS Documentation: dm writecache: Render status list as list Documentation: dm writecache: add blank line before optional parameters dm snapshot: fix typo in snapshot_map() comment dm raid: remove redundant "the" in parse_raid_params() comment dm cache: fix typo in 2 comment blocks ...	2022-08-02 14:21:25 -07:00
Linus Torvalds	c013d0af81	for-5.20/block-2022-07-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2 ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6 M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH 5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR 4N1HEozvIw== =celk -----END PGP SIGNATURE----- Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Improve the type checking of request flags (Bart) - Ensure queue mapping for a single queues always picks the right queue (Bart) - Sanitize the io priority handling (Jan) - rq-qos race fix (Jinke) - Reserved tags handling improvements (John) - Separate memory alignment from file/disk offset aligment for O_DIRECT (Keith) - Add new ublk driver, userspace block driver using io_uring for communication with the userspace backend (Ming) - Use try_cmpxchg() to cleanup the code in various spots (Uros) - Finally remove bdevname() (Christoph) - Clean up the zoned device handling (Christoph) - Clean up independent access range support (Christoph) - Clean up and improve block sysfs handling (Christoph) - Clean up and improve teardown of block devices. This turns the usual two step process into something that is simpler to implement and handle in block drivers (Christoph) - Clean up chunk size handling (Christoph) - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu, Ming, Sebastian, Yang, Ying) * tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits) ublk_drv: fix double shift bug ublk_drv: make sure that correct flags(features) returned to userspace ublk_drv: fix error handling of ublk_add_dev ublk_drv: fix lockdep warning block: remove __blk_get_queue block: call blk_mq_exit_queue from disk_release for never added disks blk-mq: fix error handling in __blk_mq_alloc_disk ublk: defer disk allocation ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask ublk: fold __ublk_create_dev into ublk_ctrl_add_dev ublk: cleanup ublk_ctrl_uring_cmd ublk: simplify ublk_ch_open and ublk_ch_release ublk: remove the empty open and release block device operations ublk: remove UBLK_IO_F_PREFLUSH ublk: add a MAINTAINERS entry block: don't allow the same type rq_qos add more than once mmc: fix disk/queue leak in case of adding disk failure ublk_drv: fix an IS_ERR() vs NULL check ublk: remove UBLK_IO_F_INTEGRITY ublk_drv: remove unneeded semicolon ...	2022-08-02 13:46:35 -07:00
Matthias Kaehlcke	27603a606f	dm: verity-loadpin: Drop use of dm_table_get_num_targets() Commit `2aec377a29` ("dm table: remove dm_table_get_num_targets() wrapper") in linux-dm/for-next removed the function dm_table_get_num_targets() which is used by verity-loadpin. Access table->num_targets directly instead of using the defunct wrapper. Fixes: `b6c1c5745c` ("dm: Add verity helpers for LoadPin") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220728085412.1.I242d21b378410eb6f9897a3160efb56e5608c59d@changeid	2022-07-28 21:48:12 -07:00
Nathan Huckleberry	b32d45824a	dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag Add an optional flag that ensures dm_bufio_client does not sleep (primary focus is to service dm_bufio_get without sleeping). This allows the dm-bufio cache to be queried from interrupt context. To ensure that dm-bufio does not sleep, dm-bufio must use a spinlock instead of a mutex. Additionally, to avoid deadlocks, special care must be taken so that dm-bufio does not sleep while holding the spinlock. But again: the scope of this no_sleep is initially confined to dm_bufio_get, so __alloc_buffer_wait_no_callback is _not_ changed to avoid sleeping because __bufio_new avoids allocation for NF_GET. Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:46:14 -04:00
Nathan Huckleberry	0fcb100d50	dm bufio: Add flags argument to dm_bufio_client_create Add a flags argument to dm_bufio_client_create and update all the callers. This is in preparation to add the DM_BUFIO_NO_SLEEP flag. Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:46:14 -04:00
Mike Snitzer	9dd1cd3220	dm: fix dm-raid crash if md_handle_request() splits bio Commit `ca522482e3` ("dm: pass NULL bdev to bio_alloc_clone") introduced the optimization to _not_ perform bio_associate_blkg()'s relatively costly work when DM core clones its bio. But in doing so it exposed the possibility for DM's cloned bio to alter DM target behavior (e.g. crash) if a target were to issue IO without first calling bio_set_dev(). The DM raid target can trigger an MD crash due to its need to split the DM bio that is passed to md_handle_request(). The split will recurse to submit_bio_noacct() using a bio with an uninitialized ->bi_blkg. This NULL bio->bi_blkg causes blk_throtl_bio() to dereference a NULL blkg_to_tg(bio->bi_blkg). Fix this in DM core by adding a new 'needs_bio_set_dev' target flag that will make alloc_tio() call bio_set_dev() on behalf of the target. dm-raid is the only target that requires this flag. bio_set_dev() initializes the DM cloned bio's ->bi_blkg, using bio_associate_blkg, before passing the bio to md_handle_request(). Long-term fix would be to audit and refactor MD code to rely on DM to split its bio, using dm_accept_partial_bio(), but there are MD raid personalities (e.g. raid1 and raid10) whose implementation are tightly coupled to handling the bio splitting inline. Fixes: `ca522482e3` ("dm: pass NULL bdev to bio_alloc_clone") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:36:30 -04:00
Mikulas Patocka	7dad24db59	dm raid: fix address sanitizer warning in raid_resume There is a KASAN warning in raid_resume when running the lvm test lvconvert-raid.sh. The reason for the warning is that mddev->raid_disks is greater than rs->raid_disks, so the loop touches one entry beyond the allocated length. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mikulas Patocka	1fbeea217d	dm raid: fix address sanitizer warning in raid_status There is this warning when using a kernel with the address sanitizer and running this testsuite: https://gitlab.com/cki-project/kernel-tests/-/tree/main/storage/swraid/scsi_raid ================================================================== BUG: KASAN: slab-out-of-bounds in raid_status+0x1747/0x2820 [dm_raid] Read of size 4 at addr ffff888079d2c7e8 by task lvcreate/13319 CPU: 0 PID: 13319 Comm: lvcreate Not tainted 5.18.0-0.rc3.<snip> #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0x6a/0x9c print_address_description.constprop.0+0x1f/0x1e0 print_report.cold+0x55/0x244 kasan_report+0xc9/0x100 raid_status+0x1747/0x2820 [dm_raid] dm_ima_measure_on_table_load+0x4b8/0xca0 [dm_mod] table_load+0x35c/0x630 [dm_mod] ctl_ioctl+0x411/0x630 [dm_mod] dm_ctl_ioctl+0xa/0x10 [dm_mod] __x64_sys_ioctl+0x12a/0x1a0 do_syscall_64+0x5b/0x80 The warning is caused by reading conf->max_nr_stripes in raid_status. The code in raid_status reads mddev->private, casts it to struct r5conf and reads the entry max_nr_stripes. However, if we have different raid type than 4/5/6, mddev->private doesn't point to struct r5conf; it may point to struct r0conf, struct r1conf, struct r10conf or struct mpconf. If we cast a pointer to one of these structs to struct r5conf, we will be reading invalid memory and KASAN warns about it. Fix this bug by reading struct r5conf only if raid type is 4, 5 or 6. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mike Christie	c6adada5b7	dm: Start pr_preempt from the same starting path pr_preempt has a similar issue as reserve where for all the reservation types except the All Registrants ones the preempt can create a reservation. And a follow up reservation or release needs to go down the same path the preempt did. This has the pr_preempt work like reserve and release where we always start from the first path in the first group. This commit has been tested with windows failover clustering's validation test and libiscsi's PGR tests to check for regressions. They both don't have tests to verify this case, so I tested it manually. Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mike Christie	08a3c338e0	dm: Fix PR release handling for non All Registrants This commit fixes a bug where we are leaving the reservation in place even though pr_release has run and returned success. If we have a Write Exclusive, Exclusive Access, or Write/Exclusive Registrants only reservation, the release must be sent down the path that is the reservation holder. The problem is multipath_prepare_ioctl most likely selected path N for the reservation, then later when we do the release multipath_prepare_ioctl will select a completely different path. The device will then return success becuase the nvme and scsi specs say to return success if there is no reservation or if the release is sent down from a path that is not the holder. We then think we have released the reservation. This commit has us loop over each path and send a release so we can make sure the release is executed on the correct path. It has been tested with windows failover clustering's validation test which checks this case, and it has been tested manually (the libiscsi PGR tests don't have a test case for this yet, but I will be adding one). Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mike Christie	7015108759	dm: Start pr_reserve from the same starting path When an app does a pr_reserve it will go to whatever path we happen to be using at the time. This can result in errors when the app does a second pr_reserve call and expects success but gets a failure because the reserve is not done on the holder's path. This commit has us always start trying to do reserves from the first path in the first group. Windows failover clustering will produce the type of pattern above. With this commit, we will now pass its validation test for this case. Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mike Christie	8dd87f3c52	dm: Allow dm_call_pr to be used for path searches The specs state that if you send a reserve down a path that is already the holder success must be returned and if it goes down a path that is not the holder reservation conflict must be returned. Windows failover clustering will send a second reservation and expects that a device returns success. The problem for multipathing is that for an All Registrants reservation, we can send the reserve down any path but for all other reservation types there is one path that is the holder. To handle this we could add PR state to dm but that can get nasty. Look at target_core_pr.c for an example of the type of things we'd have to track. It will also get more complicated because other initiators can change the state so we will have to add in async event/sense handling. This commit, and the 3 commits that follow, tries to keep dm simple and keep just doing passthrough. This commit modifies dm_call_pr to be able to find the first usable path that can execute our pr_op then return. When dm_pr_reserve is converted to dm_call_pr in the next commit for the normal case we will use the same path for every reserve. Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mike Snitzer	e120a5f1e7	dm: return early from dm_pr_call() if DM device is suspended Otherwise PR ops may be issued while the broader DM device is being reconfigured, etc. Fixes: `9c72bad1f3` ("dm: call PR reserve/unreserve on each underlying device") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Linus Torvalds	d945404f74	block-5.19-2022-07-21 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLaEtsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjuZD/4joqo0PVz+O5MUadOmFxQ+Js2RCzxzDYwN rfTlzsS2ybcoG2ehs3TPGU2c3Q+vODcgFOLJtrKPsyxEZM49uuu7+romi9/jxW4P Pju8DmA0037elEGmml4i3EiMoIGyz1ScXFpR2Ii+LLTD6y7hNxHYxC/6PgWWaO59 gIydFz9PenJmz+orlpRI08vkvifdaCCdFfqgnP+KtvJsee2FPUdg2xpA021a1Tzj 7u1qPS1JnylQP1VQypqXnZTiHzmzmkQL4GNQIo/pje/RBstTFb3HOPXEoeVvQKrl WjNntc80LZvYCMBzcksDkGAT8YLxXOckKvDiaYGObIoEoh/K6E7CBoIo4Smu3mwr NqoaN3jZXnfTtZ40VEnxI6HnIbXG8Jc96P0/7Us3lL/W/wP0fD4fqNlumbkeMW2d ualU91J6+YTGrYYWClSTSNyWQF/MNr1XRL9LQiO4GDs0fkIL1WoxaY/+AnUp0Jcc PF8tm83852PcP+YJOOYgXD04gBmg2JCOrdarE87x+fIKp2Z7V4HPeR/hH5iOu9F3 CbXaBTICJlo7jmVUMKMh5yXvTyx4TqDvpWGbpQweXG4t8c77TPMI2cIHVdj1yqgj L015BORXfsWnI0yX4IA32RqcoNwZHrrb930OxGDvUlxTrjfodlaLypYJ1wDKwRMi H8LTxE6igA== =zviX -----END PGP SIGNATURE----- Merge tag 'block-5.19-2022-07-21' of git://git.kernel.dk/linux-block Pull block fix from Jens Axboe: "Just a single fix for missing error propagation for an allocation failure in raid5" * tag 'block-5.19-2022-07-21' of git://git.kernel.dk/linux-block: md/raid5: missing error code in setup_conf()	2022-07-22 12:41:14 -07:00
Dan Carpenter	5f7ef4875f	md/raid5: missing error code in setup_conf() Return -ENOMEM if the allocation fails. Don't return success. Fixes: `8fbcba6b99` ("md/raid5: Cleanup setup_conf() error returns") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-07-19 10:58:33 -07:00
Shiyang Ruan	8012b86608	dax: introduce holder for dax_device Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2. The patchset fsdax-rmap is aimed to support shared pages tracking for fsdax. It moves owner tracking from dax_assocaite_entry() to pmem device driver, by introducing an interface ->memory_failure() for struct pagemap. This interface is called by memory_failure() in mm, and implemented by pmem device. Then call holder operations to find the filesystem which the corrupted data located in, and call filesystem handler to track files or metadata associated with this page. Finally we are able to try to fix the corrupted data in filesystem and do other necessary processing, such as killing processes who are using the files affected. The call trace is like this: memory_failure() \|* fsdax case \|------------ \|pgmap->ops->memory_failure() => pmem_pgmap_memory_failure() \| dax_holder_notify_failure() => \| dax_device->holder_ops->notify_failure() => \| - xfs_dax_notify_failure() \| \|* xfs_dax_notify_failure() \| \|-------------------------- \| \| xfs_rmap_query_range() \| \| xfs_dax_failure_fn() \| \| * corrupted on metadata \| \| try to recover data, call xfs_force_shutdown() \| \| * corrupted on file data \| \| try to recover data, call mf_dax_kill_procs() \|* normal case \|------------- \|mf_generic_kill_procs() The patchset fsdax-reflink attempts to add CoW support for fsdax, and takes XFS, which has both reflink and fsdax features, as an example. One of the key mechanisms needed to be implemented in fsdax is CoW. Copy the data from srcmap before we actually write data to the destination iomap. And we just copy range in which data won't be changed. Another mechanism is range comparison. In page cache case, readpage() is used to load data on disk to page cache in order to be able to compare data. In fsdax case, readpage() does not work. So, we need another compare data with direct access support. With the two mechanisms implemented in fsdax, we are able to make reflink and fsdax work together in XFS. This patch (of 14): To easily track filesystem from a pmem device, we introduce a holder for dax_device structure, and also its operation. This holder is used to remember who is using this dax_device: - When it is the backend of a filesystem, the holder will be the instance of this filesystem. - When this pmem device is one of the targets in a mapped device, the holder will be this mapped device. In this case, the mapped device has its own dax_device and it will follow the first rule. So that we can finally track to the filesystem we needed. The holder and holder_ops will be set when filesystem is being mounted, or an target device is being activated. Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.wiliams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:30 -07:00
Luo Meng	3534e5a5ed	dm thin: fix use-after-free crash in dm_sm_register_threshold_callback Fault inject on pool metadata device reports: BUG: KASAN: use-after-free in dm_pool_register_metadata_threshold+0x40/0x80 Read of size 8 at addr ffff8881b9d50068 by task dmsetup/950 CPU: 7 PID: 950 Comm: dmsetup Tainted: G W 5.19.0-rc6 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_address_description.constprop.0.cold+0xeb/0x3f4 kasan_report.cold+0xe6/0x147 dm_pool_register_metadata_threshold+0x40/0x80 pool_ctr+0xa0a/0x1150 dm_table_add_target+0x2c8/0x640 table_load+0x1fd/0x430 ctl_ioctl+0x2c4/0x5a0 dm_ctl_ioctl+0xa/0x10 __x64_sys_ioctl+0xb3/0xd0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This can be easily reproduced using: echo offline > /sys/block/sda/device/state dd if=/dev/zero of=/dev/mapper/thin bs=4k count=10 dmsetup load pool --table "0 20971520 thin-pool /dev/sda /dev/sdb 128 0 0" If a metadata commit fails, the transaction will be aborted and the metadata space maps will be destroyed. If a DM table reload then happens for this failed thin-pool, a use-after-free will occur in dm_sm_register_threshold_callback (called from dm_pool_register_metadata_threshold). Fix this by in dm_pool_register_metadata_threshold() by returning the -EINVAL error if the thin-pool is in fail mode. Also fail pool_ctr() with a new error message: "Error registering metadata threshold". Fixes: `ac8c3f3df6` ("dm thin: generate event when metadata threshold passed") Cc: stable@vger.kernel.org Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-15 18:09:14 -04:00
Mikulas Patocka	2ee73ef60d	dm writecache: count number of blocks discarded, not number of discard bios Change dm-writecache, so that it counts the number of blocks discarded instead of the number of discard bios. Make it consistent with the read and write statistics counters that were changed to count the number of blocks instead of bios. Fixes: `e3a35d0340` ("dm writecache: add event counters") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:54:46 -04:00
Mikulas Patocka	b2676e1482	dm writecache: count number of blocks written, not number of write bios Change dm-writecache, so that it counts the number of blocks written instead of the number of write bios. Bios can be split and requeued using the dm_accept_partial_bio function, so counting bios caused inaccurate results. Fixes: `e3a35d0340` ("dm writecache: add event counters") Reported-by: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:53:25 -04:00
Mikulas Patocka	2c6e755b49	dm writecache: count number of blocks read, not number of read bios Change dm-writecache, so that it counts the number of blocks read instead of the number of read bios. Bios can be split and requeued using the dm_accept_partial_bio function, so counting bios caused inaccurate results. Fixes: `e3a35d0340` ("dm writecache: add event counters") Reported-by: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:52:32 -04:00
Mikulas Patocka	9bc0c92e4b	dm writecache: return void from functions The functions writecache_map_remap_origin and writecache_bio_copy_ssd only return a single value, thus they can be made to return void. This helps simplify the following IO accounting changes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:52:31 -04:00
Mikulas Patocka	949d49ec30	dm kcopyd: use __GFP_HIGHMEM when allocating pages dm-kcopyd doesn't access the allocated pages directly, it only passes them to dm-io which adds them to a bio list - thus, we can allocate the pages from high memory. This will reduce pressure on the low memory when there are a large number of kcopyd jobs in progress. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:52:31 -04:00
Mikulas Patocka	ca7dc242e3	dm writecache: set a default MAX_WRITEBACK_JOBS dm-writecache has the capability to limit the number of writeback jobs in progress. However, this feature was off by default. As such there were some out-of-memory crashes observed when lowering the low watermark while the cache is full. This commit enables writeback limit by default. It is set to 256MiB or 1/16 of total system memory, whichever is smaller. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:52:31 -04:00
Bart Van Assche	1420c4a549	fs/buffer: Combine two submit_bh() and ll_rw_block() arguments Both submit_bh() and ll_rw_block() accept a request operation type and request flags as their first two arguments. Micro-optimize these two functions by combining these first two arguments into a single argument. This patch does not change the behavior of any of the modified code. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Acked-by: Song Liu <song@kernel.org> (for the md changes) Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-48-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:32 -06:00
Bart Van Assche	a9010741ce	md/raid5: Use the enum req_op and blk_opf_t types Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for variables that represent request flags. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-37-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:32 -06:00
Bart Van Assche	cb1802ff82	md/raid10: Use the new blk_opf_t type Improve static type checking by using the new blk_opf_t type for variables that represent a request flags. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-36-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:32 -06:00
Bart Van Assche	3c5e514db5	md/raid1: Use the new blk_opf_t type Improve static type checking by using the new blk_opf_t type for variables that represent request flags. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-35-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	552eee3b53	md/bcache: Combine two prio_io() arguments Improve uniformity in the kernel of handling of request operation and flags by passing these as a single argument. Cc: Coly Li <colyli@suse.de> Cc: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-34-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	9a4fd6a22c	md/bcache: Combine two uuid_io() arguments Improve uniformity in the kernel of handling of request operation and flags by passing these as a single argument. Cc: Coly Li <colyli@suse.de> Cc: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-33-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	4ce4c73f66	md/core: Combine two sync_page_io() arguments Improve uniformity in the kernel of handling of request operation and flags by passing these as a single argument. Cc: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	13a1f650b6	dm/dm-zoned: Use the enum req_op type Improve static type checking by using the enum req_op type for arguments that represent a request operation. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-31-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	8a5a7ce877	dm/zone: Use the enum req_op type Use the enum req_op type for request operations instead of unsigned int. This patch fixes a sparse warning that has been introduced by making enum req_op __bitwise. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-30-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	6b99013957	dm-snap: Combine request operation type and flags Pass the request operation and its flags as a single argument to improve kernel code uniformity. Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-29-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	c1389b3333	dm mirror log: Use the new blk_opf_t type Improve static type checking by using the new blk_opf_t type for a function argument that represents a request operation type. Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-28-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	c9154a4cb8	dm/dm-integrity: Combine request operation and flags Combine the request operation type and request flags into a single argument. Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for variables that represent request flags. Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Eric Biggers <ebiggers@google.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-27-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	eff17e5161	dm/dm-flakey: Use the new blk_opf_t type Use the new blk_opf_t type for structure members that represent request flags. Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-26-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	67a7b9a5b5	dm/ebs: Change 'int rw' into 'enum req_op op' Improve static type checking by using type 'enum req_op' instead of 'int'. Make the role of the 'rw' arguments more clear by renaming these into 'op' (operation). This patch does not change any functionality since REQ_OP_READ = READ = 0 and REQ_OP_WRITE = WRITE = 1. Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Heinz Mauelshagen <heinzm@redhat.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-25-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	a3282b432f	dm/core: Combine request operation type and flags Improve kernel code uniformity by combining the request operation type and flags into a single variable. Change 'int rw' into 'enum req_op op' because the name 'op' is what is used in the block layer to hold a request type. Use the blk_opf_t and enum req_op types where appropriate to improve static type checking. Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-24-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	71f7113d20	dm/core: Rename kcopyd_job.rw into kcopyd.op The member name 'rw' suggests that this member either has the value 'READ' or 'WRITE' and no other values. Since that member also can have the value REQ_OP_WRITE_ZEROES, rename 'rw' into 'op'. This patch does not change any functionality since REQ_OP_READ = READ = 0 and REQ_OP_WRITE = WRITE = 1. Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-23-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	581075e4f6	dm/core: Reduce the size of struct dm_io_request Combine the bi_op and bi_op_flags into the bi_opf member. Use the new blk_opf_t type to improve static type checking. This patch does not change any functionality. Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-22-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:31 -06:00
Bart Van Assche	2d9b02be73	block: Change the type of req_op() and bio_op() into enum req_op Improve static type checking by changing the type of the value returned by req_op() and bio_op() from unsigned int into enum req_op. Insert 'default: break;' in switch statements on the enum req_op type to prevent that the compiler warns about these switch statements. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Tim Waugh <tim@cyberelk.net> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-5-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:30 -06:00
Bart Van Assche	ff07a02e9e	treewide: Rename enum req_opf into enum req_op The type name enum req_opf is misleading since it suggests that values of this type include both an operation type and flags. Since values of this type represent an operation only, change the type name into enum req_op. Convert the enum req_op documentation into kernel-doc format. Move a few definitions such that the enum req_op documentation occurs just above the enum req_op definition. The name "req_opf" was introduced by commit `ef295ecf09` ("block: better op and flags encoding"). Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:30 -06:00
Christoph Hellwig	900d156bac	block: remove bdevname Replace the remaining calls of bdevname with snprintf using the %pg format specifier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220713055317.1888500-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 10:27:56 -06:00
Matthias Kaehlcke	231af47090	dm: verity-loadpin: Use CONFIG_SECURITY_LOADPIN_VERITY for conditional compilation The verity glue for LoadPin is only needed when CONFIG_SECURITY_LOADPIN_VERITY is set, use this option for conditional compilation instead of the combo of CONFIG_DM_VERITY and CONFIG_SECURITY_LOADPIN. Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Acked-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/lkml/20220627083512.v7.3.I5aca2dcc3b06de4bf53696cd21329dce8272b8aa@changeid Signed-off-by: Kees Cook <keescook@chromium.org>	2022-07-08 10:47:07 -07:00
Matthias Kaehlcke	b6c1c5745c	dm: Add verity helpers for LoadPin LoadPin limits loading of kernel modules, firmware and certain other files to a 'pinned' file system (typically a read-only rootfs). To provide more flexibility LoadPin is being extended to also allow loading these files from trusted dm-verity devices. For that purpose LoadPin can be provided with a list of verity root digests that it should consider as trusted. Add a bunch of helpers to allow LoadPin to check whether a DM device is a trusted verity device. The new functions broadly fall in two categories: those that need access to verity internals (like the root digest), and the 'glue' between LoadPin and verity. The new file dm-verity-loadpin.c contains the glue functions. Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Acked-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/lkml/20220627083512.v7.1.I3e928575a23481121e73286874c4c2bdb403355d@changeid Signed-off-by: Kees Cook <keescook@chromium.org>	2022-07-08 10:46:46 -07:00
Zhang Jiaming	962c6296f0	dm snapshot: fix typo in snapshot_map() comment Signed-off-by: Zhang Jiaming <jiaming@nfschina.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 11:49:39 -04:00
Jiang Jian	ce92fc4b8b	dm raid: remove redundant "the" in parse_raid_params() comment Signed-off-by: Jiang Jian <jiangjian@cdjrlc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 11:49:38 -04:00
Steven Lung	5c29e78473	dm cache: fix typo in 2 comment blocks Replace neccessarily with necessarily. Signed-off-by: Steven Lung <1030steven@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 11:49:37 -04:00
JeongHyeon Lee	20e6fc8562	dm verity: fix checkpatch close brace error Resolves: ERROR: else should follow close brace '}' Signed-off-by: JeongHyeon Lee <jhs2.lee@samsung.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 11:49:36 -04:00
Mike Snitzer	899ab445a4	dm table: rename dm_target variable in dm_table_add_target() Rename from "tgt" to "ti" so that all of dm-table.c code uses the same naming for dm_target variables. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 11:49:35 -04:00
Mike Snitzer	564b5c5476	dm table: audit all dm_table_get_target() callers All callers of dm_table_get_target() are expected to do proper bounds checking on the index they pass. Move dm_table_get_target() to dm-core.h to make it extra clear that only DM core code should be using it. Switch it to be inlined while at it. Standardize all DM core callers to use the same for loop pattern and make associated variables as local as possible. Rename some variables (e.g. s/table/t/ and s/tgt/ti/) along the way. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 11:49:34 -04:00
Mike Snitzer	2aec377a29	dm table: remove dm_table_get_num_targets() wrapper More efficient and readable to just access table->num_targets directly. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 11:49:33 -04:00
Ming Lei	8b211aaccb	dm: add two stage requeue mechanism Commit `61b6e2e532` ("dm: fix BLK_STS_DM_REQUEUE handling when dm_io represents split bio") reverted DM core's bio splitting back to using bio_split()+bio_chain() because it was found that otherwise DM's BLK_STS_DM_REQUEUE would trigger a live-lock waiting for bio completion that would never occur. Restore using bio_trim()+bio_inc_remaining(), like was done in commit `7dd76d1fee` ("dm: improve bio splitting and associated IO accounting"), but this time with proper handling for the above scenario that is covered in more detail in the commit header for `61b6e2e532`. Solve this issue by adding a two staged dm_io requeue mechanism that uses the new dm_bio_rewind() via dm_io_rewind(): 1) requeue the dm_io into the requeue_list added to struct mapped_device, and schedule it via new added requeue work. This workqueue just clones the dm_io->orig_bio (which DM saves and ensures its end sector isn't modified). dm_io_rewind() uses the sectors and sectors_offset members of the dm_io that are recorded relative to the end of orig_bio: dm_bio_rewind()+bio_trim() are then used to make that cloned bio reflect the subset of the original bio that is represented by the dm_io that is being requeued. 2) the 2nd stage requeue is same with original requeue, but io->orig_bio points to new cloned bio (which matches the requeued dm_io as described above). This allows DM core to shift the need for bio cloning from bio-split time (during IO submission) to the less likely BLK_STS_DM_REQUEUE handling (after IO completes with that error). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 11:49:32 -04:00
Ming Lei	61cbe7888d	dm: add dm_bio_rewind() API to DM core Commit `7759eb23fd` ("block: remove bio_rewind_iter()") removed a similar API for the following reasons: ``` It is pointed that bio_rewind_iter() is one very bad API[1]: 1) bio size may not be restored after rewinding 2) it causes some bogus change, such as `5151842b9d` (block: reset bi_iter.bi_done after splitting bio) 3) rewinding really makes things complicated wrt. bio splitting 4) unnecessary updating of .bi_done in fast path [1] https://marc.info/?t=153549924200005&r=1&w=2 So this patch takes Kent's suggestion to restore one bio into its original state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(), given now bio_rewind_iter() is only used by bio integrity code. ``` However, saving off a copy of the 32 bytes bio->bi_iter in case rewind needed isn't efficient because it bloats per-bio-data for what is an unlikely case. That suggestion also ignores the need to restore crypto and integrity info. Add dm_bio_rewind() API for a specific use-case that is much more narrow than the previous more generic rewind code that was reverted: 1) most bios have a fixed end sector since bio split is done from front of the bio, if driver just records how many sectors between current bio's start sector and the original bio's end sector, the original position can be restored. Keeping the original bio's end sector fixed is a _hard_ requirement for this interface! 2) if a bio's end sector won't change (usually bio_trim() isn't called, or in the case of DM it preserves original bio), user can restore the original position by storing sector offset from the current ->bi_iter.bi_sector to bio's end sector; together with saving bio size, only 8 bytes is needed to restore to original bio. 3) DM's requeue use case: when BLK_STS_DM_REQUEUE happens, DM core needs to restore to an "original bio" which represents the current dm_io to be requeued (which may be a subset of the original bio). By storing the sector offset from the original bio's end sector and dm_io's size, dm_bio_rewind() can restore such original bio. See commit `7dd76d1fee` ("dm: improve bio splitting and associated IO accounting") for more details on how DM does this. Leveraging this, allows DM core to shift the need for bio cloning from bio-split time (during IO submission) to the less likely BLK_STS_DM_REQUEUE handling (after IO completes with that error). 4) Unlike the original rewind API, dm_bio_rewind() doesn't add .bi_done to bvec_iter and there is no effect on the fast path. Implement dm_bio_rewind() by factoring out clear helpers that it calls: dm_bio_integrity_rewind, dm_bio_crypt_rewind and dm_bio_rewind_iter. DM is able to ensure that dm_bio_rewind() is used safely but, given the constraint that the bio's end must never change, other hypothetical future callers may not take the same care. So make dm_bio_rewind() and all supporting code local to DM to avoid risk of hypothetical abuse. A "dm_" prefix was added to all functions to avoid any namespace collisions. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 11:26:55 -04:00
Christoph Hellwig	d86e716aa4	block: move zone related fields to struct gendisk Move the zone related fields that are currently stored in struct request_queue to struct gendisk as these are part of the highlevel block layer API and are only used for non-passthrough I/O that requires the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	de71973c29	block: remove blk_queue_zone_sectors Always use bdev_zone_sectors instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	fabed68c27	dm-zoned: cleanup dmz_fixup_devices Use the bdev based helpers where applicable and move the zoned_dev into the scope where it is actually used. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	b623e34732	block: replace blkdev_nr_zones with bdev_nr_zones Pass a block_device instead of a request_queue as that is what most callers have at hand. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:26 -06:00
Christoph Hellwig	edd1dbc83b	block: use bdev_is_zoned instead of open coding it Use bdev_is_zoned in all places where a block_device is available instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-06 06:46:25 -06:00
Roman Gushchin	e33c267ab7	mm: shrinkers: provide shrinkers with names Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:40 -07:00
Mikulas Patocka	617b365872	dm raid: fix KASAN warning in raid5_add_disks There's a KASAN warning in raid5_add_disk when running the LVM testsuite. The warning happens in the test lvconvert-raid-reshape-linear_to_raid6-single-type.sh. We fix the warning by verifying that rdev->saved_raid_disk is within limits. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-29 19:48:04 -04:00
Mikulas Patocka	1ebc2cec0b	dm raid: fix KASAN warning in raid5_remove_disk There's a KASAN warning in raid5_remove_disk when running the LVM testsuite. We fix this warning by verifying that the "number" variable is within limits. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-29 19:47:36 -04:00
Ming Lei	444fe04f7a	dm: improve BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling If either BLK_STS_DM_REQUEUE or BLK_STS_AGAIN is returned for POLLED io, we requeue the original bio into deferred list and kick md->wq to re-submit it to block layer. Improve the handling in the following way: 1) Factor out dm_handle_requeue() for handling dm_io requeue. 2) Unify handling for BLK_STS_DM_REQUEUE and BLK_STS_AGAIN: clear REQ_POLLED for BLK_STS_DM_REQUEUE too, for the sake of simplicity, given BLK_STS_DM_REQUEUE is very unusual. 3) Queue md->wq explicitly in dm_handle_requeue(), so requeue handling becomes more robust. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-29 14:17:16 -04:00
Christoph Hellwig	e810cb78bc	dm: refactor dm_md_mempool allocation The current split between dm_table_alloc_md_mempools and dm_alloc_md_mempools is rather arbitrary, so merge the two into one easy to follow function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-29 12:46:06 -04:00
Christoph Hellwig	4ed045d875	dm: unexport dm_get_reserved_rq_based_ios dm_get_reserved_rq_based_ios is only used in the core dm code, so remove the export. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-29 12:46:05 -04:00
Christoph Hellwig	8b9ab62662	block: remove blk_cleanup_disk blk_cleanup_disk is nothing but a trivial wrapper for put_disk now, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 06:33:15 -06:00
Heinz Mauelshagen	332bd07787	dm raid: fix accesses beyond end of raid member array On dm-raid table load (using raid_ctr), dm-raid allocates an array rs->devs[rs->raid_disks] for the raid device members. rs->raid_disks is defined by the number of raid metadata and image tupples passed into the target's constructor. In the case of RAID layout changes being requested, that number can be different from the current number of members for existing raid sets as defined in their superblocks. Example RAID layout changes include: - raid1 legs being added/removed - raid4/5/6/10 number of stripes changed (stripe reshaping) - takeover to higher raid level (e.g. raid5 -> raid6) When accessing array members, rs->raid_disks must be used in control loops instead of the potentially larger value in rs->md.raid_disks. Otherwise it will cause memory access beyond the end of the rs->devs array. Fix this by changing code that is prone to out-of-bounds access. Also fix validate_raid_redundancy() to validate all devices that are added. Also, use braces to help clean up raid_iterate_devices(). The out-of-bounds memory accesses was discovered using KASAN. This commit was verified to pass all LVM2 RAID tests (with KASAN enabled). Cc: stable@vger.kernel.org Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-27 19:31:47 -04:00
Christoph Hellwig	c39493222e	dm: open code blk_max_size_offset in max_io_len max_io_len always passes an explicitly non-zero chunk_sectors into blk_max_size_offset. That means much of blk_max_size_offset is not needed and can be open coded to simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20220614090934.570632-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:11 -06:00
Mikulas Patocka	90736eb323	dm mirror log: clear log bits up to BITS_PER_LONG boundary Commit `85e123c27d` ("dm mirror log: round up region bitmap size to BITS_PER_LONG") introduced a regression on 64-bit architectures in the lvm testsuite tests: lvcreate-mirror, mirror-names and vgsplit-operation. If the device is shrunk, we need to clear log bits beyond the end of the device. The code clears bits up to a 32-bit boundary and then calculates lc->sync_count by summing set bits up to a 64-bit boundary (the commit changed that; previously, this boundary was 32-bit too). So, it was using some non-zeroed bits in the calculation and this caused misbehavior. Fix this regression by clearing bits up to BITS_PER_LONG boundary. Fixes: `85e123c27d` ("dm mirror log: round up region bitmap size to BITS_PER_LONG") Cc: stable@vger.kernel.org Reported-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-23 14:55:43 -04:00
Ming Lei	61b6e2e532	dm: fix BLK_STS_DM_REQUEUE handling when dm_io represents split bio Commit `7dd76d1fee` ("dm: improve bio splitting and associated IO accounting") removed using cloned bio when dm io splitting is needed. Using bio_trim()+bio_inc_remaining() rather than bio_split()+bio_chain() causes multiple dm_io instances to share the same original bio, and it works fine if IOs are completed successfully. But a regression was caused for the case when BLK_STS_DM_REQUEUE is returned from any one of DM's cloned bios (whose dm_io share the same orig_bio). In this BLK_STS_DM_REQUEUE case only the mapped subset of the original bio for the current exact dm_io needs to be re-submitted. However, since the original bio is shared among all dm_io instances, the ->orig_bio actually only represents the last dm_io instance, so requeue can't work as expected. Also when more than one dm_io is requeued, the same original bio is requeued from all dm_io's completion handler, then race is caused. Fix this issue by still allocating one clone bio for completing io only, then io accounting can rely on ->orig_bio being unmodified. This is needed because the dm_io's sector_offset and sectors members are recorded relative to an unmodified ->orig_bio. In the future, we can go back to using bio_trim()+bio_inc_remaining() for dm's io splitting but then delay needing a bio clone only when handling BLK_STS_DM_REQUEUE, but that approach is a bit complicated (so it needs a development cycle): 1) bio clone needs to be done in task context 2) a block interface for unwinding bio is required Fixes: `7dd76d1fee` ("dm: improve bio splitting and associated IO accounting") Reported-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-23 14:33:13 -04:00
Mike Snitzer	78ccef9123	dm: do not return early from dm_io_complete if BLK_STS_AGAIN without polling Commit `5291984004` ("dm: fix bio polling to handle possibile BLK_STS_AGAIN") inadvertently introduced an early return from dm_io_complete() without first queueing the bio to DM if BLK_STS_AGAIN occurs and bio-polling is _not_ being used. Fix this by only returning early from dm_io_complete() if the bio has first been properly queued to DM. Otherwise, the bio will never finish via bio_endio. Fixes: `5291984004` ("dm: fix bio polling to handle possibile BLK_STS_AGAIN") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-21 13:48:52 -04:00
Nikos Tsironis	9ae6e8b1c9	dm era: commit metadata in postsuspend after worker stops During postsuspend dm-era does the following: 1. Archives the current era 2. Commits the metadata, as part of the RPC call for archiving the current era 3. Stops the worker Until the worker stops, it might write to the metadata again. Moreover, these writes are not flushed to disk immediately, but are cached by the dm-bufio client, which writes them back asynchronously. As a result, the committed metadata of a suspended dm-era device might not be consistent with the in-core metadata. In some cases, this can result in the corruption of the on-disk metadata. Suppose the following sequence of events: 1. Load a new table, e.g. a snapshot-origin table, to a device with a dm-era table 2. Suspend the device 3. dm-era commits its metadata, but the worker does a few more metadata writes until it stops, as part of digesting an archived writeset 4. These writes are cached by the dm-bufio client 5. Load the dm-era table to another device. 6. The new instance of the dm-era target loads the committed, on-disk metadata, which don't include the extra writes done by the worker after the metadata commit. 7. Resume the new device 8. The new dm-era target instance starts using the metadata 9. Resume the original device 10. The destructor of the old dm-era target instance is called and destroys the dm-bufio client, which results in flushing the cached writes to disk 11. These writes might overwrite the writes done by the new dm-era instance, hence corrupting its metadata. Fix this by committing the metadata after the worker stops running. stop_worker uses flush_workqueue to flush the current work. However, the work item may re-queue itself and flush_workqueue doesn't wait for re-queued works to finish. This could result in the worker changing the metadata after they have been committed, or writing to the metadata concurrently with the commit in the postsuspend thread. Use drain_workqueue instead, which waits until the work and all re-queued works finish. Fixes: `eec40579d8` ("dm: add era target") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-21 13:35:01 -04:00
Linus Torvalds	462abc9de7	block-5.19-2022-06-16 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKr3gUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphQbD/9q3RlFADRFQICCPoTbh9RBflhjN3D21l7y OarLbQiOpSLT1KRrwUdzxKjNkps/tKrSWKke7znrGpgAWAUXrRX9O2mqraGwQ0JQ cxMI2DDnekQbDaIsv/9lAsF+9rPaSU1NQR09jm4jw5Wc59XnY6mjJF+Lb83Psoow r4MRMXg+JQZGOn4ekN/RujwRaDluK+R1RQPUePXM1/LZPbsbTMFbQR1hrPIcrsYp R5f2an3mpmOrYpVAZQBEl5F451pKv0lah+QVd6VY8/CyReWLTzKh4yHYZs7fgkAK 0aTxXEskrQozLYQgEPefTDt1JoyHqMLQ1EIvVbSoGrlcgdmbjEpDd0ee2IhBdq6m 80gJZOFlE/K4r5RFClIhrsYdJVdSQ+fHGqO+R0WLxWe+w0T80X6tUEewSRO0WnB9 nlXvlj/SDmeXmsrkjFGH55yLubgKTXigCbwwvtdv24zsmmq5QQSgMrTkdey4NH7U C3zs8PVvGsqlXR7/N5GkaSvE/zPe6mx+pXF9SdzFwAAYYglqTZs1moInT3WGrqBw iK/aKJkoCDh06LHkNmQXKh2nTNCl/VLFvlcmyTfbKthfQC6dWD8v3Qe7ce4P7kyM bxEF9hAdJSPNPWXMITU4CRCz6JqK0vZUAUNn8g5lf7SJ1QFrqciwqAQF3/OIxeS0 3+Qj2izcPA== =5gxO -----END PGP SIGNATURE----- Merge tag 'block-5.19-2022-06-16' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull request from Christoph - Quirks, quirks, quirks to work around buggy consumer grade devices (Keith Bush, Ning Wang, Stefan Reiter, Rasheed Hsueh) - Better kernel messages for devices that need quirking (Keith Bush) - Make a kernel message more useful (Thomas Weißschuh) - MD pull request from Song, with a few fixes - blk-mq sysfs locking fixes (Ming) - BFQ stats fix (Bart) - blk-mq offline queue fix (Bart) - blk-mq flush request tag fix (Ming) * tag 'block-5.19-2022-06-16' of git://git.kernel.dk/linux-block: block/bfq: Enable I/O statistics blk-mq: don't clear flush_rq from tags->rqs[] blk-mq: avoid to touch q->elevator without any protection blk-mq: protect q->elevator by ->sysfs_lock in blk_mq_elv_switch_none block: Fix handling of offline queues in blk_mq_alloc_request_hctx() md/raid5-ppl: Fix argument order in bio_alloc_bioset() Revert "md: don't unregister sync_thread with reconfig_mutex held" nvme-pci: disable write zeros support on UMIC and Samsung SSDs nvme-pci: avoid the deepest sleep state on ZHITAI TiPro7000 SSDs nvme-pci: sk hynix p31 has bogus namespace ids nvme-pci: smi has bogus namespace ids nvme-pci: phison e12 has bogus namespace ids nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S50 nvme-pci: add trouble shooting steps for timeouts nvme: add bug report info for global duplicate id nvme: add device name to warning in uuid_show()	2022-06-17 11:22:58 -07:00
Mikulas Patocka	85e123c27d	dm mirror log: round up region bitmap size to BITS_PER_LONG The code in dm-log rounds up bitset_size to 32 bits. It then uses find_next_zero_bit_le on the allocated region. find_next_zero_bit_le accesses the bitmap using unsigned long pointers. So, on 64-bit architectures, it may access 4 bytes beyond the allocated size. Fix this bug by rounding up bitset_size to BITS_PER_LONG. This bug was found by running the lvm2 testsuite with kasan. Fixes: `29121bd0b0` ("[PATCH] dm mirror log: bitset_size fix") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-16 19:39:29 -04:00
Mikulas Patocka	1ee88de395	dm: fix narrow race for REQ_NOWAIT bios being issued despite no support Starting with the commit 63a225c9fd20, device mapper has an optimization that it will take cheaper table lock (dm_get_live_table_fast instead of dm_get_live_table) if the bio has REQ_NOWAIT. The bios with REQ_NOWAIT must not block in the target request routine, if they did, we would be blocking while holding rcu_read_lock, which is prohibited. The targets that are suitable for REQ_NOWAIT optimization (and that don't block in the map routine) have the flag DM_TARGET_NOWAIT set. Device mapper will test if all the targets and all the devices in a table support nowait (see the function dm_table_supports_nowait) and it will set or clear the QUEUE_FLAG_NOWAIT flag on its request queue according to this check. There's a test in submit_bio_noacct: "if ((bio->bi_opf & REQ_NOWAIT) && !blk_queue_nowait(q)) goto not_supported" - this will make sure that REQ_NOWAIT bios can't enter a request queue that doesn't support them. This mechanism works to prevent REQ_NOWAIT bios from reaching dm targets that don't support the REQ_NOWAIT flag (and that may block in the map routine) - except that there is a small race condition: submit_bio_noacct checks if the queue has the QUEUE_FLAG_NOWAIT without holding any locks. Immediatelly after this check, the device mapper table may be reloaded with a table that doesn't support REQ_NOWAIT (for example, if we start moving the logical volume or if we activate a snapshot). However the REQ_NOWAIT bio that already passed the check in submit_bio_noacct would be sent to device mapper, where it could be redirected to a dm target that doesn't support REQ_NOWAIT - the result is sleeping while we hold rcu_read_lock. In order to fix this race, we double-check if the target supports REQ_NOWAIT while we hold the table lock (so that the table can't change under us). Fixes: `563a225c9f` ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-16 19:39:02 -04:00
Mikulas Patocka	5d7362d0d5	dm: fix use-after-free in dm_put_live_table_bio dm_put_live_table_bio is called from the end of dm_submit_bio. However, at this point, the bio may be already finished and the caller may have freed the bio. Consequently, dm_put_live_table_bio accesses the stale "bio" pointer. Fix this bug by loading the bi_opf value and passing it to dm_get_live_table_bio and dm_put_live_table_bio instead of the bio. This bug was found by running the lvm2 testsuite with kasan. Fixes: `563a225c9f` ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-16 19:38:49 -04:00
Logan Gunthorpe	f34fdcd4a0	md/raid5-ppl: Fix argument order in bio_alloc_bioset() bio_alloc_bioset() takes a block device, number of vectors, the OP flags, the GFP mask and the bio set. However when the prototype was changed, the callisite in ppl_do_flush() had the OP flags and the GFP flags reversed. This introduced some sparse error: drivers/md/raid5-ppl.c:632:57: warning: incorrect type in argument 3 (different base types) drivers/md/raid5-ppl.c:632:57: expected unsigned int opf drivers/md/raid5-ppl.c:632:57: got restricted gfp_t [usertype] drivers/md/raid5-ppl.c:633:61: warning: incorrect type in argument 4 (different base types) drivers/md/raid5-ppl.c:633:61: expected restricted gfp_t [usertype] gfp_mask drivers/md/raid5-ppl.c:633:61: got unsigned long long The sparse error introduction may not have been reported correctly by 0day due to other work that was cleaning up other sparse errors in this area. Fixes: `609be10667` ("block: pass a block_device and opf to bio_alloc_bioset") Cc: stable@vger.kernel.org # 5.18+ Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-06-15 10:32:48 -07:00
Guoqing Jiang	d0a180341f	Revert "md: don't unregister sync_thread with reconfig_mutex held" The 07reshape5intr test is broke because of below path. md_reap_sync_thread -> mddev_unlock -> md_unregister_thread(&mddev->sync_thread) And md_check_recovery is triggered by, mddev_unlock -> md_wakeup_thread(mddev->thread) then mddev->reshape_position is set to MaxSector in raid5_finish_reshape since MD_RECOVERY_INTR is cleared in md_check_recovery, which means feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's reshape_position can't be updated accordingly. Fixes: `8b48ec23cc` ("md: don't unregister sync_thread with reconfig_mutex held") Reported-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-06-15 10:30:14 -07:00
Benjamin Marzinski	10eb3a0d51	dm: fix race in dm_start_io_acct After commit `82f6cdcc36` ("dm: switch dm_io booleans over to proper flags") dm_start_io_acct stopped atomically checking and setting was_accounted, which turned into the DM_IO_ACCOUNTED flag. This opened the possibility for a race where IO accounting is started twice for duplicate bios. To remove the race, check the flag while holding the io->lock. Fixes: `82f6cdcc36` ("dm: switch dm_io booleans over to proper flags") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-15 11:51:41 -04:00
Mike Snitzer	dddf305640	dm: fix zoned locking imbalance due to needless check in clone_endio After the commit `ca522482e3` ("dm: pass NULL bdev to bio_alloc_clone"), clone_endio() only calls dm_zone_endio() when DM targets remap the clone bio's bdev to something other than the md->disk->part0 default. However, if a DM target (e.g. dm-crypt) stacked ontop of a dm-zoned does not remap the clone bio using bio_set_dev() then dm_zone_endio() is not called at completion of the bios and zone locks are not properly unlocked. This triggers a hang, in dm_zone_map_bio(), when blktests block/004 is run for dm-crypt on zoned block devices. To avoid the hang, simply remove the clone_endio() check that verifies the target remapped the clone bio to a device other than the default. Fixes: `ca522482e3` ("dm: pass NULL bdev to bio_alloc_clone") Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-10 15:23:54 -04:00
Christoph Hellwig	29dec90a0f	dm: fix bio_set allocation The use of bioset_init_from_src mean that the pre-allocated pools weren't used for anything except parameter passing, and the integrity pool creation got completely lost for the actual live mapped_device. Fix that by assigning the actual preallocated dm_md_mempools to the mapped_device and using that for I/O instead of creating new mempools. Fixes: `2a2a4c510b` ("dm: use bioset_init_from_src() to copy bio_set") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-08 14:04:14 -04:00
Linus Torvalds	78c6499c92	for-5.19/drivers-2022-06-02 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmkoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqyrD/4iyg2ULBPyljLoM3Ed8AONbrFBApenKDFN FjiFRZNll8zvJLTtP0GQqJIrljPuRlqb0IkbgqXl+yvPZ+wpB9HgIe2aohkOaqqJ KS49UNR/aIHwC2y7lwlcsdgVqqoPdc4wnZeaQvCsWPCBhCca/k0kR7s3uEhHMK92 OWpV9osl/thLfOBwwt4IEaO1Koz8PM/fCR4XA2KLbs8E4P8EcSFglqi0ap7foLNr pQAJIlPjkmF6nw4xg5fdBjVBo//kVcuf9IBMi5/XinmUL1taFAcn5WyeOvBbi0Fs Sqp/pKkveM8xWZKrDyA/wf8nzRNpBBl6TOQEMFtV6FkZrij3pbKHlCiyL2gBR13e 5gkbVvXgtdqDVlnqlvIV/Swfh5YQFtn7+vlHFOUjP+iObsBRo4fTvDhPTLoO/VCf tIAA7xnq/pquRKS/QGGC7ZxRVc3T1r+EpvbBP5Dc4CDGbbbyZLCSrOh5HWIb0I3k 95GSiipTtf54KqZ9HiG/u+xNAFIdapXgU4Xm+JyDRWLdxSs5Nmy8VgpdflvxBfuo hCvJJw3vtDusjyHc7IafxaZlJVQT9tPgPshs8GfrCMCP19RCALD/5irspFh6dD35 BQTIkhC68XNa0iNn/NTP3uxir/JwRoovxQkA9eD+r1NHsAbL8GTypfr5kKJxDaIK UhawfyZE3Q== =YqhC -----END PGP SIGNATURE----- Merge tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block Pull more block driver updates from Jens Axboe: "A collection of stragglers that were late on sending in their changes and just followup fixes. - NVMe fixes pull request via Christoph: - set controller enable bit in a separate write (Niklas Cassel) - disable namespace identifiers for the MAXIO MAP1001 (Christoph) - fix a comment typo (Julia Lawall)" - MD fixes pull request via Song: - Remove uses of bdevname (Christoph Hellwig) - Bug fixes (Guoqing Jiang, and Xiao Ni) - bcache fixes series (Coly) - null_blk zoned write fix (Damien) - nbd fixes (Yu, Zhang) - Fix for loop partition scanning (Christoph)" * tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block: (23 commits) block: null_blk: Fix null_zone_write() nvmet: fix typo in comment nvme: set controller enable bit in a separate write nvme-pci: disable namespace identifiers for the MAXIO MAP1001 bcache: avoid unnecessary soft lockup in kworker update_writeback_rate() nbd: use pr_err to output error message nbd: fix possible overflow on 'first_minor' in nbd_dev_add() nbd: fix io hung while disconnecting device nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed nbd: fix race between nbd_alloc_config() and module removal nbd: call genl_unregister_family() first in nbd_cleanup() md: bcache: check the return value of kzalloc() in detached_dev_do_request() bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init() block, loop: support partitions without scanning bcache: avoid journal no-space deadlock by reserving 1 journal bucket bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() bcache: improve multithreaded bch_sectors_dirty_init() bcache: improve multithreaded bch_btree_check() md: fix double free of io_acct_set bioset md: Don't set mddev private to NULL in raid0 pers->free ...	2022-06-03 10:25:56 -07:00
Linus Torvalds	fa78526acc	- Fix DM core's dm_table_supports_poll to return false if no data devices. - Fix DM verity target so that it cannot be switched to a different DM target type (e.g. dm-linear) via DM table reload. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmKX0aMACgkQxSPxCi2d A1okwgf+NMX+Il0qgda1mZrId7nVdQGGyv7uapLXIxRm5Z+LcUxqw+hqzoeabeQ/ 7rou3KqsXuPcpu1AATPHis0Ub9CcXwtpbevf2rmh3Ey4kqLLuFqUP6IjwvFtyp2Y Ms9QAhTvIZXxAPcrb7HH2v7ULOCmdI89OAr8Q/hQ+F4wjI8BO2tNJ4WfxeqpMy5M EFwEO485Ct+XvLDek4+7hYxvSO/6ANgjgzWx4dwsP+iC9SFJurvNVnoXpIl+69DU v8R6Udp0buQqFscyfRbHVOYxVkBROMWg/lKX/4hhgiSoV8j5xSm9hp3S13BffHj7 Bbp8cW+E6IkaASDpIRBAa/6a6k7K1w== =+C+g -----END PGP SIGNATURE----- Merge tag 'for-5.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM core's dm_table_supports_poll to return false if target has no data devices. - Fix DM verity target so that it cannot be switched to a different DM target type (e.g. dm-linear) via DM table reload. * tag 'for-5.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm verity: set DM_TARGET_IMMUTABLE feature flag dm table: fix dm_table_supports_poll to return false if no data devices	2022-06-01 14:25:04 -07:00
Sarthak Kukreti	4caae58406	dm verity: set DM_TARGET_IMMUTABLE feature flag The device-mapper framework provides a mechanism to mark targets as immutable (and hence fail table reloads that try to change the target type). Add the DM_TARGET_IMMUTABLE flag to the dm-verity target's feature flags to prevent switching the verity target with a different target type. Fixes: `a4ffc15219` ("dm: add verity target") Cc: stable@vger.kernel.org Signed-off-by: Sarthak Kukreti <sarthakkukreti@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-31 16:22:30 -04:00
Mike Snitzer	9571f829f3	dm table: fix dm_table_supports_poll to return false if no data devices It was reported that the "generic/250" test in xfstests (which uses the dm-error target) demonstrates a regression where the kernel crashes in bioset_exit(). Since commit `cfc97abcbe` ("dm: conditionally enable BIOSET_PERCPU_CACHE for dm_io bioset") the bioset_init() for the dm_io bioset will setup the bioset's per-cpu alloc cache if all devices have QUEUE_FLAG_POLL set. But there was an bug where a target that doesn't have any data devices (and that doesn't even set the .iterate_devices dm target callback) will incorrectly return true from dm_table_supports_poll(). Fix this by updating dm_table_supports_poll() to follow dm-table.c's well-worn pattern for testing that _all_ targets in a DM table do in fact have underlying devices that set QUEUE_FLAG_POLL. NOTE: An additional block fix is still needed so that bio_alloc_cache_destroy() clears the bioset's ->cache member. Otherwise, a DM device's table reload that transitions the DM device's bioset from using a per-cpu alloc cache to _not_ using one will result in bioset_exit() crashing in bio_alloc_cache_destroy() because dm's dm_io bioset ("io_bs") was left with a stale ->cache member. Fixes: `cfc97abcbe` ("dm: conditionally enable BIOSET_PERCPU_CACHE for dm_io bioset") Reported-by: Matthew Wilcox <willy@infradead.org> Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-31 14:44:17 -04:00
Coly Li	a1a2d8f016	bcache: avoid unnecessary soft lockup in kworker update_writeback_rate() The kworker routine update_writeback_rate() is schedued to update the writeback rate in every 5 seconds by default. Before calling __update_writeback_rate() to do real job, semaphore dc->writeback_lock should be held by the kworker routine. At the same time, bcache writeback thread routine bch_writeback_thread() also needs to hold dc->writeback_lock before flushing dirty data back into the backing device. If the dirty data set is large, it might be very long time for bch_writeback_thread() to scan all dirty buckets and releases dc->writeback_lock. In such case update_writeback_rate() can be starved for long enough time so that kernel reports a soft lockup warn- ing started like: watchdog: BUG: soft lockup - CPU#246 stuck for 23s! [kworker/246:31:179713] Such soft lockup condition is unnecessary, because after the writeback thread finishes its job and releases dc->writeback_lock, the kworker update_writeback_rate() may continue to work and everything is fine indeed. This patch avoids the unnecessary soft lockup by the following method, - Add new member to struct cached_dev - dc->rate_update_retry (0 by default) - In update_writeback_rate() call down_read_trylock(&dc->writeback_lock) firstly, if it fails then lock contention happens. - If dc->rate_update_retry <= BCH_WBRATE_UPDATE_MAX_SKIPS (15), doesn't acquire the lock and reschedules the kworker for next try. - If dc->rate_update_retry > BCH_WBRATE_UPDATE_MAX_SKIPS, no retry anymore and call down_read(&dc->writeback_lock) to wait for the lock. By the above method, at worst case update_writeback_rate() may retry for 1+ minutes before blocking on dc->writeback_lock by calling down_read(). For a 4TB cache device with 1TB dirty data, 90%+ of the unnecessary soft lockup warning message can be avoided. When retrying to acquire dc->writeback_lock in update_writeback_rate(), of course the writeback rate cannot be updated. It is fair, because when the kworker is blocked on the lock contention of dc->writeback_lock, the writeback rate cannot be updated neither. This change follows Jens Axboe's suggestion to a more clear and simple version. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220528124550.32834-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-28 06:48:26 -06:00
Linus Torvalds	35cdd8656e	libnvdimm for 5.19 - Add support for clearing memory error via pwrite(2) on DAX - Fix 'security overwrite' support in the presence of media errors - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests) -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYpFPcQAKCRDfioYZHlFs Z9A3AQCdfoT5sY3OK+I/3oTvJ//6lw2MtXrnXFM046ICKPi9sgD8CzR9mRAHA+vj kxOtJEU2bA9naninXGORsDUndiNkwQo= =gVIn -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm and DAX updates from Dan Williams: "New support for clearing memory errors when a file is in DAX mode, alongside with some other fixes and cleanups. Previously it was only possible to clear these errors using a truncate or hole-punch operation to trigger the filesystem to reallocate the block, now, any page aligned write can opportunistically clear errors as well. This change spans x86/mm, nvdimm, and fs/dax, and has received the appropriate sign-offs. Thanks to Jane for her work on this. Summary: - Add support for clearing memory error via pwrite(2) on DAX - Fix 'security overwrite' support in the presence of media errors - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)" * tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: pmem: implement pmem_recovery_write() pmem: refactor pmem_clear_poison() dax: add .recovery_write dax_operation dax: introduce DAX_RECOVERY_WRITE dax access mode mce: fix set_mce_nospec to always unmap the whole page x86/mce: relocate set{clear}_mce_nospec() functions acpi/nfit: rely on mce->misc to determine poison granularity testing: nvdimm: asm/mce.h is not needed in nfit.c testing: nvdimm: iomap: make __nfit_test_ioremap a macro nvdimm: Allow overwrite in the presence of disabled dimms tools/testing/nvdimm: remove unneeded flush_workqueue	2022-05-27 15:49:30 -07:00
Jia-Ju Bai	40f567bbb3	md: bcache: check the return value of kzalloc() in detached_dev_do_request() The function kzalloc() in detached_dev_do_request() can fail, so its return value should be checked. Fixes: `bc082a55d2` ("bcache: fix inaccurate io state for detached bcache devices") Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-27 09:49:48 -06:00
Coly Li	7d6b902ea0	bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init() The local variables check_state (in bch_btree_check()) and state (in bch_sectors_dirty_init()) should be fully filled by 0, because before allocating them on stack, they were dynamically allocated by kzalloc(). Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-27 09:49:48 -06:00
Linus Torvalds	7e284070ab	- Enable DM core bioset's per-cpu bio cache if QUEUE_FLAG_POLL set. This change improves DM's hipri bio polling (REQ_POLLED) performance by 7 - 20% depending on the system. - Update DM core to use jump_labels to further reduce cost of unlikely branches for zoned block devices, dm-stats and swap_bios throttling. - Various DM core changes to reduce bio-based DM overhead and simplify IO accounting. - Fundamental DM core improvements to dm_io reference counting and the elimination of using bio_split()+bio_chain() -- instead DM's bio-based IO accounting is updated to account that a split occurred. - Improve DM core's abnormal bio processing to do less work. - Improve DM core's hipri polling support to use a single list rather than an hlist. - Update DM core to pass NULL bdev to bio_alloc_clone() so that initialization that isn't useful for DM can be elided. - Add cond_resched to DM stats' various loops that loop over all entries. - Fix incorrect error code return from DM integrity's constructor. - Make DM crypt's printing of the key constant-time. - Update bio-based DM multipath to provide high-resolution timer to the Historical Service Time (HST) path selector. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmKOiAwACgkQxSPxCi2d A1qCxgf/VLmiywUR7zCIDiPyJkc547z2MlPVaTCzpclklyLZ5wgfTtqsb8RnEqRs KIXrvKbakbfZtrG0k9ZYdbIIVO4jHQavCXtVZH0Mj4XqQArDhHxZzlBx8i6BirO5 aya7HKWHcXj+s7GF146hRcPtD53LbSNDrB/bVaiZcHxOemthSgtht2xwmZfzoCcS JtorpWkA97lgDbdSAcR0kn1zkHDWFkQMJgnUL9FotnlhOVXZVUvAbgb4OjfxB3fG bCYlytDhs+6VwO+/CZXcOA7LNNtkLJPEkxUxjrChS5do64SgJUlDvY66iB6+7K0w 0Jri5D/NZvjwYVeLgXn5UwyQJ75cjQ== =zlaE -----END PGP SIGNATURE----- Merge tag 'for-5.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Enable DM core bioset's per-cpu bio cache if QUEUE_FLAG_POLL set. This change improves DM's hipri bio polling (REQ_POLLED) performance by 7 - 20% depending on the system. - Update DM core to use jump_labels to further reduce cost of unlikely branches for zoned block devices, dm-stats and swap_bios throttling. - Various DM core changes to reduce bio-based DM overhead and simplify IO accounting. - Fundamental DM core improvements to dm_io reference counting and the elimination of using bio_split()+bio_chain() -- instead DM's bio-based IO accounting is updated to account that a split occurred. - Improve DM core's abnormal bio processing to do less work. - Improve DM core's hipri polling support to use a single list rather than an hlist. - Update DM core to pass NULL bdev to bio_alloc_clone() so that initialization that isn't useful for DM can be elided. - Add cond_resched to DM stats' various loops that loop over all entries. - Fix incorrect error code return from DM integrity's constructor. - Make DM crypt's printing of the key constant-time. - Update bio-based DM multipath to provide high-resolution timer to the Historical Service Time (HST) path selector. * tag 'for-5.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits) dm: pass NULL bdev to bio_alloc_clone dm cache metadata: remove unnecessary variable in __dump_mapping dm mpath: provide high-resolution timer to HST for bio-based dm crypt: make printing of the key constant-time dm integrity: fix error code in dm_integrity_ctr() dm stats: add cond_resched when looping over entries dm: improve abnormal bio processing dm: simplify bio-based IO accounting further dm: put all polled dm_io instances into a single list dm: improve dm_io reference counting dm: don't grab target io reference in dm_zone_map_bio dm: improve bio splitting and associated IO accounting dm: switch to bdev based IO accounting interfaces dm: pass dm_io instance to dm_io_acct directly dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct dm: use bio_sectors in dm_aceept_partial_bio dm: simplify basic targets dm: conditionally enable branching for less used features dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio dm: move hot dm_io members to same cacheline as dm_target_io ...	2022-05-26 21:13:45 -07:00
Coly Li	32feee36c3	bcache: avoid journal no-space deadlock by reserving 1 journal bucket The journal no-space deadlock was reported time to time. Such deadlock can happen in the following situation. When all journal buckets are fully filled by active jset with heavy write I/O load, the cache set registration (after a reboot) will load all active jsets and inserting them into the btree again (which is called journal replay). If a journaled bkey is inserted into a btree node and results btree node split, new journal request might be triggered. For example, the btree grows one more level after the node split, then the root node record in cache device super block will be upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no space in journal buckets, the journal replay has to wait for new journal bucket to be reclaimed after at least one journal bucket replayed. This is one example that how the journal no-space deadlock happens. The solution to avoid the deadlock is to reserve 1 journal bucket in run time, and only permit the reserved journal bucket to be used during cache set registration procedure for things like journal replay. Then the journal space will never be fully filled, there is no chance for journal no-space deadlock to happen anymore. This patch adds a new member "bool do_reserve" in struct journal, it is inititalized to 0 (false) when struct journal is allocated, and set to 1 (true) by bch_journal_space_reserve() when all initialization done in run_cache_set(). In the run time when journal_reclaim() tries to allocate a new journal bucket, free_journal_buckets() is called to check whether there are enough free journal buckets to use. If there is only 1 free journal bucket and journal->do_reserve is 1 (true), the last bucket is reserved and free_journal_buckets() will return 0 to indicate no free journal bucket. Then journal_reclaim() will give up, and try next time to see whetheer there is free journal bucket to allocate. By this method, there is always 1 jouranl bucket reserved in run time. During the cache set registration, journal->do_reserve is 0 (false), so the reserved journal bucket can be used to avoid the no-space deadlock. Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-24 06:19:33 -06:00
Coly Li	80db4e4707	bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() After making bch_sectors_dirty_init() being multithreaded, the existing incremental dirty sector counting in bch_root_node_dirty_init() doesn't release btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME) bkeys. Because a read lock is added on btree root node to prevent the btree to be split during the dirty sectors counting, other I/O requester has no chance to gain the write lock even restart bcache_btree(). That is to say, the incremental dirty sectors counting is incompatible to the multhreaded bch_sectors_dirty_init(). We have to choose one and drop another one. In my testing, with 512 bytes random writes, I generate 1.2T dirty data and a btree with 400K nodes. With single thread and incremental dirty sectors counting, it takes 30+ minites to register the backing device. And with multithreaded dirty sectors counting, the backing device registration can be accomplished within 2 minutes. The 30+ minutes V.S. 2- minutes difference makes me decide to keep multithreaded bch_sectors_dirty_init() and drop the incremental dirty sectors counting. This is what this patch does. But INIT_KEYS_EACH_TIME is kept, in sectors_dirty_init_fn() the CPU will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys iterated. This is to avoid the watchdog reports a bogus soft lockup warning. Fixes: `b144e45fc5` ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-24 06:19:33 -06:00
Coly Li	4dc34ae1b4	bcache: improve multithreaded bch_sectors_dirty_init() Commit `b144e45fc5` ("bcache: make bch_sectors_dirty_init() to be multithreaded") makes bch_sectors_dirty_init() to be much faster when counting dirty sectors by iterating all dirty keys in the btree. But it isn't in ideal shape yet, still can be improved. This patch does the following changes to improve current parallel dirty keys iteration on the btree, - Add read lock to root node when multiple threads iterating the btree, to prevent the root node gets split by I/Os from other registered bcache devices. - Remove local variable "char name[32]" and generate kernel thread name string directly when calling kthread_run(). - Allocate "struct bch_dirty_init_state state" directly on stack and avoid the unnecessary dynamic memory allocation for it. - Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12 which is enough indeed. - Increase &state->started to count created kernel thread after it succeeds to create. - When wait for all dirty key counting threads to finish, use wait_event() to replace wait_event_interruptible(). With the above changes, the code is more clear, and some potential error conditions are avoided. Fixes: `b144e45fc5` ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-24 06:19:33 -06:00
Coly Li	622536443b	bcache: improve multithreaded bch_btree_check() Commit `8e7102273f` ("bcache: make bch_btree_check() to be multithreaded") makes bch_btree_check() to be much faster when checking all btree nodes during cache device registration. But it isn't in ideal shap yet, still can be improved. This patch does the following thing to improve current parallel btree nodes check by multiple threads in bch_btree_check(), - Add read lock to root node while checking all the btree nodes with multiple threads. Although currently it is not mandatory but it is good to have a read lock in code logic. - Remove local variable 'char name[32]', and generate kernel thread name string directly when calling kthread_run(). - Allocate local variable "struct btree_check_state check_state" on the stack and avoid unnecessary dynamic memory allocation for it. - Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12 which is enough indeed. - Increase check_state->started to count created kernel thread after it succeeds to create. - When wait for all checking kernel threads to finish, use wait_event() to replace wait_event_interruptible(). With this change, the code is more clear, and some potential error conditions are avoided. Fixes: `8e7102273f` ("bcache: make bch_btree_check() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-24 06:19:33 -06:00
Linus Torvalds	5dc921868c	for-5.19/drivers-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrTcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgph/REAC0/7odRfJeTJ1PkJhSKFc7dhyS7rK4du2s 3+z+H6Yeua2yVIJb0mYYGEJcOUUQ9nD2T9424n3NzDOw88U4y8Vg2YEH+UiJBuj4 AJoxPNkQdxL7WzmwHmRNLCcOOFhISLqWiCJSr45d+LP1f6aO24Q9lewYWxtNA4TW mqb7Ne7e3Z77m9rmsCsZ26bzQHg1EEQ6qgjZM9tqMhOeTqYhmrqfrD9KtG8TIkpK N8277E5QcequHf7v6VpKqEOzf3d2kx55JaZdu+oxLPVMED3wJJFwcYF1/xmM7Fgx tp7xCjqqUHXwKvJNCFJpnvw+cXu0Ct7cWOIG4ROCvaTD4vBI1KzZLc0gO7pKFW0Y hNIlMXr4n8PmonS81tMV4TqmRWxedX/jxuaeJCVNr89PqYU4luPpigJZqv7rlGry KZUlktQot22M/7FC2MS6KhgbQKLPrRGTAEyY/JNwBHckCZiduWQFlmKLQ926xQIJ 6vdjSzHK5MrT/d+yow3bGFxAJWloGJ+L+RsH0b+WikF81+6ic9P3AoStgbVilfKD 6sbjcju8SShDlQ+W/Ocm0rHC+i/RDKT3QqItXgfhA/1FfMPODQGc/xcZg+AdTswn VSnUIkvk9/mTO0StilVfNJDfG1QkSpJ5Ilvs/DnIahZj6IG4QbJvtnVNbmQX6ptz AUB4DdGwXg== =geQL -----END PGP SIGNATURE----- Merge tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "Here are the driver updates queued up for 5.19. This contains: - NVMe pull requests via Christoph: - tighten the PCI presence check (Stefan Roese) - fix a potential NULL pointer dereference in an error path (Kyle Miller Smith) - fix interpretation of the DMRSL field (Tom Yan) - relax the data transfer alignment (Keith Busch) - verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni, Christoph) - set non-mdts limits in nvme_scan_work (Chaitanya Kulkarni) - add support for TP4084 - Time-to-Ready Enhancements (Christoph) - MD pull request via Song: - Improve annotation in raid5 code, by Logan Gunthorpe - Support MD_BROKEN flag in raid-1/5/10, by Mariusz Tkaczyk - Other small fixes/cleanups - null_blk series making the configfs side much saner (Damien) - Various minor drbd cleanups and fixes (Haowen, Uladzislau, Jiapeng, Arnd, Cai) - Avoid using the system workqueue (and hence flushing it) in rnbd (Jack) - Avoid using the system workqueue (and hence flushing it) in aoe (Tetsuo) - Series fixing discard_alignment issues in drivers (Christoph) - Small series fixing drivers poking at disk->part0 for openers information (Christoph) - Series fixing deadlocks in loop (Christoph, Tetsuo) - Remove loop.h and add SPDX headers (Christoph) - Various fixes and cleanups (Julia, Xie, Yu)" * tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-block: (72 commits) mtip32xx: fix typo in comment nvme: set non-mdts limits in nvme_scan_work nvme: add support for TP4084 - Time-to-Ready Enhancements nvme: split the enum used for various register constants nbd: Fix hung on disconnect request if socket is closed before nvme-fabrics: add a request timeout helper nvme-pci: harden drive presence detect in nvme_dev_disable() nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags nvme: mark internal passthru request RQF_QUIET nvme: remove unneeded include from constants file nvme: add missing status values to verbose logging nvme: set dma alignment to dword nvme: fix interpretation of DMRSL loop: remove most the top-of-file boilerplate comment from the UAPI header loop: remove most the top-of-file boilerplate comment loop: add a SPDX header loop: remove loop.h block: null_blk: Improve device creation with configfs block: null_blk: Cleanup messages block: null_blk: Cleanup device creation and deletion ...	2022-05-23 14:04:14 -07:00
Linus Torvalds	115cd47132	for-5.19/block-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrUsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgDjD/44hY9h0JsOLoRH1IvFtuaH6n718JXuqG17 hHCfmnAUVqj2jT00IUbVlUTd905bCGpfrodBL3PAmPev1zZHOUd/MnJKrSynJ+/s NJEMZQaHxLmocNDpJ1sZo7UbAFErsZXB0gVYUO8cH2bFYNu84H1mhRCOReYyqmvQ aIAASX5qRB/ciBQCivzAJl2jTdn4WOn5hWi9RLidQB7kSbaXGPmgKAuN88WI4H7A zQgAkEl2EEquyMI5tV1uquS7engJaC/4PsenF0S9iTyrhJLjneczJBJZKMLeMR8d sOm6sKJdpkrfYDyaA4PIkgmLoEGTtwGpqGHl4iXTyinUAxJoca5tmPvBb3wp66GE 2Mr7pumxc1yJID2VHbsERXlOAX3aZNCowx2gum2MTRIO8g11Eu3aaVn2kv37MBJ2 4R2a/cJFl5zj9M8536cG+Yqpy0DDVCCQKUIqEupgEu1dyfpznyWH5BTAHXi1E8td nxUin7uXdD0AJkaR0m04McjS/Bcmc1dc6I8xvkdUFYBqYCZWpKOTiEpIBlHg0XJA sxdngyz5lSYTGVA4o4QCrdR0Tx1n36A1IYFuQj0wzxBJYZ02jEZuII/A3dd+8hiv EY+VeUQeVIXFFuOcY+e0ScPpn7Nr17hAd1en/j2Hcoe4ZE8plqG2QTcnwgflcbis iomvJ4yk0Q== =0Rw1 -----END PGP SIGNATURE----- Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Here are the core block changes for 5.19. This contains: - blk-throttle accounting fix (Laibin) - Series removing redundant assignments (Michal) - Expose bio cache via the bio_set, so that DM can use it (Mike) - Finish off the bio allocation interface cleanups by dealing with the weirdest member of the family. bio_kmalloc combines a kmalloc for the bio and bio_vecs with a hidden bio_init call and magic cleanup semantics (Christoph) - Clean up the block layer API so that APIs consumed by file systems are (almost) only struct block_device based, so that file systems don't have to poke into block layer internals like the request_queue (Christoph) - Clean up the blk_execute_rq* API (Christoph) - Clean up various lose end in the blk-cgroup code to make it easier to follow in preparation of reworking the blkcg assignment for bios (Christoph) - Fix use-after-free issues in BFQ when processes with merged queues get moved to different cgroups (Jan) - BFQ fixes (Jan) - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming, Wolfgang, me)" * tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits) blk-mq: fix typo in comment bfq: Remove bfq_requeue_request_body() bfq: Remove superfluous conversion from RQ_BIC() bfq: Allow current waker to defend against a tentative one bfq: Relax waker detection for shared queues blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() blk-throttle: Set BIO_THROTTLED when bio has been throttled blk-cgroup: Remove unnecessary rcu_read_lock/unlock() blk-cgroup: always terminate io.stat lines block, bfq: make bfq_has_work() more accurate block, bfq: protect 'bfqd->queued' by 'bfqd->lock' block: cleanup the VM accounting in submit_bio block: Fix the bio.bi_opf comment block: reorder the REQ_ flags blk-iocost: combine local_stat and desc_stat to stat block: improve the error message from bio_check_eod block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone block: remove superfluous calls to blkcg_bio_issue_init kthread: unexport kthread_blkcg blk-cgroup: cleanup blkcg_maybe_throttle_current ...	2022-05-23 13:56:39 -07:00
Xiao Ni	42b805af10	md: fix double free of io_acct_set bioset Now io_acct_set is alloc and free in personality. Remove the codes that free io_acct_set in md_free and md_stop. Fixes: `0c031fd37f` (md: Move alloc/free acct bioset in to personality) Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>	2022-05-22 23:07:22 -07:00
Xiao Ni	0f2571ad7a	md: Don't set mddev private to NULL in raid0 pers->free In normal stop process, it does like this: do_md_stop \| __md_stop (pers->free(); mddev->private=NULL) \| md_free (free mddev) __md_stop sets mddev->private to NULL after pers->free. The raid device will be stopped and mddev memory is free. But in reshape, it doesn't free the mddev and mddev will still be used in new raid. In reshape, it first sets mddev->private to new_pers and then runs old_pers->free(). Now raid0 sets mddev->private to NULL in raid0_free. The new raid can't work anymore. It will panic when dereference mddev->private because of NULL pointer dereference. It can panic like this: [63010.814972] kernel BUG at drivers/md/raid10.c:928! [63010.819778] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [63010.825011] CPU: 3 PID: 44437 Comm: md0_resync Kdump: loaded Not tainted 5.14.0-86.el9.x86_64 #1 [63010.833789] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.15.0 09/11/2020 [63010.841440] RIP: 0010:raise_barrier+0x161/0x170 [raid10] [63010.865508] RSP: 0018:ffffc312408bbc10 EFLAGS: 00010246 [63010.870734] RAX: 0000000000000000 RBX: ffffa00bf7d39800 RCX: 0000000000000000 [63010.877866] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa00bf7d39800 [63010.884999] RBP: 0000000000000000 R08: fffffa4945e74400 R09: 0000000000000000 [63010.892132] R10: ffffa00eed02f798 R11: 0000000000000000 R12: ffffa00bbc435200 [63010.899266] R13: ffffa00bf7d39800 R14: 0000000000000400 R15: 0000000000000003 [63010.906399] FS: 0000000000000000(0000) GS:ffffa00eed000000(0000) knlGS:0000000000000000 [63010.914485] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63010.920229] CR2: 00007f5cfbe99828 CR3: 0000000105efe000 CR4: 00000000003506e0 [63010.927363] Call Trace: [63010.929822] ? bio_reset+0xe/0x40 [63010.933144] ? raid10_alloc_init_r10buf+0x60/0xa0 [raid10] [63010.938629] raid10_sync_request+0x756/0x1610 [raid10] [63010.943770] md_do_sync.cold+0x3e4/0x94c [63010.947698] md_thread+0xab/0x160 [63010.951024] ? md_write_inc+0x50/0x50 [63010.954688] kthread+0x149/0x170 [63010.957923] ? set_kthread_struct+0x40/0x40 [63010.962107] ret_from_fork+0x22/0x30 Removing the code that sets mddev->private to NULL in raid0 can fix problem. Fixes: `0c031fd37f` (md: Move alloc/free acct bioset in to personality) Reported-by: Fine Fan <ffan@redhat.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>	2022-05-22 23:07:21 -07:00
Christoph Hellwig	913cce5a1e	md: remove most calls to bdevname Use the %pg format specifier to save on stack consumption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-05-22 23:07:21 -07:00
Guoqing Jiang	1e26774228	md: protect md_unregister_thread from reentrancy Generally, the md_unregister_thread is called with reconfig_mutex, but raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread, so md_unregister_thread can be called simulitaneously from two call sites in theory. Then after previous commit which remove the protection of reconfig_mutex for md_unregister_thread completely, the potential issue could be worse than before. Let's take pers_lock at the beginning of function to ensure reentrancy. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-05-22 23:07:21 -07:00
Guoqing Jiang	8b48ec23cc	md: don't unregister sync_thread with reconfig_mutex held Unregister sync_thread doesn't need to hold reconfig_mutex since it doesn't reconfigure array. And it could cause deadlock problem for raid5 as follows: 1. process A tried to reap sync thread with reconfig_mutex held after echo idle to sync_action. 2. raid5 sync thread was blocked if there were too many active stripes. 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer) which causes the number of active stripes can't be decreased. 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able to hold reconfig_mutex. More details in the link: https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t And add one parameter to md_reap_sync_thread since it could be called by dm-raid which doesn't hold reconfig_mutex. Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <song@kernel.org>	2022-05-22 23:07:21 -07:00
Jane Chu	047218ec90	dax: add .recovery_write dax_operation Introduce dax_recovery_write() operation. The function is used to recover a dax range that contains poison. Typical use case is when a user process receives a SIGBUS with si_code BUS_MCEERR_AR indicating poison(s) in a dax range, in response, the user process issues a pwrite() to the page-aligned dax range, thus clears the poison and puts valid data in the range. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jane Chu <jane.chu@oracle.com> Link: https://lore.kernel.org/r/20220422224508.440670-6-jane.chu@oracle.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2022-05-16 13:37:59 -07:00
Jane Chu	e511c4a3d2	dax: introduce DAX_RECOVERY_WRITE dax access mode Up till now, dax_direct_access() is used implicitly for normal access, but for the purpose of recovery write, dax range with poison is requested. To make the interface clear, introduce enum dax_access_mode { DAX_ACCESS, DAX_RECOVERY_WRITE, } where DAX_ACCESS is used for normal dax access, and DAX_RECOVERY_WRITE is used for dax recovery write. Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/165247982851.52965.11024212198889762949.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2022-05-16 13:35:56 -07:00
Mike Snitzer	ca522482e3	dm: pass NULL bdev to bio_alloc_clone Most DM targets will remap the clone bio passed to their ->map function using bio_set_bdev(). So this change to pass NULL bdev to bio_alloc_clone avoids clone-time work that sets up resources for a bdev association that will not be used in practice (e.g. clone issued to underlying device will not use DM device's blk-cgroups resources). But clone->bi_bdev is still initialized following bio_alloc_clone to preserve DM target expectations that clone->bi_bdev will be set. Follow-up work is needed to audit DM targets to remove accesses to a clone->bi_bdev that the target didn't initialize with bio_set_dev(). Depends-on: `7ecc56c62b` ("block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-11 13:58:52 -04:00
Guo Zhengkui	d254c3699f	dm cache metadata: remove unnecessary variable in __dump_mapping Fix the following coccicheck warning: drivers/md/dm-cache-metadata.c:1512:5-6: Unneeded variable: "r". Return "0" on line 1520. Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-09 15:40:10 -04:00
Gabriel Krisman Bertazi	c06dfd124d	dm mpath: provide high-resolution timer to HST for bio-based The precision loss of reading IO start_time with jiffies_to_nsecs instead of using a high resolution timer degrades HST path prediction for BIO-based mpath on high load workloads. Below, I show the utilization percentage of a 10 disk multipath with asymmetrical disk access cost, while being exercised by a randwrite FIO benchmark with high submission queue depth (depth=64). It is possible to see that the HST path selection degrades heavily for high-iops in BIO-mpath, underutilizing the slower paths way beyond expected. This seems to be caused by the start_time truncation, which makes some IO to seem much slower than it actually is. In this scenario ST outperforms HST for bio-mpath, but not for mq-mpath, which already uses ktime_get_ns(). The third column shows utilization with this patch applied. It is easy to see that now HST prediction is much closer to the ideal distribution (calculated considering the real cost of each path). \| \| ST \| HST (orig) \| HST(ktime) \| Best \| \| sdd \| 0.17 \| 0.20 \| 0.17 \| 0.18 \| \| sde \| 0.17 \| 0.20 \| 0.17 \| 0.18 \| \| sdf \| 0.17 \| 0.20 \| 0.17 \| 0.18 \| \| sdg \| 0.06 \| 0.00 \| 0.06 \| 0.04 \| \| sdh \| 0.03 \| 0.00 \| 0.03 \| 0.02 \| \| sdi \| 0.03 \| 0.00 \| 0.03 \| 0.02 \| \| sdj \| 0.02 \| 0.00 \| 0.01 \| 0.01 \| \| sdk \| 0.02 \| 0.00 \| 0.01 \| 0.01 \| \| sdl \| 0.17 \| 0.20 \| 0.17 \| 0.18 \| \| sdm \| 0.17 \| 0.20 \| 0.17 \| 0.18 \| This issue was originally discussed [1] when we first merged HST, and this patch was left as a low hanging fruit to be solved later. Regarding the implementation, as suggested by Mike in that mail thread, in order to avoid the overhead of ktime_get_ns for other selectors, this patch adds a flag for the selector code to request the high-resolution timer. I tested this using the same benchmark used in the original HST submission. Full test and benchmark scripts are available here: https://people.collabora.com/~krisman/HST-BIO-MPATH/ [1] https://lore.kernel.org/lkml/85tv0am9de.fsf@collabora.com/T/ Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> [snitzer: cleaned up various implementation details] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-09 15:39:23 -04:00
Mikulas Patocka	567dd8f345	dm crypt: make printing of the key constant-time The device mapper dm-crypt target is using scnprintf("%02x", cc->key[i]) to report the current key to userspace. However, this is not a constant-time operation and it may leak information about the key via timing, via cache access patterns or via the branch predictor. Change dm-crypt's key printing to use "%c" instead of "%02x". Also introduce hex2asc() that carefully avoids any branching or memory accesses when converting a number in the range 0 ... 15 to an ascii character. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-09 12:34:03 -04:00
Dan Carpenter	d3f2a14b89	dm integrity: fix error code in dm_integrity_ctr() The "r" variable shadows an earlier "r" that has function scope. It means that we accidentally return success instead of an error code. Smatch has a warning for this: drivers/md/dm-integrity.c:4503 dm_integrity_ctr() warn: missing error code 'r' Fixes: `7eada909bf` ("dm: add integrity target") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-09 12:14:00 -04:00
Mikulas Patocka	bfe2b0146c	dm stats: add cond_resched when looping over entries dm-stats can be used with a very large number of entries (it is only limited by 1/4 of total system memory), so add rescheduling points to the loops that iterate over the entries. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-09 12:11:07 -04:00
Mike Snitzer	4edadf6dcb	dm: improve abnormal bio processing Read/write/flush are the most common operations, optimize switch in is_abnormal_io() for those cases. Follows same pattern established in block perf-wip commit ("block: optimise blk_may_split for normal rw") Also, push is_abnormal_io() check and blk_queue_split() down from dm_submit_bio() to dm_split_and_process_bio() and set new 'is_abnormal_io' flag in clone_info. Optimize __split_and_process_bio and __process_abnormal_io by leveraging ci.is_abnormal_io flag. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:36 -04:00
Mike Snitzer	9d20653fe8	dm: simplify bio-based IO accounting further Now that io splitting is recorded prior to, or during, ->map IO accounting can happen immediately rather than defer until after bio splitting in dm_split_and_process_bio(). Remove the DM_IO_START_ACCT flag and also remove dm_io's map_task member because there is no longer any need to wait for splitting to occur before accounting. Also move dm_io struct's 'flags' member to consolidate struct holes. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:36 -04:00
Ming Lei	ec211631ae	dm: put all polled dm_io instances into a single list Now that bio_split() isn't used by DM's bio splitting, it is a bit overkill to link dm_io into an hlist given there is only single dm_io in the list. Convert to using a single list for holding all dm_io instances associated with this bio. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:36 -04:00
Ming Lei	0f14d60a02	dm: improve dm_io reference counting Currently each dm_io's reference counter is grabbed before calling __map_bio(), this way isn't efficient since we can move this grabbing to initialization time inside alloc_io(). Meantime it becomes typical async io reference counter model: one is for submission side, the other is for completion side, and the io won't be completed until both sides are done. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:36 -04:00
Ming Lei	2e803cd99b	dm: don't grab target io reference in dm_zone_map_bio dm_zone_map_bio() is only called from __map_bio in which the io's reference is grabbed already, and the reference won't be released until the bio is submitted, so not necessary to do it dm_zone_map_bio any more. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:36 -04:00
Ming Lei	7dd76d1fee	dm: improve bio splitting and associated IO accounting The current DM code (ab)uses late assignment of dm_io->orig_bio (after __map_bio() returns and any bio splitting is complete) to indicate the FS bio has been processed and can be accounted. This results in awkward waiting until ->orig_bio is set in dm_submit_bio_remap(). Also the bio splitting was implemented using bio_split()+bio_chain() -- a well-worn pattern but it requires bio cloning purely for the benefit of more natural IO accounting. The bio_split() result was stored in ->orig_bio to represent the mapped part of the original FS bio. DM has switched to the bdev based IO accounting interface. DM's IO accounting can be implemented in terms of the original FS bio (now stored early in ->orig_bio) via access to its sectors/bio_op. And if/when splitting is needed, set a new DM_IO_WAS_SPLIT flag and use new dm_io fields of .sector_offset & .sectors to allow IO accounting for split bios _without_ needing to clone a new bio to store in ->orig_bio. Signed-off-by: Ming Lei <ming.lei@redhat.com> Co-developed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Ming Lei	d3de6d1269	dm: switch to bdev based IO accounting interfaces DM splits flush with data into empty flush followed by bio with data payload, switch dm_io_acct() to use bdev_{start,end}_io_acct() to do this accoiunting more naturally (rather than temporarily changing the bio's bi_size). This will allow DM to more easily account bios that are split (in following commit). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Ming Lei	e6926ad0c9	dm: pass dm_io instance to dm_io_acct directly All the other 4 parameters are retrieved from the 'dm_io' instance, so it's not necessary to pass all four to dm_io_acct(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Ming Lei	b992b40dfc	dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct dm->orig_bio is always passed to __dm_start_io_acct and dm_end_io_acct, so it isn't necessary to take one bio parameter for the two helpers. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Mike Snitzer	bdb34759a0	dm: use bio_sectors in dm_aceept_partial_bio Rename 'bi_size' to 'bio_sectors' given bi_size is being stored in sectors. Also, use bio_sectors() rather than open-coding it. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Mike Snitzer	e86f2b005a	dm: simplify basic targets Remove needless factoring and remap bi_sector regardless of bio_sectors() being non-zero. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Mike Snitzer	442761fd2b	dm: conditionally enable branching for less used features Use jump_labels to further reduce cost of unlikely branches for zoned block devices, dm-stats and swap_bios throttling. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:34 -04:00
Mike Snitzer	563a225c9f	dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio If a bio is marked REQ_NOWAIT optimize dm_submit_bio()'s dm_table RCU usage to dm_{get,put}_live_table_fast. DM core offers protection against blocking (via suspend) if REQ_NOWAIT. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:34 -04:00
Mike Snitzer	982b48ae25	dm: move hot dm_io members to same cacheline as dm_target_io Just saves some cacheline bouncing for members accessed during cloned bio submission and completion. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:34 -04:00
Mike Snitzer	6cbce280fc	dm: add local variables to clone_endio and __map_bio Avoid redundant dereferences in both functions. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:34 -04:00
Mike Snitzer	fe221db419	dm: mark various branches unlikely Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:34 -04:00
Mike Snitzer	3b03f7c124	dm: simplify dm_start_io_acct Pull common DM_IO_ACCOUNTED check out to beginning of dm_start_io_acct. Also, use dm_tio_is_normal (and move it to dm-core.h). Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:34 -04:00
Mike Snitzer	4857abf664	dm: simplify dm_io access in dm_split_and_process_bio Use local variable instead of redudant access using ci.io Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:33 -04:00
Mike Snitzer	84b98f4ce4	dm: factor out dm_io_set_error and __dm_io_dec_pending Also eliminate need to use errno_to_blk_status(). Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:33 -04:00
Mike Snitzer	cfc97abcbe	dm: conditionally enable BIOSET_PERCPU_CACHE for dm_io bioset A bioset's per-cpu alloc cache may have broader utility in the future but for now constrain it to being tightly coupled to QUEUE_FLAG_POLL. Also change dm_io_complete() to use bio_clear_polled() so that it properly clears all associated bio state on requeue. This commit improves DM's hipri bio polling (REQ_POLLED) perf by 7 - 20% depending on the system. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:33 -04:00
Christoph Hellwig	3d50d368c9	raid5: don't set the discard_alignment queue limit The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by raid5 is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Christoph Hellwig	44d583702f	dm-zoned: don't set the discard_alignment queue limit The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by dm-zoned is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
David Sloan	9151ad5d86	md: Replace role magic numbers with defined constants There are several instances where magic numbers are used in md.c instead of the defined constants in md_p.h. This patch set improves code readability by replacing all occurrences of 0xffff, 0xfffe, and 0xfffd when relating to md roles with their equivalent defined constant. Signed-off-by: David Sloan <david.sloan@eideticom.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:22:31 -07:00
Pascal Hambourg	ea23994edc	md/raid0: Ignore RAID0 layout if the second zone has only one device The RAID0 layout is irrelevant if all members have the same size so the array has only one zone. It is also irrelevant if the array has two zones and the second zone has only one device, for example if the array has two members of different sizes. So in that case it makes sense to allow assembly even when the layout is undefined, like what is done when the array has only one zone. Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Pascal Hambourg <pascal@plouf.fr.eu.org> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:37 -07:00
Logan Gunthorpe	4631f39f05	md/raid5: Annotate functions that hold device_lock with __must_hold A handful of functions note the device_lock must be held with a comment but this is not comprehensive. Many other functions hold the lock when taken so add an __must_hold() to each call to annotate when the lock is held. This makes it a bit easier to analyse device_lock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:37 -07:00
Logan Gunthorpe	4f4ee2bf32	md/raid5-ppl: Annotate with rcu_dereference_protected() To suppress the last remaining sparse warnings about accessing rdev, add rcu_dereference_protected calls to a couple places in raid5-ppl. All of these places are called under raid5_run and therefore are occurring before the array has started and is thus safe. There's no sensible check to do for the second argument of rcu_dereference_protected() so a comment is added instead. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:37 -07:00
Logan Gunthorpe	9aeb7f99a1	md/raid5: Annotate rdev/replacement access when mddev_lock is held The mddev_lock should be held during raid5_remove_disk() which is when the rdev/replacement pointers are modified. So any access to these pointers marked __rcu should be safe whenever the mddev_lock is held. There are numerous such access that currently produce sparse warnings. Add a helper function, rdev_mdlock_deref() that wraps rcu_dereference_protected() in all these instances. This annotation fixes a number of sparse warnings. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:37 -07:00
Logan Gunthorpe	e38b043255	md/raid5: Annotate rdev/replacement accesses when nr_pending is elevated There are a number of accesses to __rcu variables that should be safe because nr_pending in the disk is known to be elevated. Create a wrapper around rcu_dereference_protected() to annotate these accesses and verify that nr_pending is non-zero. This fixes a number of sparse warnings. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:36 -07:00
Logan Gunthorpe	b0920ede08	md/raid5: Add __rcu annotation to struct disk_info rdev and replacement are protected in some circumstances with rcu_dereference and synchronize_rcu (in raid5_remove_disk()). However, they were not annotated with __rcu so a sparse warning is emitted for every rcu_dereference() call. Add the __rcu annotation and fix up the initialization with RCU_INIT_POINTER, all pointer modifications with rcu_assign_pointer(), a few cases where the pointer value is tested with rcu_access_pointer() and one case where READ_ONCE() is used instead of rcu_dereference(), a case in print_raid5_conf() that should have rcu_dereference() and rcu_read_[un]lock() calls. Additional sparse issues will be fixed up in further commits. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:36 -07:00
Logan Gunthorpe	3d9a644cf4	md/raid5: Un-nest struct raid5_percpu definition Sparse reports many warnings of the form: drivers/md/raid5.c:1476:16: warning: dereference of noderef expression This is because all struct raid5_percpu definitions get marked as __percpu when really only the pointer in r5conf should have that annotation. Fix this by moving the defnition of raid5_precpu out of the definition of struct r5conf. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:36 -07:00
Logan Gunthorpe	8fbcba6b99	md/raid5: Cleanup setup_conf() error returns Be more careful about the error returns. Most errors in this function are actually ENOMEM, but it forcibly returns EIO if conf has been allocated. Instead return ret and ensure it is set appropriately before each goto abort. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:36 -07:00
Heming Zhao	92d9aac92b	md: replace deprecated strlcpy & remove duplicated line This commit includes two topics: 1> replace deprecated strlcpy change strlcpy to strscpy for strlcpy is marked as deprecated in Documentation/process/deprecated.rst 2> remove duplicated strlcpy line in md_bitmap_read_sb@md-bitmap.c there are two duplicated strlcpy(), the history: - commit `cf921cc19c` ("Add node recovery callbacks") introduced the first usage of strlcpy(). - commit `b97e92574c` ("Use separate bitmaps for each nodes in the cluster") introduced the second strlcpy(). this time, the two strlcpy() are same, we can remove anyone safely. - commit `d3b178adb3` ("md: Skip cluster setup for dm-raid") added dm-raid special handling. And the "nodes" value is the key of this patch. but from this patch, strlcpy() which was introduced by `b97e92574c` become necessary. - commit `3c462c880b` ("md: Increment version for clustered bitmaps") used clustered major version to only handle in clustered env. this patch could look a polishment for clustered code logic. So `cf921cc19c` became useless after `d3b178adb3`, we could remove it safely. Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:36 -07:00
Heming Zhao	e68cb83a57	md/bitmap: don't set sb values if can't pass sanity check If bitmap area contains invalid data, kernel will crash then mdadm triggers "Segmentation fault". This is cluster-md speical bug. In non-clustered env, mdadm will handle broken metadata case. In clustered array, only kernel space handles bitmap slot info. But even this bug only happened in clustered env, current sanity check is wrong, the code should be changed. How to trigger: (faulty injection) dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sda dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sdb mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb mdadm -Ss echo aaa > magic.txt == below modifying slot 2 bitmap data == dd if=magic.txt of=/dev/sda seek=16384 bs=1 count=3 <== destroy magic dd if=/dev/zero of=/dev/sda seek=16436 bs=1 count=4 <== ZERO chunksize mdadm -A /dev/md0 /dev/sda /dev/sdb == kernel crashes. mdadm outputs "Segmentation fault" == Reason of kernel crash: In md_bitmap_read_sb (called by md_bitmap_create), bad bitmap magic didn't block chunksize assignment, and zero value made DIV_ROUND_UP_SECTOR_T() trigger "divide error". Crash log: kernel: md: md0 stopped. kernel: md/raid1:md0: not clean -- starting background reconstruction kernel: md/raid1:md0: active with 2 out of 2 mirrors kernel: dlm: ... ... kernel: md-cluster: Joined cluster 44810aba-38bb-e6b8-daca-bc97a0b254aa slot 1 kernel: md0: invalid bitmap file superblock: bad magic kernel: md_bitmap_copy_from_slot can't get bitmap from slot 2 kernel: md-cluster: Could not gather bitmaps from slot 2 kernel: divide error: 0000 [#1] SMP NOPTI kernel: CPU: 0 PID: 1603 Comm: mdadm Not tainted 5.14.6-1-default kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod] kernel: RSP: 0018:ffffc22ac0843ba0 EFLAGS: 00010246 kernel: ... ... kernel: Call Trace: kernel: ? dlm_lock_sync+0xd0/0xd0 [md_cluster 77fe..7a0] kernel: md_bitmap_copy_from_slot+0x2c/0x290 [md_mod 24ea..d3a] kernel: load_bitmaps+0xec/0x210 [md_cluster 77fe..7a0] kernel: md_bitmap_load+0x81/0x1e0 [md_mod 24ea..d3a] kernel: do_md_run+0x30/0x100 [md_mod 24ea..d3a] kernel: md_ioctl+0x1290/0x15a0 [md_mod 24ea....d3a] kernel: ? mddev_unlock+0xaa/0x130 [md_mod 24ea..d3a] kernel: ? blkdev_ioctl+0xb1/0x2b0 kernel: block_ioctl+0x3b/0x40 kernel: __x64_sys_ioctl+0x7f/0xb0 kernel: do_syscall_64+0x59/0x80 kernel: ? exit_to_user_mode_prepare+0x1ab/0x230 kernel: ? syscall_exit_to_user_mode+0x18/0x40 kernel: ? do_syscall_64+0x69/0x80 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae kernel: RIP: 0033:0x7f4a15fa722b kernel: ... ... kernel: ---[ end trace 8afa7612f559c868 ]--- kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod] Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:35 -07:00
Xiaomeng Tong	64c54d9244	md: fix an incorrect NULL check in md_reload_sb The bug is here: if (!rdev \|\| rdev->desc_nr != nr) { The list iterator value 'rdev' will always be set and non-NULL by rdev_for_each_rcu(), so it is incorrect to assume that the iterator value will be NULL if the list is empty or no element found (In fact, it will be a bogus pointer to an invalid struct object containing the HEAD). Otherwise it will bypass the check and lead to invalid memory access passing the check. To fix the bug, use a new variable 'iter' as the list iterator, while using the original variable 'pdev' as a dedicated pointer to point to the found element. Cc: stable@vger.kernel.org Fixes: `70bcecdb15` ("md-cluster: Improve md_reload_sb to be less error prone") Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:35 -07:00
Xiaomeng Tong	fc8738343e	md: fix an incorrect NULL check in does_sb_need_changing The bug is here: if (!rdev) The list iterator value 'rdev' will always be set and non-NULL by rdev_for_each(), so it is incorrect to assume that the iterator value will be NULL if the list is empty or no element found. Otherwise it will bypass the NULL check and lead to invalid memory access passing the check. To fix the bug, use a new variable 'iter' as the list iterator, while using the original variable 'rdev' as a dedicated pointer to point to the found element. Cc: stable@vger.kernel.org Fixes: `2aa82191ac` ("md-cluster: Perform a lazy update") Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:35 -07:00
Mariusz Tkaczyk	57668f0a4c	raid5: introduce MD_BROKEN Raid456 module had allowed to achieve failed state. It was fixed by `fb73b357fb` ("raid5: block failing device if raid will be failed"). This fix introduces a bug, now if raid5 fails during IO, it may result with a hung task without completion. Faulty flag on the device is necessary to process all requests and is checked many times, mainly in analyze_stripe(). Allow to set faulty on drive again and set MD_BROKEN if raid is failed. As a result, this level is allowed to achieve failed state again, but communication with userspace (via -EBUSY status) will be preserved. This restores possibility to fail array via #mdadm --set-faulty command and will be fixed by additional verification on mdadm side. Reproduction steps: mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1 mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean mkfs.xfs /dev/md126 -f mount /dev/md126 /mnt/root/ fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 & echo 1 > /sys/block/nvme2n1/device/device/remove echo 1 > /sys/block/nvme1n1/device/device/remove [ 1475.787779] Call Trace: [ 1475.793111] __schedule+0x2a6/0x700 [ 1475.799460] schedule+0x38/0xa0 [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456] [ 1475.813856] ? finish_wait+0x80/0x80 [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456] [ 1475.828281] ? finish_wait+0x80/0x80 [ 1475.834727] ? finish_wait+0x80/0x80 [ 1475.841127] ? finish_wait+0x80/0x80 [ 1475.847480] md_handle_request+0x119/0x190 [ 1475.854390] md_make_request+0x8a/0x190 [ 1475.861041] generic_make_request+0xcf/0x310 [ 1475.868145] submit_bio+0x3c/0x160 [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60 [ 1475.882070] iomap_dio_bio_actor+0x175/0x390 [ 1475.889149] iomap_apply+0xff/0x310 [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.909974] iomap_dio_rw+0x2f2/0x490 [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.923680] ? atime_needs_update+0x77/0xe0 [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs] [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs] [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs] [ 1475.953403] aio_read+0xd5/0x180 [ 1475.959395] ? _cond_resched+0x15/0x30 [ 1475.965907] io_submit_one+0x20b/0x3c0 [ 1475.972398] __x64_sys_io_submit+0xa2/0x180 [ 1475.979335] ? do_io_getevents+0x7c/0xc0 [ 1475.986009] do_syscall_64+0x5b/0x1a0 [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 1476.000255] RIP: 0033:0x7f11fc27978d [ 1476.006631] Code: Bad RIP value. [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds. Cc: stable@vger.kernel.org Fixes: `fb73b357fb` ("raid5: block failing device if raid will be failed") Reviewd-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:35 -07:00
Mariusz Tkaczyk	9631abdbf4	md: Set MD_BROKEN for RAID1 and RAID10 There is no direct mechanism to determine raid failure outside personality. It is done by checking rdev->flags after executing md_error(). If "faulty" flag is not set then -EBUSY is returned to userspace. -EBUSY means that array will be failed after drive removal. Mdadm has special routine to handle the array failure and it is executed if -EBUSY is returned by md. There are at least two known reasons to not consider this mechanism as correct: 1. drive can be removed even if array will be failed[1]. 2. -EBUSY seems to be wrong status. Array is not busy, but removal process cannot proceed safe. -EBUSY expectation cannot be removed without breaking compatibility with userspace. In this patch first issue is resolved by adding support for MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added in next commit. The idea is to set the MD_BROKEN if we are sure that raid is in failed state now. This is done in each error_handler(). In md_error() MD_BROKEN flag is checked. If is set, then -EBUSY is returned to userspace. As in previous commit, it causes that #mdadm --set-faulty is able to fail array. Previously proposed workaround is valid if optional functionality[1] is disabled. [1] commit 9a567843f7ce("md: allow last device to be forcibly removed from RAID1/RAID10.") Reviewd-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:34 -07:00
Linus Torvalds	8467f9e349	block-5.18-2022-04-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmJjYKoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqOtEADHSqgTsfwwsYOq15ups32oEdoFtPJ0Fl1C RUWLOzqskk2AV9Foj9TeGSqp448AW+sgV9P6NjrUR7nrefRtYwEdTx2ou1oEwaq0 wl/wsZGg9BiX2w6kBSDRxoYZL220n4B2iscg76VOb6+9VeLdD5s/qeL2P912Bre8 zAEfdGvmLK7gF/6Oxo0fqpzwWyQ7O8+wBTBbpqcwZPImPFsbuWLMG4fgFINb20fD UsVIXWEyV1UlWOk2v4GJl3rToR1f6l7tVT9gUYrRIt9iPTDH86rHkAiLg4QCbfZA 556xHivv5Dw91JJfGDCeiMtq9kjeIRiwpOTNiP1Av5MTM0TC8BaXtkpfwhO5r3iZ tbc2M/7Q8pOM0ati+HWBOp6Gi9hjzqwObGllKJnD1MZ/ESx88z2vlIcQYyBPkNOO giNmP52sgN/5ToMqW25AFbSHJ5Tb+p0epEbbJOv3m+0sSR2TnQ9/DweDBbsOs3Yw KyaHb1e7WxqtIn+4t1r2GgXdDJkQrMcdonkLu+i4Yoz0v4rfRGfg89F+LNylq1kI hng/xPsbgSAjn1j/1EkCKKjX0+R6Blhc/I99EucD9zaOmHYnx6hO0KySJXQzIAo7 lalfsE5bx1LWR1yC/KJ0To6TXQo9Onf5IIUPs1duiHnTFYgpM9z5sOmcyRL+3OtF Rvzw0yR37Q== =hxoY -----END PGP SIGNATURE----- Merge tag 'block-5.18-2022-04-22' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Just two small regression fixes for bcache" * tag 'block-5.18-2022-04-22' of git://git.kernel.dk/linux-block: bcache: fix wrong bdev parameter when calling bio_alloc_clone() in do_bio_hook() bcache: put bch_bio_map() back to correct location in journal_write_unlocked()	2022-04-23 09:46:44 -07:00
Coly Li	9dca4168a3	bcache: fix wrong bdev parameter when calling bio_alloc_clone() in do_bio_hook() Commit `abfc426d1b` ("block: pass a block_device to bio_clone_fast") calls the modified bio_alloc_clone() in bcache code as: bio_init_clone(bio->bi_bdev, bio, orig_bio, GFP_NOIO); But the first parameter is wrong, where bio->bi_bdev should be orig_bio->bi_bdev. The wrong bi_bdev panics the kernel when submitting cache bio. This patch fixes the wrong bdev parameter usage and avoid the panic. Fixes: `abfc426d1b` ("block: pass a block_device to bio_clone_fast") Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220419160425.4148-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-04-19 11:28:17 -06:00
Coly Li	ff2695e52c	bcache: put bch_bio_map() back to correct location in journal_write_unlocked() Commit `a7c50c9404` ("block: pass a block_device and opf to bio_reset") moves bch_bio_map() inside journal_write_unlocked() next to the location where the modified bio_reset() was called. This change is wrong because calling bch_bio_map() immediately after bio_reset(), a BUG_ON(!bio->bi_iter.bi_size) inside bch_bio_map() will be triggered and panic the kernel. This patch puts bch_bio_map() back to its original correct location in journal_write_unlocked() and avoid the BUG_ON(). Fixes: `a7c50c9404` ("block: pass a block_device and opf to bio_reset") Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220419160425.4148-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-04-19 11:28:17 -06:00
Christoph Hellwig	44abff2c0b	block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD Secure erase is a very different operation from discard in that it is a data integrity operation vs hint. Fully split the limits and helper infrastructure to make the separation more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2] Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Chao Yu <chao@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-04-17 19:49:59 -06:00
Christoph Hellwig	70200574cc	block: remove QUEUE_FLAG_DISCARD Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-04-17 19:49:59 -06:00
Christoph Hellwig	cf0fbf894b	block: add a bdev_max_discard_sectors helper Add a helper to query the number of sectors support per each discard bio based on the block device and use this helper to stop various places from poking into the request_queue to see if discard is supported and if so how much. This mirrors what is done e.g. for write zeroes as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-24-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-04-17 19:49:59 -06:00
Christoph Hellwig	36d254893a	block: add a bdev_stable_writes helper Add a helper to check the stable writes flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-04-17 19:49:59 -06:00
Christoph Hellwig	10f0d2a517	block: add a bdev_nonrot helper Add a helper to check the nonrot flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-04-17 19:49:59 -06:00
Christoph Hellwig	066ff57101	block: turn bio_kmalloc into a simple kmalloc wrapper Remove the magic autofree semantics and require the callers to explicitly call bio_init to initialize the bio. This allows bio_free to catch accidental bio_put calls on bio_init()ed bios as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de> Acked-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20220406061228.410163-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-04-17 19:30:41 -06:00
Shin'ichiro Kawasaki	92b914e29a	dm: fix bio length of empty flush The commit `92986f6b4c` ("dm: use bio_clone_fast in alloc_io/alloc_tio") removed bio_clone_fast() call from alloc_tio() when ci->io->tio is available. In this case, ci->bio is not copied to ci->io->tio.clone. This is fine since init_clone_info() sets same values to ci->bio and ci->io->tio.clone. However, when incoming bios have REQ_PREFLUSH flag, __send_empty_flush() prepares a zero length bio on stack and set it to ci->bio. At this time, ci->io->tio.clone still keeps non-zero length. When alloc_tio() chooses this ci->io->tio.clone as the bio to map, it is passed to targets as non-empty flush bio. It causes bio length check failure in dm-zoned and unexpected operation such as dm_accept_partial_bio() call. To avoid the non-empty flush bio, set zero length to ci->io->tio.clone in __send_empty_flush(). Fixes: `92986f6b4c` ("dm: use bio_clone_fast in alloc_io/alloc_tio") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-04-15 16:16:09 -04:00
Mike Snitzer	7dd06a2548	dm: allow dm_accept_partial_bio() for dm_io without duplicate bios The intent behind commit `e6fc9f62ce` ("dm: flag clones created by __send_duplicate_bios") was to formally disallow the use of dm_accept_partial_bio() where it simply isn't possible -- due to constraint that multiple bios cannot meaningfully update a shared tio->len_ptr. But that commit went too far and disallowed the case where "abormal" IO (e.g. WRITE_ZEROES) is only using a single bio. Fix this by not marking a dm_io with a single dm_target_io (and bio), that happens to be created by __send_duplicate_bios, as DM_TIO_IS_DUPLICATE_BIO. Also remove 'unsigned *len' parameter from alloc_multiple_bios(). This commit fixes a dm_accept_partial_bio() BUG_ON() with dm-zoned when a WRITE_ZEROES bio is issued. Fixes: `655f3aad7a` ("dm: switch dm_target_io booleans over to proper flags") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-04-14 20:01:54 -04:00
Mike Snitzer	73d7b06e90	dm zone: fix NULL pointer dereference in dm_zone_map_bio Commit `0fbb4d93b3` ("dm: add dm_submit_bio_remap interface") changed the alloc_io() function to delay the initialization of struct dm_io's orig_bio member, leaving it NULL until after the dm_io and associated user submitted bio is processed by __split_and_process_bio(). This change causes a NULL pointer dereference in dm_zone_map_bio() when the original user bio is inspected to detect the need for zone append command emulation. Fix this NULL pointer by updating dm_zone_map_bio() to not access ->orig_bio when the same info can be accessed from the clone of the ->orig_bio _before_ any ->map processing. Save off the bio_op() and bio_sectors() for the clone and then use the saved orig_bio_details as needed. Fixes: `0fbb4d93b3` ("dm: add dm_submit_bio_remap interface") Reported-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-04-13 13:22:17 -04:00
Khazhismel Kumykov	ce40426fdc	dm mpath: only use ktime_get_ns() in historical selector Mixing sched_clock() and ktime_get_ns() usage will give bad results. Switch hst_select_path() from using sched_clock() to ktime_get_ns(). Also rename path_service_time()'s 'sched_now' variable to 'now'. Fixes: `2613eab119` ("dm mpath: add Historical Service Time Path Selector") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-04-13 13:22:16 -04:00
Mikulas Patocka	08c1af8f1c	dm integrity: fix memory corruption when tag_size is less than digest size It is possible to set up dm-integrity in such a way that the "tag_size" parameter is less than the actual digest size. In this situation, a part of the digest beyond tag_size is ignored. In this case, dm-integrity would write beyond the end of the ic->recalc_tags array and corrupt memory. The corruption happened in integrity_recalc->integrity_sector_checksum->crypto_shash_final. Fix this corruption by increasing the tags array so that it has enough padding at the end to accomodate the loop in integrity_recalc() being able to write a full digest size for the last member of the tags array. Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-04-13 12:38:49 -04:00
Linus Torvalds	fe35fdb305	- Fix DM integrity shrink crash due to journal entry not being marked unused. - Fix DM bio polling to handle possibility that underlying device(s) return BLK_STS_AGAIN during submission. - Fix dm_io and dm_target_io flags race condition on Alpha. - Add some pr_err debugging to help debug cases when DM ioctl structure is corrupted. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmJHUFIACgkQxSPxCi2d A1qclwf/binqM95fkE2PW6Uvqm7EEsiWXj8Bu+5NQqvlhrLhVTeMdxOKYLClOZ0o eY0TLmkCCtrzveHZYbVcV/HM/tUY5qFT4gZbsXwW+HTSxOPeiOx5ZOOthP3kZbU0 zKpu4N1udfg7m8wSnE7o0cJKBrea6fUf5YDpUT+9EmMhoL0a8IJ3Y1TSJZ/CDJEm OptDS2QH+pFeiIt+lCPL18BT+KnW48Cofpx6ZAGnphLTfSeu8ZiMCew0LhQcepCf L6jfnQQmyGYBydTVSTzP3bbP/GNxPjDZuMT/z/sdMGrQmIxZT+/UqCl+kzD/jC7q 1Sjha+8gYo5a2cMTJZSKmFBKjvLYJQ== =dWLx -----END PGP SIGNATURE----- Merge tag 'for-5.18/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM integrity shrink crash due to journal entry not being marked unused. - Fix DM bio polling to handle possibility that underlying device(s) return BLK_STS_AGAIN during submission. - Fix dm_io and dm_target_io flags race condition on Alpha. - Add some pr_err debugging to help debug cases when DM ioctl structure is corrupted. * tag 'for-5.18/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: fix bio polling to handle possibile BLK_STS_AGAIN dm: fix dm_io and dm_target_io flags race condition on Alpha dm integrity: set journal entry unused when shrinking device dm ioctl: log an error if the ioctl structure is corrupted	2022-04-01 15:57:27 -07:00
Ming Lei	5291984004	dm: fix bio polling to handle possibile BLK_STS_AGAIN Expanded testing of DM's bio polling support (using more fio threads to dm-linear ontop of null_blk) exposed the possibility for polled bios to hang (repeatedly polling in io_uring) when null_blk responds with BLK_STS_AGAIN (due to lack of resources): 1) io_complete_rw_iopoll() is called from blkdev_bio_end_io_async() to notify kiocb is done, that is the completion interface between block layer and io_uring 2) io_complete_rw_iopoll() is called from io_do_iopoll() 3) dm returns BLK_STS_AGAIN for one bio (on behalf of underlying driver), then io_complete_rw_iopoll is called, but io_do_iopoll() doesn't handle -EAGAIN at all (due to logic in io_rw_should_reissue) 4) reason for dm's BLK_STS_AGAIN is underlying null_blk driver ran out of requests (easier to reproduce by setting low hw_queue_depth). 5) dm should handle BLK_STS_AGAIN for POLLED underlying IO, and may retry in dm layer. This fix adds REQ_POLLED specific BLK_STS_AGAIN handling to dm_io_complete() that clears REQ_POLLED and requeues the bio to DM using queue_io(). Fixes: `b99fdcdc36` ("dm: support bio polling") Signed-off-by: Ming Lei <ming.lei@redhat.com> [snitzer: revised header, reused dm_io_complete's REQ_POLLED case] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-04-01 13:23:12 -04:00
Mikulas Patocka	aad5b23ebf	dm: fix dm_io and dm_target_io flags race condition on Alpha Early alpha processors cannot write a single byte or short; they read 8 bytes, modify the value in registers and write back 8 bytes. This could cause race condition in the structure dm_io - if the fields flags and io_count are modified simultaneously. Fix this bug by using 32-bit flags if we are on Alpha and if we are compiling for a processor that doesn't have the byte-word-extension. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `bd4a6dd241` ("dm: reduce size of dm_io and dm_target_io structs") [snitzer: Jens allowed this change since Mikulas owns a relevant Alpha!] Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-04-01 13:19:27 -04:00
Mikulas Patocka	cc09e8a9de	dm integrity: set journal entry unused when shrinking device Commit `f6f72f32c2` ("dm integrity: don't replay journal data past the end of the device") skips journal replay if the target sector points beyond the end of the device. Unfortunatelly, it doesn't set the journal entry unused, which resulted in this BUG being triggered: BUG_ON(!journal_entry_is_unused(je)) Fix this by calling journal_entry_set_unused() for this case. Fixes: `f6f72f32c2` ("dm integrity: don't replay journal data past the end of the device") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Milan Broz <gmazyland@gmail.com> [snitzer: revised header] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-04-01 10:31:23 -04:00
Mikulas Patocka	dbdcc906d9	dm ioctl: log an error if the ioctl structure is corrupted This will help triage bugs when userspace is passing invalid ioctl structure to the kernel. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> [snitzer: log errors using DMERR instead of DMWARN] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-04-01 10:29:43 -04:00
Linus Torvalds	266d17a8c0	Driver core changes for 5.18-rc1 Here is the set of driver core changes for 5.18-rc1. Not much here, primarily it was a bunch of cleanups and small updates: - kobj_type cleanups for default_groups - documentation updates - firmware loader minor changes - component common helper added and take advantage of it in many drivers (the largest part of this pull request). There will be a merge conflict in drivers/power/supply/ab8500_chargalg.c with your tree, the merge conflict should be easy (take all the changes). All of these have been in linux-next for a while with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYkG6PA8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ylMFwCfSIyAU4oLEgj+/Rfmx4o45cAVIWMAnit3zbdU wUUCGqKcOnTJEcW6dMPh =1VVi -----END PGP SIGNATURE----- Merge tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core changes for 5.18-rc1. Not much here, primarily it was a bunch of cleanups and small updates: - kobj_type cleanups for default_groups - documentation updates - firmware loader minor changes - component common helper added and take advantage of it in many drivers (the largest part of this pull request). All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits) Documentation: update stable review cycle documentation drivers/base/dd.c : Remove the initial value of the global variable Documentation: update stable tree link Documentation: add link to stable release candidate tree devres: fix typos in comments Documentation: add note block surrounding security patch note samples/kobject: Use sysfs_emit instead of sprintf base: soc: Make soc_device_match() simpler and easier to read driver core: dd: fix return value of __setup handler driver core: Refactor sysfs and drv/bus remove hooks driver core: Refactor multiple copies of device cleanup scripts: get_abi.pl: Fix typo in help message kernfs: fix typos in comments kernfs: remove unneeded #if 0 guard ALSA: hda/realtek: Make use of the helper component_compare_dev_name video: omapfb: dss: Make use of the helper component_compare_dev power: supply: ab8500: Make use of the helper component_compare_dev ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of iommu/mediatek: Make use of the helper component_compare/release_of drm: of: Make use of the helper component_release_of ...	2022-03-28 12:41:28 -07:00
Linus Torvalds	561593a048	for-5.18/write-streams-2022-03-18 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI1AHwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplPjEACVJzKg5NkxpdkDThvq5tejws9KxB/4mHJg NoDMcv1TF+Orsd/HNW6XrgYnbU0ObHom3568/xb8BNegRVFe7V4ME/4IYNRyGOmV qbfciu04L1UkJhI52CIidkOioFABL3r1zgLCIz5vk0Cv9X7Le9x0UabHxJf7u9S+ Z3lNdyxezN0SGx8VT86l/7lSoHtG3VHO9IsQCuNGF02SB+6uGpXBlptbEoQ4nTxd T7/H9FNOe2Wf7eKvcOOds8UlvZYAfYcY0GcRrIOXdHIy25mKFWwn5cDgFTMOH5ID xXpm+JFkDkrfSW1o4FFPxbN9Z6RbVXbGCsrXlIragLO2MJQdXiIUxS1OPT5oAado H9MlX6QtkwziLW9zUWa/N/jmRjc2vzHAxD6JFg/wXxNdtY0kd8TQpaxwTB8mVDPN VCGutt7lJS1CQInQ+ppzbdqzzuLHC1RHAyWSmfUE9rb8cbjxtJBnSIorYRLUesMT GRwqVTXW0osxSgCb1iDiBCJANrX1yPZcemv4Wh1gzbT6IE9sWxWXsE5sy9KvswNc M+E4nu/TYYTfkynItJjLgmDLOoi+V0FBY6ba0mRPBjkriSP4AVlwsZLGVsAHQzuA o5paW1GjRCCwhIQ6+AzZIoOz6wqvprBlUgUkUneyYAQ2ZKC3pZi8zPnpoVdFucVa VaTzP71C1Q== =efaq -----END PGP SIGNATURE----- Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint	2022-03-26 11:51:46 -07:00
Linus Torvalds	6f2689a766	SCSI misc on 20220324 This series consists of the usual driver updates (qla2xxx, pm8001, libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates and bug fixes. The high blast radius core update is the removal of write same, which affects block and several non-SCSI devices. The other big change, which is more local, is the removal of the SCSI pointer. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYjzDQyYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishQMYAQDEWUGV 6U0+736AHVtOfiMNfiRN79B1HfXVoHvemnPcTwD/UlndwFfy/3GGOtoZmqEpc73J Ec1HDuUCE18H1H2QAh0= =/Ty9 -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (qla2xxx, pm8001, libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates and bug fixes. The high blast radius core update is the removal of write same, which affects block and several non-SCSI devices. The other big change, which is more local, is the removal of the SCSI pointer" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits) scsi: scsi_ioctl: Drop needless assignment in sg_io() scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn() scsi: lpfc: Copyright updates for 14.2.0.0 patches scsi: lpfc: Update lpfc version to 14.2.0.0 scsi: lpfc: SLI path split: Refactor BSG paths scsi: lpfc: SLI path split: Refactor Abort paths scsi: lpfc: SLI path split: Refactor SCSI paths scsi: lpfc: SLI path split: Refactor CT paths scsi: lpfc: SLI path split: Refactor misc ELS paths scsi: lpfc: SLI path split: Refactor VMID paths scsi: lpfc: SLI path split: Refactor FDISC paths scsi: lpfc: SLI path split: Refactor LS_RJT paths scsi: lpfc: SLI path split: Refactor LS_ACC paths scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4 scsi: lpfc: SLI path split: Refactor lpfc_iocbq scsi: lpfc: Use kcalloc() ...	2022-03-24 19:37:53 -07:00
Linus Torvalds	b1f8ccdaae	- Significant refactoring and fixing of how DM core does bio-based IO accounting with focus on fixing wildly inaccurate IO stats for dm-crypt (and other DM targets that defer bio submission in their own workqueues). End result is proper IO accounting, made possible by targets being updated to use the new dm_submit_bio_remap() interface. - Add hipri bio polling support (REQ_POLLED) to bio-based DM. - Reduce dm_io and dm_target_io structs so that a single dm_io (which contains dm_target_io and first clone bio) weighs in at 256 bytes. For reference the bio struct is 128 bytes. - Various other small cleanups, fixes or improvements in DM core and targets. - Update MAINTAINERS with my kernel.org email address to allow distinction between my "upstream" and "Red" Hats. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmI7ghIACgkQxSPxCi2d A1q04wgAzaRu186WjfCO8MK0uyv3S52Rw1EsgYealAqoPwQJ9KkW2icvjtwRL+fJ 1+w6qE/Da6QdwXj9lGtp1XIXJFipNJSw3PSaE/tV2cXiBemZlzJ5vR6F6dfeYKmV /sGas46H2l+aD4Xr7unUmcN/AYrNIFtnucClY3+DlJFPesXQQc9a/XmL9RX9MrN4 MS9wLkh/5QSG3zReEct/4GVmNSJAjFfLkkeFHtLN82jvvDmnszRT5+aJ06WkXeOz OZmQfOPnJv5MnFUz9DOaRb/fTCoyxzxLnNM5Lt3jyFPk9Jf8Qz9TJ2rgskxsE83u UsCD/Y/QAdDcrRVB5SS6+yx4AS6uSA== =cinj -----END PGP SIGNATURE----- Merge tag 'for-5.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Significant refactoring and fixing of how DM core does bio-based IO accounting with focus on fixing wildly inaccurate IO stats for dm-crypt (and other DM targets that defer bio submission in their own workqueues). End result is proper IO accounting, made possible by targets being updated to use the new dm_submit_bio_remap() interface. - Add hipri bio polling support (REQ_POLLED) to bio-based DM. - Reduce dm_io and dm_target_io structs so that a single dm_io (which contains dm_target_io and first clone bio) weighs in at 256 bytes. For reference the bio struct is 128 bytes. - Various other small cleanups, fixes or improvements in DM core and targets. - Update MAINTAINERS with my kernel.org email address to allow distinction between my "upstream" and "Red" Hats. * tag 'for-5.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (46 commits) dm: consolidate spinlocks in dm_io struct dm: reduce size of dm_io and dm_target_io structs dm: switch dm_target_io booleans over to proper flags dm: switch dm_io booleans over to proper flags dm: update email address in MAINTAINERS dm: return void from __send_empty_flush dm: factor out dm_io_complete dm cache: use dm_submit_bio_remap dm: simplify dm_sumbit_bio_remap interface dm thin: use dm_submit_bio_remap dm: add WARN_ON_ONCE to dm_submit_bio_remap dm: support bio polling block: add ->poll_bio to block_device_operations dm mpath: use DMINFO instead of printk with KERN_INFO dm: stop using bdevname dm-zoned: remove the ->name field in struct dmz_dev dm: remove unnecessary local variables in __bind dm: requeue IO if mapping table not yet available dm io: remove stale comment block for dm_io() dm thin metadata: remove unused dm_thin_remove_block and __remove ...	2022-03-24 19:25:24 -07:00
Linus Torvalds	69d1dea852	for-5.18/drivers-2022-03-18 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI0/QUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpn8GEACRVxJaJV5qjZfoFAQKoAWJEtquwjeARyB+ 0V8ROWHDWHSacdug9wBytayiS1lz2zmUHJ6YXyts2dn0v6CrK4s8yGzk5G/RgH6+ 6M3GmBKjj+r1DfE8L3OoQWkDR1JFPuFxXTG/uBd7fBY2Excih1Z0D2lpspMleIRf w8zBrlWrWH8lZlm6HF3fadjEoiWhOM5F4Ofz3eg/PAQrHuD06z8hjQgMeR0jQVzw bWF9jrdNIplxRjNWIwCTsQRM+z5KQhUGwDODJjIwdQtVaKSt9D99ZbeKTudlslQ2 zrizsCq8P1RjBPcrA45FV6QnT9DIRRGrYzHD63qC6fDae34rbzdSHUwRMP2XSxo8 +hT1AzGypiBauODTPzHFtTskaQ0KibLznEanChh/ThySmNYcEVAljSx3Z5Vo81J+ IqJYK2m3RESCFruy9w3U/P7qiXZmqYldPfjxAKq8ucg6x1PU3XRAVm7SI/i4l75D Crk1ujj2LJgsyxL6qMrK3XUavl1SJdzWeFSarcCt3m4m11EWWfYzmG8Yn8OE2CEZ a2CAyDsRi8CZ3hvkaMwigL4wBJjrrig8vyIgok3VrfCmYlNNqMQqM5Rw7vzjR3v1 cKewI3rQjkFXEaveIXyGPTI/0Da4cT0DOfn/Mws9MDUXNPlFMNEDUZkPuzMywiTB 2SWDLTe77g== =993h -----END PGP SIGNATURE----- Merge tag 'for-5.18/drivers-2022-03-18' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: - NVMe updates via Christoph: - add vectored-io support for user-passthrough (Kanchan Joshi) - add verbose error logging (Alan Adamson) - support buffered I/O on block devices in nvmet (Chaitanya Kulkarni) - central discovery controller support (Martin Belanger) - fix and extended the globally unique idenfier validation (Christoph) - move away from the deprecated IDA APIs (Sagi Grimberg) - misc code cleanup (Keith Busch, Max Gurtovoy, Qinghua Jin, Chaitanya Kulkarni) - add lockdep annotations for in-kernel sockets (Chris Leech) - use vmalloc for ANA log buffer (Hannes Reinecke) - kerneldoc fixes (Chaitanya Kulkarni) - cleanups (Guoqing Jiang, Chaitanya Kulkarni, Christoph) - warn about shared namespaces without multipathing (Christoph) - MD updates via Song with a set of cleanups (Christoph, Mariusz, Paul, Erik, Dirk) - loop cleanups and queue depth configuration (Chaitanya) - null_blk cleanups and fixes (Chaitanya) - Use descriptive init/exit names in virtio_blk (Randy) - Use bvec_kmap_local() in drivers (Christoph) - bcache fixes (Mingzhe) - xen blk-front persistent grant speedups (Juergen) - rnbd fix and cleanup (Gioh) - Misc fixes (Christophe, Colin) * tag 'for-5.18/drivers-2022-03-18' of git://git.kernel.dk/linux-block: (76 commits) virtio_blk: eliminate anonymous module_init & module_exit nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH nvme: remove nvme_alloc_request and nvme_alloc_request_qid nvme: cleanup how disk->disk_name is assigned nvmet: move the call to nvmet_ns_changed out of nvmet_ns_revalidate nvmet: use snprintf() with PAGE_SIZE in configfs nvmet: don't fold lines nvmet-rdma: fix kernel-doc warning for nvmet_rdma_device_removal nvmet-fc: fix kernel-doc warning for nvmet_fc_unregister_targetport nvmet-fc: fix kernel-doc warning for nvmet_fc_register_targetport nvme-tcp: lockdep: annotate in-kernel sockets nvme-tcp: don't fold the line nvme-tcp: don't initialize ret variable nvme-multipath: call bio_io_error in nvme_ns_head_submit_bio nvme-multipath: use vmalloc for ANA log buffer xen/blkfront: speed up purge_persistent_grants() raid5: initialize the stripe_head embeeded bios as needed raid5-cache: statically allocate the recovery ra bio raid5-cache: fully initialize flush_bio when needed raid5-ppl: fully initialize the bio in ppl_new_iounit ...	2022-03-21 17:16:01 -07:00
Linus Torvalds	616355cc81	for-5.18/block-2022-03-18 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI0+GcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgprUpD/9aTJEnj7VCw7UouSsg098sdjtoy9ilslU3 ew47K8CIXHbCB4CDqLnFyvCwAdG1XGgS+fUmFAxvTr29R9SZeS5d+bXL6sZzEo0C bwxsJy9MM2QRtMvB+giAt1myXbwB8cG+ketMBWXqwXXRHRzPbbQfMZia7FqWMnfY KQanH9IwYHp1oa5U/W6Qcjm4oCnLgBMRwqByzUCtiF3y9qgaLkK+3IgkNwjJQjLA DTeUJ/9CgxGQQbzA+LPktbw2xfTqiUfcKq0mWx6Zt4wwNXn1ClqUDUXX6QSM8/5u 3OimbscSkEPPTIYZbVBPkhFnAlQb4JaJEgOrbXvYKVV2Dh+eZY81XwNeE/E8gdBY TnHOTOCjkN/4sR3hIrWazlJzPLdpPA0eOYrhguCraQsX9mcsYNxlJ9otRv/Ve99g uqL0RZg3+NoK84fm79FCGy/ZmPQJvJttlBT9CKVwylv/Lky42xWe7AdM3OipKluY 2nh+zN5Ai7WxZdTKXQFRhCSWfWQ+1qW51tB3dcGW+BooZr/oox47qKQVcHsEWbq1 RNR45F5a4AuPwYUHF/P36WviLnEuq9AvX7OTTyYOplyVQohKIoDXp9chVzLNzBiZ KBR00W6MLKKKN+8foalQWgNyb2i2PH7Ib4xRXvXj/22Vwxg5UmUoBmSDSas9SZUS +dMo7CtNgA== =DpgP -----END PGP SIGNATURE----- Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo) - blk-rq-qos completion fix (Tejun) - blk-cgroup merge fix (Tejun) - Add offline error return value to distinguish it from an IO error on the device (Song) - IO stats fixes (Zhang, Christoph) - blkcg refcount fixes (Ming, Yu) - Fix for indefinite dispatch loop softlockup (Shin'ichiro) - blk-mq hardware queue management improvements (Ming) - sbitmap dead code removal (Ming, John) - Plugging merge improvements (me) - Show blk-crypto capabilities in sysfs (Eric) - Multiple delayed queue run improvement (David) - Block throttling fixes (Ming) - Start deprecating auto module loading based on dev_t (Christoph) - bio allocation improvements (Christoph, Chaitanya) - Get rid of bio_devname (Christoph) - bio clone improvements (Christoph) - Block plugging improvements (Christoph) - Get rid of genhd.h header (Christoph) - Ensure drivers use appropriate flush helpers (Christoph) - Refcounting improvements (Christoph) - Queue initialization and teardown improvements (Ming, Christoph) - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng, Lukas, Nian, Yang, Eric, Chengming) * tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits) block: cancel all throttled bios in del_gendisk() block: let blkcg_gq grab request queue's refcnt block: avoid use-after-free on throttle data block: limit request dispatch loop duration block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative" sr: simplify the local variable initialization in sr_block_open() block: don't merge across cgroup boundaries if blkcg is enabled block: fix rq-qos breakage from skipping rq_qos_done_bio() block: flush plug based on hardware and software queue order block: ensure plug merging checks the correct queue at least once block: move rq_qos_exit() into disk_release() block: do more work in elevator_exit block: move blk_exit_queue into disk_release block: move q_usage_counter release into blk_queue_release block: don't remove hctx debugfs dir from blk_mq_exit_queue block: move blkcg initialization/destroy into disk allocation/release handler sr: implement ->free_disk to simplify refcounting sd: implement ->free_disk to simplify refcounting sd: delay calling free_opal_dev sd: call sd_zbc_release_disk before releasing the scsi_device reference ...	2022-03-21 16:48:55 -07:00
Mike Snitzer	4d7bca13dd	dm: consolidate spinlocks in dm_io struct No reason to have separate startio_lock and endio_lock given endio_lock could be used during submission anyway. This change leaves the dm_io struct weighing in at 256 bytes (down from 272 bytes, so saves a cacheline). Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-03-21 14:15:36 -04:00
Mike Snitzer	bd4a6dd241	dm: reduce size of dm_io and dm_target_io structs Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-03-21 14:15:35 -04:00
Mike Snitzer	655f3aad7a	dm: switch dm_target_io booleans over to proper flags Add flags to dm_target_io and manage them using the same pattern used for bi_flags in struct bio. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-03-21 14:15:35 -04:00
Mike Snitzer	82f6cdcc36	dm: switch dm_io booleans over to proper flags Add flags to dm_io and manage them using the same pattern used for bi_flags in struct bio. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-03-21 14:15:34 -04:00
Mike Snitzer	332f2b1e73	dm: return void from __send_empty_flush Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-03-10 16:52:11 -05:00
Mike Snitzer	e27363472f	dm: factor out dm_io_complete Optimizes dm_io_dec_pending() slightly by avoiding local variables. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-03-10 16:52:10 -05:00
Mike Snitzer	69596f555b	dm cache: use dm_submit_bio_remap Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-03-10 13:44:57 -05:00
Mike Snitzer	b7f8dff098	dm: simplify dm_sumbit_bio_remap interface Remove the from_wq argument from dm_sumbit_bio_remap(). Eliminates the need for dm_sumbit_bio_remap() callers to know whether they are calling for a workqueue or from the original dm_submit_bio(). Add map_task to dm_io struct, record the map_task in alloc_io and clear it after all target ->map() calls have completed. Update dm_sumbit_bio_remap to check if 'current' matches io->map_task rather than rely on passed 'from_rq' argument. This change really simplifies the chore of porting each DM target to using dm_sumbit_bio_remap() because there is no longer the risk of programming error by not completely knowing all the different contexts a particular method that calls dm_sumbit_bio_remap() might be used in. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-03-10 13:44:56 -05:00
Mike Snitzer	a92512819b	dm thin: use dm_submit_bio_remap Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-03-10 13:44:55 -05:00
Mike Snitzer	0a8e9599b9	dm: add WARN_ON_ONCE to dm_submit_bio_remap If a target uses dm_submit_bio_remap() it should set ti->accounts_remapped_io. Also, switch dm_start_io_acct() WARN_ON to WARN_ON_ONCE. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-03-10 13:44:43 -05:00
Ming Lei	b99fdcdc36	dm: support bio polling Support bio polling (REQ_POLLED) in the following approach: 1) only support io polling on normal READ/WRITE, and other abnormal IOs still fallback to IRQ mode, so the target io (and DM's clone bio) is exactly inside the dm io. 2) hold one refcnt on io->io_count after submitting this dm bio with REQ_POLLED 3) support dm native bio splitting, any dm io instance associated with current bio will be added into one list which head is bio->bi_private which will be recovered before ending this bio 4) implement .poll_bio() callback, call bio_poll() on the single target bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call dm_io_dec_pending() after the target io is done in .poll_bio() 5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL, which is based on Jeffle's previous patch. These changes are good for a 30-35% IOPS improvement for polled IO. For detailed test results please see (Jens, thanks for testing!): https://listman.redhat.com/archives/dm-devel/2022-March/049868.html or https://marc.info/?l=linux-block&m=164684246214700&w=2 Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-03-09 12:21:56 -05:00
Christoph Hellwig	03a6b195e8	raid5: initialize the stripe_head embeeded bios as needed Use bio_init to initialize the bios when needed to the full state instead of a partial initialization plus later setting of dev and op and bio_reset. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-03-08 22:55:13 -08:00
Christoph Hellwig	89f94b6440	raid5-cache: statically allocate the recovery ra bio There is no need to preallocate the bio and reset it when use. Just allocate it on-stack and use a bvec places next to the pages used for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-03-08 22:55:09 -08:00
Christoph Hellwig	0dd00cba99	raid5-cache: fully initialize flush_bio when needed Stop using bio_reset and just initialize the bio fully when needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-03-08 22:55:04 -08:00
Christoph Hellwig	9f7c3f837a	raid5-ppl: fully initialize the bio in ppl_new_iounit We have all the information to pass the bdev and op directly to bio_init, so do that. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-03-08 22:55:00 -08:00
Eric Dumazet	7d959f6e97	md: use msleep() in md_notify_reboot() Calling mdelay(1000) from process context, even while a reboot is in progress, does not make sense. Using msleep() allows other threads to make progress. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: linux-raid@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org>	2022-03-08 15:20:21 -08:00
Mariusz Tkaczyk	daae161fd2	md: raid1/raid10: drop pending_cnt Those counters are not necessary after commit 11bb45e8aaf6 ("md: drop queue limitation for RAID1 and RAID10"). Remove them from all code (conf and plug structs). raid1_plug_cb and raid10_plug_cb are identical, so move definition of raid1_plug_cb to common raid1-10 definitions and use it for RAID10 too. Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>	2022-03-08 15:16:54 -08:00
Christoph Hellwig	c75e707fe1	block: remove the per-bio/request write hint With the NVMe support for this gone, there are no consumers of these hints left, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-03-07 12:45:57 -07:00
Jens Axboe	b46bebaf2a	Merge branch 'for-5.18/drivers' into for-5.18/write-streams * for-5.18/drivers: (51 commits) bcache: fixup multiple threads crash bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing floppy: use memcpy_{to,from}_bvec drbd: use bvec_kmap_local in recv_dless_read drbd: use bvec_kmap_local in drbd_csum_bio bcache: use bvec_kmap_local in bio_csum nvdimm-btt: use bvec_kmap_local in btt_rw_integrity nvdimm-blk: use bvec_kmap_local in nd_blk_rw_integrity zram: use memcpy_from_bvec in zram_bvec_write zram: use memcpy_to_bvec in zram_bvec_read aoe: use bvec_kmap_local in bvcpy iss-simdisk: use bvec_kmap_local in simdisk_submit_bio nvme: check that EUI/GUID/UUID are globally unique nvme: check for duplicate identifiers earlier nvme: fix the check for duplicate unique identifiers nvme: cleanup __nvme_check_ids nvme: remove nssa from struct nvme_ctrl nvme: explicitly set non-error for directives nvme: expose cntrltype and dctype through sysfs nvme: send uevent on connection up ...	2022-03-07 12:44:39 -07:00
Jens Axboe	13400b1454	Merge branch 'for-5.18/block' into for-5.18/write-streams * for-5.18/block: (96 commits) block: remove bio_devname ext4: stop using bio_devname raid5-ppl: stop using bio_devname raid1: stop using bio_devname md-multipath: stop using bio_devname dm-integrity: stop using bio_devname dm-crypt: stop using bio_devname pktcdvd: remove a pointless debug check in pkt_submit_bio block: remove handle_bad_sector block: fix and cleanup bio_check_ro bfq: fix use-after-free in bfq_dispatch_request blk-crypto: show crypto capabilities in sysfs block: don't delete queue kobject before its children block: simplify calling convention of elv_unregister_queue() block: remove redundant semicolon block: default BLOCK_LEGACY_AUTOLOAD to y block: update io_ticks when io hang block, bfq: don't move oom_bfqq block, bfq: avoid moving bfqq to it's parent bfqg block, bfq: cleanup bfq_bfqq_to_bfqg() ...	2022-03-07 12:44:37 -07:00
Christoph Hellwig	c7dec4623c	raid5-ppl: stop using bio_devname Use the %pg format specifier to save on stack consuption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-03-07 06:42:33 -07:00
Christoph Hellwig	ac483eb375	raid1: stop using bio_devname Use the %pg format specifier to save on stack consuption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-03-07 06:42:33 -07:00
Christoph Hellwig	ee1925bd83	md-multipath: stop using bio_devname Use the %pg format specifier to save on stack consuption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-03-07 06:42:33 -07:00
Christoph Hellwig	0a806cfde8	dm-integrity: stop using bio_devname Use the %pg format specifier to save on stack consuption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-03-07 06:42:33 -07:00
Christoph Hellwig	6667171965	dm-crypt: stop using bio_devname Use the %pg format specifier to save on stack consuption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-03-07 06:42:33 -07:00
Mingzhe Zou	887554ab96	bcache: fixup multiple threads crash When multiple threads to check btree nodes in parallel, the main thread wait for all threads to stop or CACHE_SET_IO_DISABLE flag: wait_event_interruptible(check_state->wait, atomic_read(&check_state->started) == 0 \|\| test_bit(CACHE_SET_IO_DISABLE, &c->flags)); However, the bch_btree_node_read and bch_btree_node_read_done maybe call bch_cache_set_error, then the CACHE_SET_IO_DISABLE will be set. If the flag already set, the main thread return error. At the same time, maybe some threads still running and read NULL pointer, the kernel will crash. This patch change the event wait condition, the main thread must wait for all threads to stop. Fixes: `8e7102273f` ("bcache: make bch_btree_check() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Coly Li <colyli@suse.de>	2022-03-06 22:33:45 +08:00
Mingzhe Zou	7b1002f7cf	bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing When attaching a cached device (a.k.a backing device) to a cache device, bch_sectors_dirty_init() is called to count dirty sectors and stripes (see what bcache_dev_sectors_dirty_add() does) on the cache device. When bcache_dev_sectors_dirty_add() is called, set_bit(stripe, d->full_dirty_stripes) or clear_bit(stripe, d->full_dirty_stripes) operation will always be performed. In full_dirty_stripes, each 1bit represents stripe_size (8192) sectors (512B), so 1bit=4MB (8192*512), and each CPU cache line=64B=512bit=2048MB. When 20 threads process a cached disk with 100G dirty data, a single thread processes about 23M at a time, and 20 threads total 460M. These full_dirty_stripes bits corresponding to the 460M data is likely to fall in the same CPU cache line. When one of these threads performs a set_bit or clear_bit operation, the same CPU cache line of other threads will become invalid and must read the full_dirty_stripes from the main memory again. Compared with single thread, the time of a bcache_dev_sectors_dirty_add() call is increased by about 50 times in our test (100G dirty data, 20 threads, bcache_dev_sectors_dirty_add() is called more than 20 million times). This patch tries to test_bit before set_bit or clear_bit operation. Therefore, a lot of force set and clear operations will be avoided, and most of bcache_dev_sectors_dirty_add() calls will only read CPU cache line. Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de>	2022-03-06 22:33:37 +08:00
Christoph Hellwig	07fee7aba5	bcache: use bvec_kmap_local in bio_csum Using local kmaps slightly reduces the chances to stray writes, and the bvec interface cleans up the code a little bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20220303111905.321089-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-03-04 12:29:21 -07:00
Mike Snitzer	168678d765	dm mpath: use DMINFO instead of printk with KERN_INFO Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-03-02 12:20:22 -05:00
Christoph Hellwig	385411ffba	dm: stop using bdevname Just use the %pg format specifier instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-03-02 12:15:54 -05:00
Christoph Hellwig	977ff73e64	dm-zoned: remove the ->name field in struct dmz_dev Just use the %pg format specifier to print the block device name directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-03-02 12:15:35 -05:00
Greg Kroah-Hartman	4a248f85b3	Merge 5.17-rc6 into driver-core-next We need the driver core fix in here as well for future changes. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-02-28 07:45:41 +01:00
Christoph Hellwig	a773187e37	scsi: dm: Remove WRITE_SAME support There are no more end-users of REQ_OP_WRITE_SAME left, so we can start deleting it. Link: https://lore.kernel.org/r/20220209082828.2629273-7-hch@lst.de Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2022-02-22 21:11:08 -05:00
Christoph Hellwig	10fa225c33	scsi: md: Remove WRITE_SAME support There are no more end-users of REQ_OP_WRITE_SAME left, so we can start deleting it. Link: https://lore.kernel.org/r/20220209082828.2629273-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2022-02-22 21:11:08 -05:00
Mike Snitzer	f5b4aee10c	dm: remove unnecessary local variables in __bind Also remove empty newline before 'out:' label at end of __bind. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-22 13:55:53 -05:00
Mike Snitzer	fa247089de	dm: requeue IO if mapping table not yet available Update both bio-based and request-based DM to requeue IO if the mapping table not available. This race of IO being submitted before the DM device ready is so narrow, yet possible for initial table load given that the DM device's request_queue is created prior, that it best to requeue IO to handle this unlikely case. Reported-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-22 13:55:52 -05:00
Barry Song	a6a4901a5e	dm io: remove stale comment block for dm_io() Commit `7eaceaccab` ("block: remove per-queue plugging") dropped unplug_delay and blk_unplug(). Plus, the current kernel has no fundamental difference between sync_io() and async_io() except sync_io() uses sync_io_complete() as the notify.fn and explicitly calls wait_for_completion_io() to sync. The comment isn't valid any more. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-22 13:55:51 -05:00
Zhiqiang Liu	75274a4bf2	dm thin metadata: remove unused dm_thin_remove_block and __remove Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-22 13:55:50 -05:00
Wang Qing	8ca8b1e147	dm thin: use time_is_before_jiffies instead of open coding it Use time_is_before_jiffies() to improve code readability. Signed-off-by: Wang Qing <wangqing@vivo.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-22 11:30:50 -05:00
Aashish Sharma	6fc5150438	dm crypt: fix get_key_size compiler warning if !CONFIG_KEYS Explicitly convert unsigned int in the right of the conditional expression to int to match the left side operand and the return type, fixing the following compiler warning: drivers/md/dm-crypt.c:2593:43: warning: signed and unsigned type in conditional expression [-Wsign-compare] Fixes: `c538f6ec9f` ("dm crypt: add ability to use keys from the kernel key retention service") Signed-off-by: Aashish Sharma <shraash@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-22 11:27:56 -05:00
Kirill Tkhai	588b7f5df0	dm: fix use-after-free in dm_cleanup_zoned_dev() dm_cleanup_zoned_dev() uses queue, so it must be called before blk_cleanup_disk() starts its killing: blk_cleanup_disk->blk_cleanup_queue()->kobject_put()->blk_release_queue()-> ->...RCU...->blk_free_queue_rcu()->kmem_cache_free() Otherwise, RCU callback may be executed first and dm_cleanup_zoned_dev() will touch free'd memory: BUG: KASAN: use-after-free in dm_cleanup_zoned_dev+0x33/0xd0 Read of size 8 at addr ffff88805ac6e430 by task dmsetup/681 CPU: 4 PID: 681 Comm: dmsetup Not tainted 5.17.0-rc2+ #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x57/0x7d print_address_description.constprop.0+0x1f/0x150 ? dm_cleanup_zoned_dev+0x33/0xd0 kasan_report.cold+0x7f/0x11b ? dm_cleanup_zoned_dev+0x33/0xd0 dm_cleanup_zoned_dev+0x33/0xd0 __dm_destroy+0x26a/0x400 ? dm_blk_ioctl+0x230/0x230 ? up_write+0xd8/0x270 dev_remove+0x156/0x1d0 ctl_ioctl+0x269/0x530 ? table_clear+0x140/0x140 ? lock_release+0xb2/0x750 ? remove_all+0x40/0x40 ? rcu_read_lock_sched_held+0x12/0x70 ? lock_downgrade+0x3c0/0x3c0 ? rcu_read_lock_sched_held+0x12/0x70 dm_ctl_ioctl+0xa/0x10 __x64_sys_ioctl+0xb9/0xf0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fb6dfa95c27 Fixes: `bb37d77239` ("dm: introduce zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-22 11:13:28 -05:00
Jordy Zomer	cd9c88da17	dm ioctl: prevent potential spectre v1 gadget It appears like cmd could be a Spectre v1 gadget as it's supplied by a user and used as an array index. Prevent the contents of kernel memory from being leaked to userspace via speculative execution by using array_index_nospec. Signed-off-by: Jordy Zomer <jordy@pwning.systems> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-22 11:04:41 -05:00
Thore Sommer	118f31b496	dm ima: fix wrong length calculation for no_data string All entries measured by dm ima are prefixed by a version string (dm_version=N.N.N). When there is no data to measure, the entire buffer is overwritten with a string containing the version string again and the length of that string is added to the length of the version string. The new length is now wrong because it contains the version string twice. This caused entries like this: dm_version=4.45.0;name=test,uuid=test;table_clear=no_data; \ \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \ current_device_capacity=204808; Signed-off-by: Thore Sommer <public@thson.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-22 10:42:41 -05:00
Colin Ian King	302f035141	dm cache policy smq: make static read-only array table const The 'table' static array is read-only so it make sense to make it const. Add in the int type to clean up checkpatch warning. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-22 10:35:53 -05:00
Mike Snitzer	c357342186	dm delay: use dm_submit_bio_remap Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:34 -05:00
Mike Snitzer	e5524e128f	dm crypt: use dm_submit_bio_remap Care was taken to support kcryptd_io_read being called from crypt_map or workqueue. Use of an intermediate CRYPT_MAP_READ_GFP gfp_t (defined as GFP_NOWAIT) should protect from maintenance burden if that flag were to change for some reason. Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:34 -05:00
Mike Snitzer	0fbb4d93b3	dm: add dm_submit_bio_remap interface Where possible, switch from early bio-based IO accounting (at the time DM clones each incoming bio) to late IO accounting just before each remapped bio is issued to underlying device via submit_bio_noacct(). Allows more precise bio-based IO accounting for DM targets that use their own workqueues to perform additional processing of each bio in conjunction with their DM_MAPIO_SUBMITTED return from their map function. When a target is updated to use dm_submit_bio_remap() they must also set ti->accounts_remapped_io to true. Use xchg() in start_io_acct(), as suggested by Mikulas, to ensure each IO is only started once. The xchg race only happens if __send_duplicate_bios() sends multiple bios -- that case is reflected via tio->is_duplicate_bio. Given the niche nature of this race, it is best to avoid any xchg performance penalty for normal IO. For IO that was never submitted with dm_bio_submit_remap(), but the target completes the clone with bio_endio, accounting is started then ended and pending_io counter decremented. Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:33 -05:00
Mike Snitzer	e6fc9f62ce	dm: flag clones created by __send_duplicate_bios Formally disallow dm_accept_partial_bio() on clones created by __send_duplicate_bios() because their len_ptr points to a shared unsigned int. __send_duplicate_bios() is only used for flush bios and other "abnormal" bios (discards, writezeroes, etc). And dm_accept_partial_bio() already didn't support flush bios. Also refactor __send_changing_extent_only() to reflect it cannot fail. As such __send_changing_extent_only() can update the clone_info before __send_duplicate_bios() is called to fan-out __map_bio() calls. Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:32 -05:00
Mike Snitzer	300432f58b	dm: reduce dm_io and dm_target_io struct sizes Remove one 4 byte hole in dm_io struct. Remove two 4 byte holes in dm_target_io struct. Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:31 -05:00
Mike Snitzer	018b05ebbf	dm: move duplicate code from callers of alloc_tio into alloc_tio Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:30 -05:00
Mike Snitzer	743598f049	dm: record old_sector in dm_target_io before calling map function Prep for being able to defer trace_block_bio_remap() until when the bio is remapped and submitted by the DM target. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:29 -05:00
Mike Snitzer	77c11720a4	dm: remove legacy code only needed before submit_bio recursion Commit `8615cb65bd` ("dm: remove useless loop in __split_and_process_bio") showcased that we no longer loop. Remove the bio_advance() in __split_and_process_bio() that was only needed when looping was possible. Similarly there is no need to advance the bio, using ci->sector cursor, in __send_duplicate_bios(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:28 -05:00
Mike Snitzer	0119ab14c3	dm: remove unused mapped_device argument from free_tio Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:27 -05:00
Mike Snitzer	5b27b8ddbf	dm: remove impossible BUG_ON in __send_empty_flush The flush_bio in question was just initialized to be empty, so there is no way bio_has_data() will return true. So remove stale BUG_ON(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:26 -05:00
Mike Snitzer	90a2326ede	dm: reduce code duplication in __map_bio Error path code (for handling DM_MAPIO_REQUEUE and DM_MAPIO_KILL) is effectively identical. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:25 -05:00
Mike Snitzer	d41e077ab6	dm: refactor dm_split_and_process_bio a bit Remove needless branching and indentation. Leaves code to catch malformed op_is_zone_mgmt bios (they shouldn't have a payload). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:24 -05:00
Mike Snitzer	66bdaa4302	dm: fold __clone_and_map_data_bio into __split_and_process_bio Fold __clone_and_map_data_bio into its only caller. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:23 -05:00
Mike Snitzer	96c9865cb6	dm: rename split functions Rename __split_and_process_bio to dm_split_and_process_bio. Rename __split_and_process_non_flush to __split_and_process_bio. Also fix a stale comment and whitespace. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:22 -05:00
Mike Snitzer	205649d84c	dm: reorder members in mapped_device struct Improves alignment and groups related members relative to cachelines. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:21 -05:00
Mike Snitzer	0ab30b4079	dm: eliminate copying of dm_io fields in dm_io_dec_pending There is no need for dm_io_dec_pending() to copy dm_io fields anymore now that DM provides its own pending_io counters again. The race documented in commit `d208b89401` ("dm: fix mempool NULL pointer race when completing IO") no longer exists now that block core's in_flight counters aren't used to signal all dm_io is complete. Also, rename {start,end}_io_acct to dm_{start,end}_io_acct. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:36:18 -05:00
Mike Snitzer	0cdb90f0f3	dm stats: fix too short end duration_ns when using precise_timestamps dm_stats_account_io()'s STAT_PRECISE_TIMESTAMPS support doesn't handle the fact that with commit `b879f915bc` ("dm: properly fix redundant bio-based IO accounting") io->start_time _may_ be in the past (meaning the start_io_acct() was deferred until later). Add a new dm_stats_recalc_precise_timestamps() helper that will set/clear a new 'precise_timestamps' flag in the dm_stats struct based on whether any configured stats enable STAT_PRECISE_TIMESTAMPS. And update DM core's alloc_io() to use dm_stats_record_start() to set stats_aux.duration_ns if stats->precise_timestamps is true. Also, remove unused 'last_sector' and 'last_rw' members from the dm_stats struct. Fixes: `b879f915bc` ("dm: properly fix redundant bio-based IO accounting") Cc: stable@vger.kernel.org Co-developed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:35:39 -05:00
Mike Snitzer	8d394bc4ad	dm: fix double accounting of flush with data DM handles a flush with data by first issuing an empty flush and then once it completes the REQ_PREFLUSH flag is removed and the payload is issued. The problem fixed by this commit is that both the empty flush bio and the data payload will account the full extent of the data payload. Fix this by factoring out dm_io_acct() and having it wrap all IO accounting to set the size of bio with REQ_PREFLUSH to 0, account the IO, and then restore the original size. Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:35:25 -05:00
Mike Snitzer	9f6dc63376	dm: interlock pending dm_io and dm_wait_for_bios_completion Commit `d208b89401` ("dm: fix mempool NULL pointer race when completing IO") didn't go far enough. When bio_end_io_acct ends the count of in-flight I/Os may reach zero and the DM device may be suspended. There is a possibility that the suspend races with dm_stats_account_io. Fix this by adding percpu "pending_io" counters to track outstanding dm_io. Move kicking of suspend queue to dm_io_dec_pending(). Also, rename md_in_flight_bios() to dm_in_flight_bios() and update it to iterate all pending_io counters. Fixes: `d208b89401` ("dm: fix mempool NULL pointer race when completing IO") Cc: stable@vger.kernel.org Co-developed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-02-21 15:19:08 -05:00
Christoph Hellwig	7a5428dcb7	block: fix surprise removal for drivers calling blk_set_queue_dying Various block drivers call blk_set_queue_dying to mark a disk as dead due to surprise removal events, but since commit `8e141f9eb8` that doesn't work given that the GD_DEAD flag needs to be set to stop I/O. Replace the driver calls to blk_set_queue_dying with a new (and properly documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the only remaining caller. Fixes: `8e141f9eb8` ("block: drain file system I/O on del_gendisk") Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-17 07:54:03 -07:00
Christoph Hellwig	9f9adea718	dm: remove dm_dispatch_clone_request Fold dm_dispatch_clone_request into it's only caller, and use a switch statement to single dispatch for the handling of the different return values from blk_insert_cloned_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220215100540.3892965-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-16 19:39:10 -07:00
Christoph Hellwig	8803c89f36	dm: remove useless code from dm_dispatch_clone_request Both ->start_time_ns and the RQF_IO_STAT are set when the request is allocated using blk_mq_alloc_request by dm-mpath in blk_mq_rq_ctx_init. The block layer also ensures ->start_time_ns is only set when actually needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220215100540.3892965-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-16 19:39:10 -07:00
Christoph Hellwig	28db4711bf	blk-mq: remove the request_queue argument to blk_insert_cloned_request The request must be submitted to the queue it was allocated for, so remove the extra request_queue argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220215100540.3892965-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-16 19:39:10 -07:00
Christoph Hellwig	248c793359	blk-mq: make the blk-mq stacking code optional The code to stack blk-mq drivers is only used by dm-multipath, and will preferably stay that way. Make it optional and only selected by device mapper, so that the buildbots more easily catch abuses like the one that slipped in in the ufs driver in the last merged window. Another positive side effects is that kernel builds without device mapper shrink a little bit as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220215100540.3892965-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-16 19:39:09 -07:00
Christoph Hellwig	abfc426d1b	block: pass a block_device to bio_clone_fast Pass a block_device to bio_clone_fast and __bio_clone_fast and give the functions more suitable names. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:18 -07:00
Christoph Hellwig	a0e8de798d	block: initialize the target bio in __bio_clone_fast All callers of __bio_clone_fast initialize the bio first. Move that initialization into __bio_clone_fast instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:18 -07:00
Christoph Hellwig	92986f6b4c	dm: use bio_clone_fast in alloc_io/alloc_tio Replace open coded bio_clone_fast implementations with the actual helper. Note that the bio allocated as part of the dm_io structure in alloc_io will only actually be used later in alloc_tio, making this earlier cloning of the information safe. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:18 -07:00
Christoph Hellwig	56b4b5abcd	block: clone crypto and integrity data in __bio_clone_fast __bio_clone_fast should also clone integrity and crypto data, as a clone without those is incomplete. Right now the only caller that can actually support crypto and integrity data (dm) does it manually for the one callchain that supports these, but we better do it properly in the core. Note that all callers except for the above mentioned one also don't need to handle failure at all, given that the integrity and crypto clones are based on mempool allocations that won't fail for sleeping allocations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:18 -07:00
Christoph Hellwig	3c4b455ef8	dm-cache: remove __remap_to_origin_clear_discard Fold __remap_to_origin_clear_discard into the two callers to prepare for bio cloning refactoring. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:18 -07:00
Christoph Hellwig	891fced644	dm: simplify the single bio fast path in __send_duplicate_bios Most targets just need a single flush bio. Open code that case in __send_duplicate_bios without the need to add the bio to a list. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:17 -07:00
Christoph Hellwig	1d1068cecf	dm: retun the clone bio from alloc_tio Return the clone bio embedded into the tio as that is what the callers actually want. Similar for the free side. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:17 -07:00
Christoph Hellwig	1561b39610	dm: pass the bio instead of tio to __map_bio This simplifies the callers a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:17 -07:00
Christoph Hellwig	dc8e2021da	dm: move cloning the bio into alloc_tio Move the call to __bio_clone_fast and the assignment of ->len_ptr from the callers into alloc_tio to prepare for changes to the bio clone API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:17 -07:00
Christoph Hellwig	8eabf5d0a7	dm: fold __send_duplicate_bios into __clone_and_map_simple_bio Fold __send_duplicate_bios into its only caller to prepare for refactoring. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:17 -07:00
Christoph Hellwig	b1bee79237	dm: fold clone_bio into __clone_and_map_data_bio Fold clone_bio into its only caller to prepare for refactoring. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:17 -07:00
Christoph Hellwig	6c23f0bd7f	dm: add a clone_to_tio helper Add a helper to stop open coding the container_of operations to get from the clone bio to the tio structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-04 07:43:17 -07:00
Song Liu	0f9650bd83	md: fix NULL pointer deref with nowait but no mddev->queue Leon reported NULL pointer deref with nowait support: [ 15.123761] device-mapper: raid: Loading target version 1.15.1 [ 15.124185] device-mapper: raid: Ignoring chunk size parameter for RAID 1 [ 15.124192] device-mapper: raid: Choosing default region size of 4MiB [ 15.129524] BUG: kernel NULL pointer dereference, address: 0000000000000060 [ 15.129530] #PF: supervisor write access in kernel mode [ 15.129533] #PF: error_code(0x0002) - not-present page [ 15.129535] PGD 0 P4D 0 [ 15.129538] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 15.129541] CPU: 5 PID: 494 Comm: ldmtool Not tainted 5.17.0-rc2-1-mainline #1 9fe89d43dfcb215d2731e6f8851740520778615e [ 15.129546] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F36e 10/14/2021 [ 15.129549] RIP: 0010:blk_queue_flag_set+0x7/0x20 [ 15.129555] Code: 00 00 00 0f 1f 44 00 00 48 8b 35 e4 e0 04 02 48 8d 57 28 bf 40 01 \ 00 00 e9 16 c1 be ff 66 0f 1f 44 00 00 0f 1f 44 00 00 89 ff <f0> 48 0f ab 7e 60 \ 31 f6 89 f7 c3 66 66 2e 0f 1f 84 00 00 00 00 00 [ 15.129559] RSP: 0018:ffff966b81987a88 EFLAGS: 00010202 [ 15.129562] RAX: ffff8b11c363a0d0 RBX: ffff8b11e294b070 RCX: 0000000000000000 [ 15.129564] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001d [ 15.129566] RBP: ffff8b11e294b058 R08: 0000000000000000 R09: 0000000000000000 [ 15.129568] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b11e294b070 [ 15.129570] R13: 0000000000000000 R14: ffff8b11e294b000 R15: 0000000000000001 [ 15.129572] FS: 00007fa96e826780(0000) GS:ffff8b18deb40000(0000) knlGS:0000000000000000 [ 15.129575] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 15.129577] CR2: 0000000000000060 CR3: 000000010b8ce000 CR4: 00000000003506e0 [ 15.129580] Call Trace: [ 15.129582] <TASK> [ 15.129584] md_run+0x67c/0xc70 [md_mod 1e470c1b6bcf1114198109f42682f5a2740e9531] [ 15.129597] raid_ctr+0x134a/0x28ea [dm_raid 6a645dd7519e72834bd7e98c23497eeade14cd63] [ 15.129604] ? dm_split_args+0x63/0x150 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129615] dm_table_add_target+0x188/0x380 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129625] table_load+0x13b/0x370 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129635] ? dev_suspend+0x2d0/0x2d0 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129644] ctl_ioctl+0x1bd/0x460 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129655] dm_ctl_ioctl+0xa/0x20 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129663] __x64_sys_ioctl+0x8e/0xd0 [ 15.129667] do_syscall_64+0x5c/0x90 [ 15.129672] ? syscall_exit_to_user_mode+0x23/0x50 [ 15.129675] ? do_syscall_64+0x69/0x90 [ 15.129677] ? do_syscall_64+0x69/0x90 [ 15.129679] ? syscall_exit_to_user_mode+0x23/0x50 [ 15.129682] ? do_syscall_64+0x69/0x90 [ 15.129684] ? do_syscall_64+0x69/0x90 [ 15.129686] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 15.129689] RIP: 0033:0x7fa96ecd559b [ 15.129692] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c \ c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff \ ff 73 01 c3 48 8b 0d a5 a8 0c 00 f7 d8 64 89 01 48 [ 15.129696] RSP: 002b:00007ffcaf85c258 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 [ 15.129699] RAX: ffffffffffffffda RBX: 00007fa96f1b48f0 RCX: 00007fa96ecd559b [ 15.129701] RDX: 00007fa97017e610 RSI: 00000000c138fd09 RDI: 0000000000000003 [ 15.129702] RBP: 00007fa96ebab583 R08: 00007fa97017c9e0 R09: 00007ffcaf85bf27 [ 15.129704] R10: 0000000000000001 R11: 0000000000000206 R12: 00007fa97017e610 [ 15.129706] R13: 00007fa97017e640 R14: 00007fa97017e6c0 R15: 00007fa97017e530 [ 15.129709] </TASK> This is caused by missing mddev->queue check for setting QUEUE_FLAG_NOWAIT Fix this by moving the QUEUE_FLAG_NOWAIT logic to under mddev->queue check. Fixes: `f51d46d0e7` ("md: add support for REQ_NOWAIT") Reported-by: Leon Möller <jkhsjdhjs@totally.rip> Tested-by: Leon Möller <jkhsjdhjs@totally.rip> Cc: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>	2022-02-02 10:14:07 -08:00
Christoph Hellwig	a7c50c9404	block: pass a block_device and opf to bio_reset Pass the block_device that we plan to use this bio for and the operation to bio_reset to optimize the assigment. A NULL block_device can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-02 07:50:00 -07:00
Christoph Hellwig	49add4966d	block: pass a block_device and opf to bio_init Pass the block_device that we plan to use this bio for and the operation to bio_init to optimize the assignment. A NULL block_device can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-02 07:49:59 -07:00
Christoph Hellwig	07888c665b	block: pass a block_device and opf to bio_alloc Pass the block_device and operation that we plan to use this bio for to bio_alloc to optimize the assignment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-02 07:49:59 -07:00
Christoph Hellwig	609be10667	block: pass a block_device and opf to bio_alloc_bioset Pass the block_device and operation that we plan to use this bio for to bio_alloc_bioset to optimize the assigment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-02 07:49:59 -07:00
Christoph Hellwig	28d7d128aa	dm-thin: use blkdev_issue_flush instead of open coding it Use blkdev_issue_flush, which uses an on-stack bio instead of an opencoded version with a bio embedded into struct pool. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220124091107.642561-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-02 07:49:59 -07:00
Christoph Hellwig	eba33b8ef1	dm-snap: use blkdev_issue_flush instead of open coding it Use blkdev_issue_flush, which uses an on-stack bio instead of an opencoded version with a bio embedded into struct dm_snapshot. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220124091107.642561-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-02 07:49:59 -07:00
Christoph Hellwig	3f868c09ea	dm-crypt: remove clone_init Just open code it next to the bio allocations, which saves a few lines of code, prepares for future changes and allows to remove the duplicate bi_opf assignment for the bio_clone_fast case in kcryptd_io_read. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220124091107.642561-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-02 07:49:59 -07:00
Christoph Hellwig	53db984e00	dm: bio_alloc can't fail if it is allowed to sleep Remove handling of NULL returns from sleeping bio_alloc calls given that those can't fail. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220124091107.642561-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-02 07:49:59 -07:00
Christoph Hellwig	322cbb50de	block: remove genhd.h There is no good reason to keep genhd.h separate from the main blkdev.h header that includes it. So fold the contents of genhd.h into blkdev.h and remove genhd.h entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220124093913.742411-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-02-02 07:49:59 -07:00
Mike Snitzer	b879f915bc	dm: properly fix redundant bio-based IO accounting Record the start_time for a bio but defer the starting block core's IO accounting until after IO is submitted using bio_start_io_acct_time(). This approach avoids the need to mess around with any of the individual IO stats in response to a bio_split() that follows bio submission. Reported-by: Bud Brown <bubrown@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Depends-on: `e45c47d1f9` ("block: add bio_start_io_acct_time() to control start_time") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220128155841.39644-4-snitzer@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-01-28 12:28:15 -07:00
Mike Snitzer	f524d9c95f	dm: revert partial fix for redundant bio-based IO accounting Reverts `a1e1cb72d9` ("dm: fix redundant IO accounting for bios that need splitting") because it was too narrow in scope (only addressed redundant 'sectors[]' accounting and not ios, nsecs[], etc). Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220128155841.39644-3-snitzer@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-01-28 12:28:15 -07:00
Greg Kroah-Hartman	fa97cb843c	bcache: use default_groups in kobj_type There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the bcache sysfs code to use default_groups field which has been the preferred way since `aa30f47cf6` ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: linux-bcache@vger.kernel.org Acked-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220106100004.3277439-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-01-26 15:56:18 +01:00
Linus Torvalds	3acbdbf42e	dax + libnvdimm for v5.17 - Simplify the dax_operations API - Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining and applying a partition offset to all its DAX iomap operations. - Remove wrappers and device-mapper stacked callbacks for ->copy_from_iter() and ->copy_to_iter() in favor of moving block_device relative offset responsibility to the dax_direct_access() caller. - Remove the need for an @bdev in filesystem-DAX infrastructure - Remove unused uio helpers copy_from_iter_flushcache() and copy_mc_to_iter() as only the non-check_copy_size() versions are used for DAX. - Prepare XFS for the pending (next merge window) DAX+reflink support - Remove deprecated DEV_DAX_PMEM_COMPAT support - Cleanup a straggling misuse of the GUID api Tags offered after the branch was cut: Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/Ydb/3P+8nvjCjYfO@redhat.com -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYd3dTAAKCRDfioYZHlFs Z//UAP9zetoTE+O7zJG7CXja4jSopSadbdbh6QKSXaqfKBPvQQD+N4US3wA2bGv8 f/qCY62j2Hj3hUTGHs9RvTyw3JsSYAA= =QvDs -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax and libnvdimm updates from Dan Williams: "The bulk of this is a rework of the dax_operations API after discovering the obstacles it posed to the work-in-progress DAX+reflink support for XFS and other copy-on-write filesystem mechanics. Primarily the need to plumb a block_device through the API to handle partition offsets was a sticking point and Christoph untangled that dependency in addition to other cleanups to make landing the DAX+reflink support easier. The DAX_PMEM_COMPAT option has been around for 4 years and not only are distributions shipping userspace that understand the current configuration API, but some are not even bothering to turn this option on anymore, so it seems a good time to remove it per the deprecation schedule. Recall that this was added after the device-dax subsystem moved from /sys/class/dax to /sys/bus/dax for its sysfs organization. All recent functionality depends on /sys/bus/dax. Some other miscellaneous cleanups and reflink prep patches are included as well. Summary: - Simplify the dax_operations API: - Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining and applying a partition offset to all its DAX iomap operations. - Remove wrappers and device-mapper stacked callbacks for ->copy_from_iter() and ->copy_to_iter() in favor of moving block_device relative offset responsibility to the dax_direct_access() caller. - Remove the need for an @bdev in filesystem-DAX infrastructure - Remove unused uio helpers copy_from_iter_flushcache() and copy_mc_to_iter() as only the non-check_copy_size() versions are used for DAX. - Prepare XFS for the pending (next merge window) DAX+reflink support - Remove deprecated DEV_DAX_PMEM_COMPAT support - Cleanup a straggling misuse of the GUID api" * tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (38 commits) iomap: Fix error handling in iomap_zero_iter() ACPI: NFIT: Import GUID before use dax: remove the copy_from_iter and copy_to_iter methods dax: remove the DAXDEV_F_SYNC flag dax: simplify dax_synchronous and set_dax_synchronous uio: remove copy_from_iter_flushcache() and copy_mc_to_iter() iomap: turn the byte variable in iomap_zero_iter into a ssize_t memremap: remove support for external pgmap refcounts fsdax: don't require CONFIG_BLOCK iomap: build the block based code conditionally dax: fix up some of the block device related ifdefs fsdax: shift partition offset handling into the file systems dax: return the partition offset from fs_dax_get_by_bdev iomap: add a IOMAP_DAX flag xfs: pass the mapping flags to xfs_bmbt_to_iomap xfs: use xfs_direct_write_iomap_ops for DAX zeroing xfs: move dax device handling into xfs_{alloc,free}_buftarg ext4: cleanup the dax handling in ext4_fill_super ext2: cleanup the dax handling in ext2_fill_super fsdax: decouple zeroing from the iomap buffered I/O code ...	2022-01-12 15:46:11 -08:00
Linus Torvalds	49008f0cc1	- Fixes and improvements to dm btree and dm space map code in persistent-data library used by thinp and cache. - Update DM integrity to use struct_group() to zero struct journal_sector. - Update DM sysfs to use default_groups in kobj_type. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmHe/s4ACgkQxSPxCi2d A1rlwggA4esorVQSM86KcoHjRqatof1NBPoAdTPt1BykO9wl/HBCn4YGOPJVWkoP Dwe1PUvlH0ZbpnW/J7IrMpdT+IyXJVzQSmFJsQTFjMWXQM2OCDKz5YJc4KQzFBda Hl30kChoot58kxtZ29FLz4t3cpXG8lkB3tz3O/Z2McGQTymNM6Ekbv7GXwX5jfDo QpRsXEJ2L2jyVFxwE6kCIpDIiOI0a3+3OvHQneTGunCnhqZ6N3ii8llKbw/OAzHI jE+q+tJ6NeNMVMdETenrFXSBcbxo07VG87hWj62XbGzHaeYE5vWu1UCWPjx2kqS3 +Co7+q02v925cThrRGftc5xQlB88+Q== =YgBt -----END PGP SIGNATURE----- Merge tag 'for-5.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Fixes and improvements to dm btree and dm space map code in persistent-data library used by thinp and cache. - Update DM integrity to use struct_group() to zero struct journal_sector. - Update DM sysfs to use default_groups in kobj_type. * tag 'for-5.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm sysfs: use default_groups in kobj_type dm integrity: Use struct_group() to zero struct journal_sector dm space map common: add bounds check to sm_ll_lookup_bitmap() dm btree: add a defensive bounds check to insert_at() dm btree remove: change a bunch of BUG_ON() calls to proper errors dm btree spine: eliminate duplicate le32_to_cpu() in node_check() dm btree spine: remove extra node_check function declaration	2022-01-12 10:40:11 -08:00
Linus Torvalds	c9193f48e9	for-5.17/drivers-2022-01-11 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHd8EIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnOKEADGpxp+Vntbm8nZI/PFP5fA2gUTZWgSVB4l axVTYW21pjSrsrAhGg2FIgBgL0tNkgxQnIPRn50YL8jT3pTkCEcR7kLbhEU7W/Ln 7hrsBgFnsCBoCs38LvzXHZD69jtEtNRk1ijPMLo5iCcHkAyUVKa1glfeMwefuI5/ Rl8SoueRXppvCfwNPptaAKiDsYVN8KCJPvvhlMNoKP5n1iTsNYJ/HVsLqfRnP0oc CR6eHaYceWGLER8tWtBlG2Qp40+cd/A320thkIlEpEKJPWE/ce5AUp0PYxVJbwjU qvO1tMYSya7gPiaVWRJcUeAgRFiivM/kTdDrGwiY9hpv/BQG7EAW5D9Xecz/M4UG BgNLfhe0aR9QssjPxITgyiy9sRpwwpnpoVONTu3slgXVTUVlOq0QT6LOTPR1B9A4 ZjbHVCuI3eyrAOqD4IjYSqjHa6GjFLiKTh8Q0ZB/KJGX1eItLVLVdJfcfV4RkBIf 6RZg9+7/mXaDxU74DZ2tfUhHT0sC5RS+5VFxpkhThVk9qRbVdZGGWAHcVOkMjk9B L4PCpJeuaR+rzXvCDOCOI5sHraa5F/IRhMaTu5sHj/MIuEpq1fqjaB7tWRvfm6HO 4tepUtb++rS3/zFFQlZCLyjVk2o0p2b0viwPLjvsRqsBp1bVoO9mJIiyp6POmM3G UjxQS0vEDw== =k0IZ -----END PGP SIGNATURE----- Merge tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: - mtip32xx pci cleanups (Bjorn) - mtip32xx conversion to generic power management (Vaibhav) - rsxx pci powermanagement cleanups (Bjorn) - Remove the rsxx driver. This hardware never saw much adoption, and it's been end of lifed for a while. (Christoph) - MD pull request from Song: - REQ_NOWAIT support (Vishal Verma) - raid6 benchmark optimization (Dirk Müller) - Fix for acct bioset (Xiao Ni) - Clean up max_queued_requests (Mariusz Tkaczyk) - PREEMPT_RT optimization (Davidlohr Bueso) - Use default_groups in kobj_type (Greg Kroah-Hartman) - Use attribute groups in pktcdvd and rnbd (Greg) - NVMe pull request from Christoph: - increment request genctr on completion (Keith Busch, Geliang Tang) - add a 'iopolicy' module parameter (Hannes Reinecke) - print out valid arguments when reading from /dev/nvme-fabrics (Hannes Reinecke) - Use struct_group() in drbd (Kees) - null_blk fixes (Ming) - Get rid of congestion logic in pktcdvd (Neil) - Floppy ejection hang fix (Tasos) - Floppy max user request size fix (Xiongwei) - Loop locking fix (Tetsuo) * tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-block: (32 commits) md: use default_groups in kobj_type md: Move alloc/free acct bioset in to personality lib/raid6: Use strict priority ranking for pq gen() benchmarking lib/raid6: skip benchmark of non-chosen xor_syndrome functions md: fix spelling of "its" md: raid456 add nowait support md: raid10 add nowait support md: raid1 add nowait support md: add support for REQ_NOWAIT md: drop queue limitation for RAID1 and RAID10 md/raid5: play nice with PREEMPT_RT block/rnbd-clt-sysfs: use default_groups in kobj_type pktcdvd: convert to use attribute groups block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0 nvme: add 'iopolicy' module parameter nvme: drop unused variable ctrl in nvme_setup_cmd nvme: increment request genctr on completion nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics block: remove the rsxx driver rsxx: Drop PCI legacy power management ...	2022-01-12 10:35:23 -08:00
Linus Torvalds	d3c8108035	for-5.17/block-2022-01-11 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHd8DAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnhRD/wMAjsNO65PCA+o/bPpVi4ulx9EejAzrJnB 5vHFvREAoOOGKvRpYGe4w3TcKyW+zPb+GtlXFjPfK+wuVzWhrQtW/+vkjKlBt8wK o7rzeMwTKJ9ZGvYaaQpp1yC0WURBB3qnCRQhb8dOQzhJgEXinhIOznZsut4mniLv fTqcDmKAb/+G6K6CQCCqnH0I/+OJZyUeSFo1kk2i4ZqCBepQpBkOL6H2rBOtGxUg bt1jiGHbbhCRYEE3u2kV0HP10qAChNaMQC705jV4Qpf4+3EntSxs+6nSb74dvMkX 3+Wmp8Ctq6lpPnDL1nrAFGz3jZnB0Y+GdgOclQn3ViQd1FCXZzuYWQ3fTaBfURCZ /RE5nc047SqpwCFLOynM++OkaeQZ1zSxeyoFTtzDaPF4tLuaX3JHswvTzNGPw8SN BnexseNnNBCjJliZSEE7fOkjJDcev2dvRxPtI8/wkF4lHUgETc5IW563C53xo/Tx 32yFjZwCVIpNWk21su/0H3iEq80wZ7PnriiN/E3JA6XbnevlRPu0NPMb0D258GCm yCcdPVDNZsQCB8hluqZcu0g6LSgZRo90Yg1oqKqEpAllJJMBaEAPPPuUIJh998mo iKGxZzgr7d9jrbGJTInp0F8b3B3/oV/hxgzy0Hu/mHP3AsnaAk9o/oEQZ7rX4Khr 6biloqkIMA== =RWnJ -----END PGP SIGNATURE----- Merge tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Unify where the struct request handling code is located in the blk-mq code (Christoph) - Header cleanups (Christoph) - Clean up the io_context handling code (Christoph, me) - Get rid of ->rq_disk in struct request (Christoph) - Error handling fix for add_disk() (Christoph) - request allocation cleanusp (Christoph) - Documentation updates (Eric, Matthew) - Remove trivial crypto unregister helper (Eric) - Reduce shared tag overhead (John) - Reduce poll_stats memory overhead (me) - Known indirect function call for dio (me) - Use atomic references for struct request (me) - Support request list issue for block and NVMe (me) - Improve queue dispatch pinning (Ming) - Improve the direct list issue code (Keith) - BFQ improvements (Jan) - Direct completion helper and use it in mmc block (Sebastian) - Use raw spinlock for the blktrace code (Wander) - fsync error handling fix (Ye) - Various fixes and cleanups (Lukas, Randy, Yang, Tetsuo, Ming, me) * tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block: (132 commits) MAINTAINERS: add entries for block layer documentation docs: block: remove queue-sysfs.rst docs: sysfs-block: document virt_boundary_mask docs: sysfs-block: document stable_writes docs: sysfs-block: fill in missing documentation from queue-sysfs.rst docs: sysfs-block: add contact for nomerges docs: sysfs-block: sort alphabetically docs: sysfs-block: move to stable directory block: don't protect submit_bio_checks by q_usage_counter block: fix old-style declaration nvme-pci: fix queue_rqs list splitting block: introduce rq_list_move block: introduce rq_list_for_each_safe macro block: move rq_list macros to blk-mq.h block: drop needless assignment in set_task_ioprio() block: remove unnecessary trailing '\' bio.h: fix kernel-doc warnings block: check minor range in device_add_disk() block: use "unsigned long" for blk_validate_block_size(). block: fix error unwinding in device_add_disk ...	2022-01-12 10:26:52 -08:00
Linus Torvalds	35632d92ef	block-5.16-2022-01-07 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHYRcAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplYyEACwEaL/JVb1fYf5fv7nQiVQPQxDhIFwMB66 1bBV5y3SoQmq0ZO0mEkb3SIProbEVadnTOjcF8+8SB6K0qLefhhBfTRB9jnIfQqL qTGbIMcWv0q2wnBGnRIv1ADzu6KkRT+pXLW9RgLs8ZbQDxH2w9tMSvRYeBKV+xHp J6PPNp5OjuA+Obd8fi7ILNTDh5EIkWQBpZNoByyAUmz9iH8InsALhM3aae2jPbO+ uWdPTxK7QKUvGIS+mfnTF/F+pURyQDwrPLirxq8yNfiAe++ZaryLb6jQcF/m3yKP s1FlGZrYsbRhroRJxa1uRZzVkenHbWTSea/iqFQdf+rJs4RASOr2/lHimorWKt0H HfV+xAcVmZR775KINh+zwImuAvB5ymOyUePUMeAvDyaAeiPsQGQ+VuylNFBvcHNg JvEmSaQfKw5H0qAUyIN70v3bqeOaPF7b50Kyg35a9RbELAvMeVIASokIjmJJ2ndb uVdPo+9FozwDFQ4iMbdJ4fjl9XHKxSKxJKaPRfzz+Tfbx1QSKXkcepT8b8WJ0S7O yweuLRwrlrUW1kFMbAS9sspL94z0o1rF61FjRJdd24D5YnGTqUK9Ciok2mpYqokk jBQX05+nyiyKwrPYJfblDxnQGLzv5VCkhkGmpARla0J1si9OrK9Oxp1lq9Q1XeWY EdO4WmGMDA== =Liaw -----END PGP SIGNATURE----- Merge tag 'block-5.16-2022-01-07' of git://git.kernel.dk/linux-block Pull block fix from Jens Axboe: "Just the md bitmap regression this time" * tag 'block-5.16-2022-01-07' of git://git.kernel.dk/linux-block: md/raid1: fix missing bitmap update w/o WriteMostly devices	2022-01-07 13:28:20 -08:00
Greg Kroah-Hartman	1745e857e7	md: use default_groups in kobj_type There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the md rdev sysfs code to use default_groups field which has been the preferred way since commit `aa30f47cf6` ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Song Liu <song@kernel.org>	2022-01-06 10:42:50 -08:00
Xiao Ni	0c031fd37f	md: Move alloc/free acct bioset in to personality bioset acct is only needed for raid0 and raid5. Therefore, md_run only allocates it for raid0 and raid5. However, this does not cover personality takeover, which may cause uninitialized bioset. For example, the following repro steps: mdadm -CR /dev/md0 -l1 -n2 /dev/loop0 /dev/loop1 mdadm --wait /dev/md0 mkfs.xfs /dev/md0 mdadm /dev/md0 --grow -l5 mount /dev/md0 /mnt causes panic like: [ 225.933939] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 225.934903] #PF: supervisor instruction fetch in kernel mode [ 225.935639] #PF: error_code(0x0010) - not-present page [ 225.936361] PGD 0 P4D 0 [ 225.936677] Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI [ 225.937525] CPU: 27 PID: 1133 Comm: mount Not tainted 5.16.0-rc3+ #706 [ 225.938416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.module_el8.4.0+547+a85d02ba 04/01/2014 [ 225.939922] RIP: 0010:0x0 [ 225.940289] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. [ 225.941196] RSP: 0018:ffff88815897eff0 EFLAGS: 00010246 [ 225.941897] RAX: 0000000000000000 RBX: 0000000000092800 RCX: ffffffff81370a39 [ 225.942813] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000092800 [ 225.943772] RBP: 1ffff1102b12fe04 R08: fffffbfff0b43c01 R09: fffffbfff0b43c01 [ 225.944807] R10: ffffffff85a1e007 R11: fffffbfff0b43c00 R12: ffff88810eaaaf58 [ 225.945757] R13: 0000000000000000 R14: ffff88810eaaafb8 R15: ffff88815897f040 [ 225.946709] FS: 00007ff3f2505080(0000) GS:ffff888fb5e00000(0000) knlGS:0000000000000000 [ 225.947814] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 225.948556] CR2: ffffffffffffffd6 CR3: 000000015aa5a006 CR4: 0000000000370ee0 [ 225.949537] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 225.950455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 225.951414] Call Trace: [ 225.951787] <TASK> [ 225.952120] mempool_alloc+0xe5/0x250 [ 225.952625] ? mempool_resize+0x370/0x370 [ 225.953187] ? rcu_read_lock_sched_held+0xa1/0xd0 [ 225.953862] ? rcu_read_lock_bh_held+0xb0/0xb0 [ 225.954464] ? sched_clock_cpu+0x15/0x120 [ 225.955019] ? find_held_lock+0xac/0xd0 [ 225.955564] bio_alloc_bioset+0x1ed/0x2a0 [ 225.956080] ? lock_downgrade+0x3a0/0x3a0 [ 225.956644] ? bvec_alloc+0xc0/0xc0 [ 225.957135] bio_clone_fast+0x19/0x80 [ 225.957651] raid5_make_request+0x1370/0x1b70 [ 225.958286] ? sched_clock_cpu+0x15/0x120 [ 225.958797] ? __lock_acquire+0x8b2/0x3510 [ 225.959339] ? raid5_get_active_stripe+0xce0/0xce0 [ 225.959986] ? lock_is_held_type+0xd8/0x130 [ 225.960528] ? rcu_read_lock_sched_held+0xa1/0xd0 [ 225.961135] ? rcu_read_lock_bh_held+0xb0/0xb0 [ 225.961703] ? sched_clock_cpu+0x15/0x120 [ 225.962232] ? lock_release+0x27a/0x6c0 [ 225.962746] ? do_wait_intr_irq+0x130/0x130 [ 225.963302] ? lock_downgrade+0x3a0/0x3a0 [ 225.963815] ? lock_release+0x6c0/0x6c0 [ 225.964348] md_handle_request+0x342/0x530 [ 225.964888] ? set_in_sync+0x170/0x170 [ 225.965397] ? blk_queue_split+0x133/0x150 [ 225.965988] ? __blk_queue_split+0x8b0/0x8b0 [ 225.966524] ? submit_bio_checks+0x3b2/0x9d0 [ 225.967069] md_submit_bio+0x127/0x1c0 [...] Fix this by moving alloc/free of acct bioset to pers->run and pers->free. While we are on this, properly handle md_integrity_register() error in raid0_run(). Fixes: `daee202471` (md: check level before create and exit io_acct_set) Cc: stable@vger.kernel.org Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>	2022-01-06 08:37:03 -08:00
Randy Dunlap	dd3dc5f416	md: fix spelling of "its" Use the possessive "its" instead of the contraction "it's" in printed messages. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org>	2022-01-06 08:37:03 -08:00
Vishal Verma	bf2c411bb1	md: raid456 add nowait support Returns EAGAIN in case the raid456 driver would block waiting for reshape. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>	2022-01-06 08:37:02 -08:00
Vishal Verma	c9aa889b03	md: raid10 add nowait support This adds nowait support to the RAID10 driver. Very similar to raid1 driver changes. It makes RAID10 driver return with EAGAIN for situations where it could wait for eg: - Waiting for the barrier, - Reshape operation, - Discard operation. wait_barrier() and regular_request_wait() fn are modified to return bool to support error for wait barriers. They returns true in case of wait or if wait is not required and returns false if wait was required but not performed to support nowait. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>	2022-01-06 08:37:02 -08:00
Vishal Verma	5aa705039c	md: raid1 add nowait support This adds nowait support to the RAID1 driver. It makes RAID1 driver return with EAGAIN for situations where it could wait for eg: - Waiting for the barrier, wait_barrier() fn is modified to return bool to support error for wait barriers. It returns true in case of wait or if wait is not required and returns false if wait was required but not performed to support nowait. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>	2022-01-06 08:37:02 -08:00
Vishal Verma	f51d46d0e7	md: add support for REQ_NOWAIT commit `021a24460d` ("block: add QUEUE_FLAG_NOWAIT") added support for checking whether a given bdev supports handling of REQ_NOWAIT or not. Since then commit `6abc49468e` ("dm: add support for REQ_NOWAIT and enable it for linear target") added support for REQ_NOWAIT for dm. This uses a similar approach to incorporate REQ_NOWAIT for md based bios. This patch was tested using t/io_uring tool within FIO. A nvme drive was partitioned into 2 partitions and a simple raid 0 configuration /dev/md0 was created. md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0] 937423872 blocks super 1.2 512k chunks Before patch: $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100 Running top while the above runs: $ ps -eL \| grep $(pidof io_uring) 38396 38396 pts/2 00:00:00 io_uring 38396 38397 pts/2 00:00:15 io_uring 38396 38398 pts/2 00:00:13 iou-wrk-38397 We can see iou-wrk-38397 io worker thread created which gets created when io_uring sees that the underlying device (/dev/md0 in this case) doesn't support nowait. After patch: $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100 Running top while the above runs: $ ps -eL \| grep $(pidof io_uring) 38341 38341 pts/2 00:10:22 io_uring 38341 38342 pts/2 00:10:37 io_uring After running this patch, we don't see any io worker thread being created which indicated that io_uring saw that the underlying device does support nowait. This is the exact behaviour noticed on a dm device which also supports nowait. For all the other raid personalities except raid0, we would need to train pieces which involves make_request fn in order for them to correctly handle REQ_NOWAIT. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>	2022-01-06 08:37:02 -08:00
Mariusz Tkaczyk	a92ce0feff	md: drop queue limitation for RAID1 and RAID10 As suggested by Neil Brown[1], this limitation seems to be deprecated. With plugging in use, writes are processed behind the raid thread and conf->pending_count is not increased. This limitation occurs only if caller doesn't use plugs. It can be avoided and often it is (with plugging). There are no reports that queue is growing to enormous size so remove queue limitation for non-plugged IOs too. [1] https://lore.kernel.org/linux-raid/162496301481.7211.18031090130574610495@noble.neil.brown.name Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>	2022-01-06 08:37:02 -08:00
Davidlohr Bueso	770b1d216d	md/raid5: play nice with PREEMPT_RT raid_run_ops() relies on the implicitly disabled preemption for its percpu ops, although this is really about CPU locality. This breaks RT semantics as it can take regular (and thus sleeping) spinlocks, such as stripe_lock. Add a local_lock such that non-RT does not change and continues to be just map to preempt_disable/enable, but makes RT happy as the region will use a per-CPU spinlock and thus be preemptible and still guarantee CPU locality. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>	2022-01-06 08:37:02 -08:00
Greg Kroah-Hartman	eaac0b590a	dm sysfs: use default_groups in kobj_type There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the dm sysfs code to use default_groups field which has been the preferred way since `aa30f47cf6` ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-01-06 09:48:55 -05:00
Kees Cook	f069c7ab6c	dm integrity: Use struct_group() to zero struct journal_sector In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memset(), avoid intentionally writing across neighboring fields. Add struct_group() to mark region of struct journal_sector that should be initialized to zero. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-01-06 09:48:33 -05:00
Joe Thornber	cba23ac158	dm space map common: add bounds check to sm_ll_lookup_bitmap() Corrupted metadata could warrant returning error from sm_ll_lookup_bitmap(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-01-04 13:58:19 -05:00
Joe Thornber	85bca3c05b	dm btree: add a defensive bounds check to insert_at() Corrupt metadata could trigger an out of bounds write. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-01-04 13:56:03 -05:00
Joe Thornber	c671ffa55d	dm btree remove: change a bunch of BUG_ON() calls to proper errors Abuse of BUG_ON() is never appropriate, best to propagate errors to fail gracefully (rather than take the entire system down). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-01-04 13:48:12 -05:00
Joe Thornber	e36649b648	dm btree spine: eliminate duplicate le32_to_cpu() in node_check() Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-01-04 12:59:55 -05:00
Joe Thornber	851a8cd3f0	dm btree spine: remove extra node_check function declaration Should have been removed as part of commit `f73e2e70ec` ("dm btree spine: remove paranoid node_check call in node_prep_for_write()") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-01-04 12:52:08 -05:00
Song Liu	46669e8616	md/raid1: fix missing bitmap update w/o WriteMostly devices commit [1] causes missing bitmap updates when there isn't any WriteMostly devices. Detailed steps to reproduce by Norbert (which somehow didn't make to lore): # setup md10 (raid1) with two drives (1 GByte sparse files) dd if=/dev/zero of=disk1 bs=1024k seek=1024 count=0 dd if=/dev/zero of=disk2 bs=1024k seek=1024 count=0 losetup /dev/loop11 disk1 losetup /dev/loop12 disk2 mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/loop11 /dev/loop12 # add bitmap (aka write-intent log) mdadm /dev/md10 --grow --bitmap=internal echo check > /sys/block/md10/md/sync_action root:# cat /sys/block/md10/md/mismatch_cnt 0 root:# # remove member drive disk2 (loop12) mdadm /dev/md10 -f loop12 ; mdadm /dev/md10 -r loop12 # modify degraded md device dd if=/dev/urandom of=/dev/md10 bs=512 count=1 # no blocks recorded as out of sync on the remaining member disk1/loop11 root:# mdadm -X /dev/loop11 \| grep Bitmap Bitmap : 16 bits (chunks), 0 dirty (0.0%) root:# # re-add disk2, nothing synced because of empty bitmap mdadm /dev/md10 --re-add /dev/loop12 # check integrity again echo check > /sys/block/md10/md/sync_action # disk1 and disk2 are no longer in sync, reads return differend data root:# cat /sys/block/md10/md/mismatch_cnt 128 root:# # clean up mdadm -S /dev/md10 losetup -d /dev/loop11 losetup -d /dev/loop12 rm disk1 disk2 Fix this by moving the WriteMostly check to the if condition for alloc_behind_master_bio(). [1] commit `fd3b6975e9` ("md/raid1: only allocate write behind bio for WriteMostly device") Fixes: `fd3b6975e9` ("md/raid1: only allocate write behind bio for WriteMostly device") Cc: stable@vger.kernel.org # v5.12+ Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Reported-by: Norbert Warmuth <nwarmuth@t-online.de> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Song Liu <song@kernel.org>	2022-01-03 14:13:48 -08:00
Christoph Hellwig	7ac5360cd4	dax: remove the copy_from_iter and copy_to_iter methods These methods indirect the actual DAX read/write path. In the end pmem uses magic flush and mc safe variants and fuse and dcssblk use plain ones while device mapper picks redirects to the underlying device. Add set_dax_nocache() and set_dax_nomc() APIs to control which copy routines are used to remove indirect call from the read/write fast path as well as a lot of boilerplate code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> [virtiofs] Link: https://lore.kernel.org/r/20211215084508.435401-5-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2021-12-18 08:04:53 -08:00
Christoph Hellwig	30c6828a17	dax: remove the DAXDEV_F_SYNC flag Remove the DAXDEV_F_SYNC flag and thus the flags argument to alloc_dax and just let the drivers call set_dax_synchronous directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/20211215084508.435401-4-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2021-12-18 08:04:52 -08:00
Linus Torvalds	fa09ca5ebc	block-5.16-2021-12-17 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmG8wJ4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpp+aD/9C6yo/8I2rbP0iJqIWKaADKoU9WGo4E4Q5 NFPsEF17IY2mxZGfvpjMMx2fWPbgfvMuXnWPyfHnFv/00Bm3sr79eQ4Y7Ak8pZWQ r0YzNz1lbgHn8S+EsqWGxWbDZisptaYN2d0skT81dTE62s2dcZc7WYHBJnCW5JZb OTwx68e5QoHrWnoHVnCx8toUszFr0OMiQVRVwbrRtJKVh+2rQODIKz2qs2ajceHo IN2Bk35W1xsv6TWGLKCK4zk9k+/Nz5Cqx+FF59bKAIs3kPy3dt10PDjN30jQGR17 DuGZe08wZ+lCMQGXd1XNcjz3GS9qWlc+mbzVpqUATNMQSGbqwW8JyheiNZwfNClw SzWasQD0KipdBiU0hOxBbDko2n4TVFHbnISfBwz0vHEz54HHHtltNd4iBowJuaq4 VG/zuD2CyWaj7p9Xpymrir1xMonVNTRFRL3mkkU8feTxqTEhy2zIix9CbKdGkU5e rEhAAb5i1+/58A8dcCh9X+zULvXjFL6FfbnCGSMrLEq0ymyysc/+w0BPNzq93Jva F8qSx2nUGq0250u+Boc+wL9uB7AXWwR8addyWWlwaAZ52nqVhciWus4gh1jn8CAV EhO+NFbx/9K1rmPkvbJau1/fgnz4+gJS8iBC0pxlyhvt6rNzMFCJB719jcdr05tG JO0131r51g== =Huz8 -----END PGP SIGNATURE----- Merge tag 'block-5.16-2021-12-17' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Fix for hammering on the delayed run queue timer (me) - bcache regression fix for this merge window (Lin) - Fix a divide-by-zero in the blk-iocost code (Tejun) * tag 'block-5.16-2021-12-17' of git://git.kernel.dk/linux-block: bcache: fix NULL pointer reference in cached_dev_detach_finish block: reduce kblockd_mod_delayed_work_on() CPU consumption iocost: Fix divide-by-zero on donation from low hweight cgroup	2021-12-17 11:46:07 -08:00
Linus Torvalds	81eebd5405	- Fix use after free in DM btree remove's rebalance_children(). - Fix DM integrity data corruption, introduced during 5.16 merge, due to improper use of bvec_kmap_local(). -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmG7cKcACgkQxSPxCi2d A1qgxggAn4k9cGzxLTT+xgkufQKwsc+WvegR7Amwb/jzKh5XE9KaXDnMkcyz/4GX nHCfynHRgWjPU6V3cESRz/MApG/7sQ6UGtLgIXkZGbeSHIE4aRYf3AECUhpD/uB+ XPX4kTFz0ZvIHYpk4HielOHVA31DQl+GkYXddDXCijXYmG80rpUgUg2fm0+O+TtQ eCQjbQV173KSbi4vlzeDyK9cp2rIGvk/UfmY9cIw1b3Gd5vpCVStW9r+P8MEpSNA ar5exvN9c3AR/VIVfBS/9rw0T+l56M8L0efPrSXEV9/pdiXHFzEx+sGEnDUE5F3o g9K2VwkLtuk3kubiSV/kjNBIB4cZyA== =razj -----END PGP SIGNATURE----- Merge tag 'for-5.16/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix use after free in DM btree remove's rebalance_children() - Fix DM integrity data corruption, introduced during 5.16 merge, due to improper use of bvec_kmap_local() * tag 'for-5.16/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm integrity: fix data corruption due to improper use of bvec_kmap_local dm btree remove: fix use after free in rebalance_children()	2021-12-16 10:05:49 -08:00
Mike Snitzer	1cef171abd	dm integrity: fix data corruption due to improper use of bvec_kmap_local Commit `25058d1c72` ("dm integrity: use bvec_kmap_local in __journal_read_write") didn't account for __journal_read_write() later adding the biovec's bv_offset. As such using bvec_kmap_local() caused the start of the biovec to be skipped. Trivial test that illustrates data corruption: # integritysetup format /dev/pmem0 # integritysetup open /dev/pmem0 integrityroot # mkfs.xfs /dev/mapper/integrityroot ... bad magic number bad magic number Metadata corruption detected at xfs_sb block 0x0/0x1000 libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000 releasing dirty buffer (bulk) to free list! Fix this by using kmap_local_page() instead of bvec_kmap_local() in __journal_read_write(). Fixes: `25058d1c72` ("dm integrity: use bvec_kmap_local in __journal_read_write") Reported-by: Tony Asleson <tasleson@redhat.com> Reviewed-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2021-12-15 14:16:35 -05:00
Lin Feng	aa97f6cdb7	bcache: fix NULL pointer reference in cached_dev_detach_finish Commit `0259d4498b` ("bcache: move calc_cached_dev_sectors to proper place on backing device detach") tries to fix calc_cached_dev_sectors when bcache device detaches, but now we have: cached_dev_detach_finish ... bcache_device_detach(&dc->disk); ... closure_put(&d->c->caching); d->c = NULL; [explicitly set dc->disk.c to NULL] list_move(&dc->list, &uncached_devices); calc_cached_dev_sectors(dc->disk.c); [passing a NULL pointer] ... Upper codeflows shows how bug happens, this patch fix the problem by caching dc->disk.c beforehand, and cache_set won't be freed under us because c->caching closure at least holds a reference count and closure callback __cache_set_unregister only being called by bch_cache_set_stop which using closure_queue(&c->caching), that means c->caching closure callback for destroying cache_set won't be trigger by previous closure_put(&d->c->caching). So at this stage(while cached_dev_detach_finish is calling) it's safe to access cache_set dc->disk.c. Fixes: `0259d4498b` ("bcache: move calc_cached_dev_sectors to proper place on backing device detach") Signed-off-by: Lin Feng <linf@wangsu.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20211112053629.3437-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-14 20:32:54 -07:00
zhangyue	07641b5f32	md: fix double free of mddev->private in autorun_array() In driver/md/md.c, if the function autorun_array() is called, the problem of double free may occur. In function autorun_array(), when the function do_md_run() returns an error, the function do_md_stop() will be called. The function do_md_run() called function md_run(), but in function md_run(), the pointer mddev->private may be freed. The function do_md_stop() called the function __md_stop(), but in function __md_stop(), the pointer mddev->private also will be freed without judging null. At this time, the pointer mddev->private will be double free, so it needs to be judged null or not. Signed-off-by: zhangyue <zhangyue1@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com>	2021-12-10 09:11:07 -08:00

... 4 5 6 7 8 ...

7421 Коммитов