WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Pankaj Gupta	204d1a6434	md: add comments in md_flush_request() Request coalescing logic is dependent on flush time update in other context. This patch adds comments to understand the code flow better. Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-11-30 10:12:34 -08:00
Pankaj Gupta	81ba3c2462	md: improve variable names in md_flush_request() This patch improves readability by using better variable names in flush request coalescing logic. Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-11-30 10:12:34 -08:00
Kevin Vigor	93decc5636	md/raid10: initialize r10_bio->read_slot before use. In __make_request() a new r10bio is allocated and passed to raid10_read_request(). The read_slot member of the bio is not initialized, and the raid10_read_request() uses it to index an array. This leads to occasional panics. Fix by initializing the field to invalid value and checking for valid value in raid10_read_request(). Cc: stable@vger.kernel.org Signed-off-by: Kevin Vigor <kvigor@gmail.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-11-30 10:12:28 -08:00
Dae R. Jeong	c731b84b51	md: fix a warning caused by a race between concurrent md_ioctl()s Syzkaller reports a warning as belows. WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169 ... Call Trace: ... RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169 RSP: 0018:ffff888096027950 EFLAGS: 00010293 RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2 RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007 RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482 R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932 R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408 __blkdev_driver_ioctl block/ioctl.c:304 [inline] blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606 block_ioctl+0xee/0x130 fs/block_dev.c:1930 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe This is caused by a race between two concurrenct md_ioctl()s closing the array. CPU1 (md_ioctl()) CPU2 (md_ioctl()) ------ ------ set_bit(MD_CLOSING, &mddev->flags); did_set_md_closing = true; WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags)); if(did_set_md_closing) clear_bit(MD_CLOSING, &mddev->flags); Fix the warning by returning immediately if the MD_CLOSING bit is set in &mddev->flags which indicates that the array is being closed. Fixes: `065e519e71` ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop") Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Dae R. Jeong <dae.r.jeong@kaist.ac.kr> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-11-30 10:12:18 -08:00
Mikulas Patocka	67aa3ec3db	dm writecache: fix the maximum number of arguments Advance the maximum number of arguments to 16. This fixes issue where certain operations, combined with table configured args, exceed 10 arguments. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `48debafe4f` ("dm: add writecache target") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-11-17 10:45:06 -05:00
Mikulas Patocka	e5d41cbca1	dm writecache: advance the number of arguments when reporting max_age When reporting the "max_age" value the number of arguments must advance by two. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `3923d4854e` ("dm writecache: implement gradual cleanup") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-11-17 10:45:06 -05:00
Mikulas Patocka	a7a10bce8a	dm integrity: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY Don't use crypto drivers that have the flag CRYPTO_ALG_ALLOCATES_MEMORY set. These drivers allocate memory and thus they are not suitable for block I/O processing. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-11-17 10:45:06 -05:00
Christoph Hellwig	94d91e7f8c	md: remove a spurious call to revalidate_disk_size in update_size None of the ->resize methods updates the disk size, so calling revalidate_disk_size here won't do anything. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-16 08:34:15 -07:00
Christoph Hellwig	2c247c5169	md: use set_capacity_and_notify Use set_capacity_and_notify to set the size of both the disk and block device. This also gets the uevent notifications for the resize for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-16 08:34:15 -07:00
Christoph Hellwig	dc2985a8d5	dm-raid: use set_capacity_and_notify Use set_capacity_and_notify to set the size of both the disk and block device. This also gets the uevent notifications for the resize for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-16 08:34:15 -07:00
Christoph Hellwig	f64d9b2eac	dm: use set_capacity_and_notify Use set_capacity_and_notify to set the size of both the disk and block device. This also gets the uevent notifications for the resize for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-16 08:34:14 -07:00
Christoph Hellwig	28144f9998	md: use __register_blkdev to allocate devices on demand Use the simpler mechanism attached to major_name to allocate a md device when a currently unregistered minor is accessed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-16 08:14:30 -07:00
Christoph Hellwig	a7cb3d2f09	block: remove __blkdev_driver_ioctl Just open code it in the few callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-16 08:14:29 -07:00
Christoph Hellwig	118cf084ad	md: implement ->set_read_only to hook into BLKROSET processing Implement the ->set_read_only method instead of parsing the actual ioctl command. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-16 08:14:29 -07:00
Linus Torvalds	4815519ed0	- Improve DM core's bio splitting to use blk_max_size_offset(). Also fix bio splitting for bios that were deferred to the worker thread due to a DM device being suspended. - Remove DM core's special handling of NVMe devices now that block core has internalized efficiencies drivers previously needed to be concerned about (via now removed direct_make_request). - Fix request-based DM to not bounce through indirect dm_submit_bio; instead have block core make direct call to blk_mq_submit_bio(). - Various DM core cleanups to simplify and improve code. - Update DM cryot to not use drivers that set CRYPTO_ALG_ALLOCATES_MEMORY. - Fix DM raid's raid1 and raid10 discard limits for the purposes of linux-stable. But then remove DM raid's discard limits settings now that MD raid can efficiently handle large discards. - A couple small cleanups across various targets. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl+Fx1gTHHNuaXR6ZXJA cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWk5iB/9pONYmtfQ5oBx4jg/PU8cVYYIfOtwS ZtItFbw7T9bkHVZ8d4hDr5LTq898cADuRD5edlR82gDOcXkiJlb5PqU39RoOTVvF Xz87sWzHdGAK7rdnCMAc2hiX3oQOje9o7NxGeGQ/uPaNU+U/vJS0AZtEAwltocBd j9MGESddBC636Gzbg5C0c0frikXd0am6qp6SCYJNpP5I0G2beHk2YX5Jqt9c7zMk 8kyQend5b5RvkPNWTAjkVfWUsIjwYHh6MF48ZoGvD0X3lWjIBiwyxC0UX5hSXq63 kB+nqxbXcvQLEBtJuDZ2bjyvrwzCVLpmfgLgzxOOU8fI5Q2U0zpsPaa0 =6YDu -----END PGP SIGNATURE----- Merge tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Improve DM core's bio splitting to use blk_max_size_offset(). Also fix bio splitting for bios that were deferred to the worker thread due to a DM device being suspended. - Remove DM core's special handling of NVMe devices now that block core has internalized efficiencies drivers previously needed to be concerned about (via now removed direct_make_request). - Fix request-based DM to not bounce through indirect dm_submit_bio; instead have block core make direct call to blk_mq_submit_bio(). - Various DM core cleanups to simplify and improve code. - Update DM cryot to not use drivers that set CRYPTO_ALG_ALLOCATES_MEMORY. - Fix DM raid's raid1 and raid10 discard limits for the purposes of linux-stable. But then remove DM raid's discard limits settings now that MD raid can efficiently handle large discards. - A couple small cleanups across various targets. * tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: fix request-based DM to not bounce through indirect dm_submit_bio dm: remove special-casing of bio-based immutable singleton target on NVMe dm: export dm_copy_name_and_uuid dm: fix comment in __dm_suspend() dm: fold dm_process_bio() into dm_submit_bio() dm: fix missing imposition of queue_limits from dm_wq_work() thread dm snap persistent: simplify area_io() dm thin metadata: Remove unused local variable when create thin and snap dm raid: remove unnecessary discard limits for raid10 dm raid: fix discard limits for raid1 and raid10 dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY dm: use dm_table_get_device_name() where appropriate in targets dm table: make 'struct dm_table' definition accessible to all of DM core dm: eliminate need for start_io_acct() forward declaration dm: simplify __process_abnormal_io() dm: push use of on-stack flush_bio down to __send_empty_flush() dm: optimize max_io_len() by inlining max_io_len_target_boundary() dm: push md->immutable_target optimization down to __process_bio() dm: change max_io_len() to use blk_max_size_offset() dm table: stack 'chunk_sectors' limit to account for target-specific splitting	2020-10-14 15:05:38 -07:00
Linus Torvalds	7cd4ecd917	drivers-5.10-2020-10-12 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EYWYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsCgD/9Izy/mbiQMmcBPBuQFds2b2SwPAoB4RVcU NU7pcI3EbAlcj7xDF08Z74Sr6MKyg+JhGid15iw47o+qFq6cxDKiESYLIrFmb70R lUDkPr9J4OLNDSZ6hpM4sE6Qg9bzDPhRbAceDQRtVlqjuQdaOS2qZAjNG4qjO8by 3PDO7XHCW+X4HhXiu2PDCKuwyDlHxggYzhBIFZNf58US2BU8+tLn2gvTSvmTb27F w0s5WU1Q5Q0W9RLrp4YTQi4SIIOq03BTSqpRjqhomIzhSQMieH95XNKGRitLjdap 2mFNJ+5I+DTB/TW2BDBrBRXnoV/QNBJsR0DDFnUZsHEejjXKEVt5BRCpSQC9A0WW XUyVE1K+3GwgIxSI8tjPtyPEGzzhnqJjzHPq4LJLGlQje95v9JZ6bpODB7HHtZQt rbNp8IoVQ0n01nIvkkt/vnzCE9VFbWFFQiiu5/+x26iKZXW0pAF9Dnw46nFHoYZi llYvbKDcAUhSdZI8JuqnSnKhi7sLRNPnApBxs52mSX8qaE91sM2iRFDewYXzaaZG NjijYCcUtopUvojwxYZaLnIpnKWG4OZqGTNw1IdgzUtfdxoazpg6+4wAF9vo7FEP AePAUTKrfkGBm95uAP4bRvXBzS9UhXJvBrFW3grzRZybMj617F01yAR4N0xlMXeN jMLrGe7sWA== =xE9E -----END PGP SIGNATURE----- Merge tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "Here are the driver updates for 5.10. A few SCSI updates in here too, in coordination with Martin as they depend on core block changes for the shared tag bitmap. This contains: - NVMe pull requests via Christoph: - fix keep alive timer modification (Amit Engel) - order the PCI ID list more sensibly (Andy Shevchenko) - cleanup the open by controller helper (Chaitanya Kulkarni) - use an xarray for the CSE log lookup (Chaitanya Kulkarni) - support ZNS in nvmet passthrough mode (Chaitanya Kulkarni) - fix nvme_ns_report_zones (Christoph Hellwig) - add a sanity check to nvmet-fc (James Smart) - fix interrupt allocation when too many polled queues are specified (Jeffle Xu) - small nvmet-tcp optimization (Mark Wunderlich) - fix a controller refcount leak on init failure (Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni) - major refactoring of the scanning code (Christoph Hellwig) - MD updates via Song: - Bug fixes in bitmap code, from Zhao Heming - Fix a work queue check, from Guoqing Jiang - Fix raid5 oops with reshape, from Song Liu - Clean up unused code, from Jason Yan - Discard improvements, from Xiao Ni - raid5/6 page offset support, from Yufen Yu - Shared tag bitmap for SCSI/hisi_sas/null_blk (John, Kashyap, Hannes) - null_blk open/active zone limit support (Niklas) - Set of bcache updates (Coly, Dongsheng, Qinglang)" * tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (78 commits) md/raid5: fix oops during stripe resizing md/bitmap: fix memory leak of temporary bitmap md: fix the checking of wrong work queue md/bitmap: md_bitmap_get_counter returns wrong blocks md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks md/raid0: remove unused function is_io_in_chunk_boundary() nvme-core: remove extra condition for vwc nvme-core: remove extra variable nvme: remove nvme_identify_ns_list nvme: refactor nvme_validate_ns nvme: move nvme_validate_ns nvme: query namespace identifiers before adding the namespace nvme: revalidate zone bitmaps in nvme_update_ns_info nvme: remove nvme_update_formats nvme: update the known admin effects nvme: set the queue limits in nvme_update_ns_info nvme: remove the 0 lba_shift check in nvme_update_ns_info nvme: clean up the check for too large logic block sizes nvme: freeze the queue over ->lba_shift updates nvme: factor out a nvme_configure_metadata helper ...	2020-10-13 13:04:41 -07:00
Linus Torvalds	3ad11d7ac8	block-5.10-2020-10-12 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm ZwfpDh2+Tg== =LzyE -----END PGP SIGNATURE----- Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Series of merge handling cleanups (Baolin, Christoph) - Series of blk-throttle fixes and cleanups (Baolin) - Series cleaning up BDI, seperating the block device from the backing_dev_info (Christoph) - Removal of bdget() as a generic API (Christoph) - Removal of blkdev_get() as a generic API (Christoph) - Cleanup of is-partition checks (Christoph) - Series reworking disk revalidation (Christoph) - Series cleaning up bio flags (Christoph) - bio crypt fixes (Eric) - IO stats inflight tweak (Gabriel) - blk-mq tags fixes (Hannes) - Buffer invalidation fixes (Jan) - Allow soft limits for zone append (Johannes) - Shared tag set improvements (John, Kashyap) - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel) - DM no-wait support (Mike, Konstantin) - Request allocation improvements (Ming) - Allow md/dm/bcache to use IO stat helpers (Song) - Series improving blk-iocost (Tejun) - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang, Xianting, Yang, Yufen, yangerkun) * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits) block: fix uapi blkzoned.h comments blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue blk-mq: get rid of the dead flush handle code path block: get rid of unnecessary local variable block: fix comment and add lockdep assert blk-mq: use helper function to test hw stopped block: use helper function to test queue register block: remove redundant mq check block: invoke blk_mq_exit_sched no matter whether have .exit_sched percpu_ref: don't refer to ref->data if it isn't allocated block: ratelimit handle_bad_sector() message blk-throttle: Re-use the throtl_set_slice_end() blk-throttle: Open code __throtl_de/enqueue_tg() blk-throttle: Move service tree validation out of the throtl_rb_first() blk-throttle: Move the list operation after list validation blk-throttle: Fix IO hang for a corner case blk-throttle: Avoid tracking latency if low limit is invalid blk-throttle: Avoid getting the current time if tg->last_finish_time is 0 blk-throttle: Remove a meaningless parameter for throtl_downgrade_state() block: Remove redundant 'return' statement ...	2020-10-13 12:12:44 -07:00
Linus Torvalds	ca1b66922a	* Extend the recovery from MCE in kernel space also to processes which encounter an MCE in kernel space but while copying from user memory by sending them a SIGBUS on return to user space and umapping the faulty memory, by Tony Luck and Youquan Song. * memcpy_mcsafe() rework by splitting the functionality into copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables support for new hardware which can recover from a machine check encountered during a fast string copy and makes that the default and lets the older hardware which does not support that advance recovery, opt in to use the old, fragile, slow variant, by Dan Williams. * New AMD hw enablement, by Yazen Ghannam and Akshay Gupta. * Do not use MSR-tracing accessors in #MC context and flag any fault while accessing MCA architectural MSRs as an architectural violation with the hope that such hw/fw misdesigns are caught early during the hw eval phase and they don't make it into production. * Misc fixes, improvements and cleanups, as always. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EIpUACgkQEsHwGGHe VUouoBAAgwb+NkWZtIqGImV4f+LOyFjhTR/r/7ZyiijXdbhOIuAdc/jQM31mQxug sX2jxaRYnf1n6SLA0ggX99gwr2deRQ/hsNf5Abw55GC+Z1dOxpGL0k59A3ELl1IR H9KYmCAFQIHvzfk38qcdND73XHcgthQoXFBOG9wAPAdgDWnaiWt6lcLAq8OiJTmp D8pInAYhcnL8YXwMGyQQ1KkFn9HwydoWDsK5Ff2shaw2/+dMQqd1zetenbVtjhLb iNYGvV7Bi/RQ8PyMbzmtTWa4kwQJAHC2gptkGxty//2ADGVBbqUQdqF9TjIWCNy5 V6Ldv5zo0/1s7DOzji3htzqkSs/K1Ea6d2LtZjejkJipHKV5x068UC6Fu+PlfS2D VZfcICeapU4G2F3Zvks2DlZ7dVTbHCvoI78Qi7bBgczPUVmk6iqah4xuQaiHyBJc kTFDA4Nnf/026GpoWRiFry9vqdnHBZyLet5A6Y+SoWF0FbhYnCVPpq4MnussYoav lUIi9ZZav6X2RZp9DDM1f9d5xubtKq0DKt93wvzqAhjK0T2DikckJ+riOYkI6N8t fHCBNUkdfgyMzJUTBPAzYQ7RmjbjKWJi7xWP0oz6+GqOJkQfSTVC5/2yEffbb3ya whYRS6iklbl7yshzaOeecXsZcAeK2oGPfoHg34WkHFgXdF5mNgA= =u1Wg -----END PGP SIGNATURE----- Merge tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Extend the recovery from MCE in kernel space also to processes which encounter an MCE in kernel space but while copying from user memory by sending them a SIGBUS on return to user space and umapping the faulty memory, by Tony Luck and Youquan Song. - memcpy_mcsafe() rework by splitting the functionality into copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables support for new hardware which can recover from a machine check encountered during a fast string copy and makes that the default and lets the older hardware which does not support that advance recovery, opt in to use the old, fragile, slow variant, by Dan Williams. - New AMD hw enablement, by Yazen Ghannam and Akshay Gupta. - Do not use MSR-tracing accessors in #MC context and flag any fault while accessing MCA architectural MSRs as an architectural violation with the hope that such hw/fw misdesigns are caught early during the hw eval phase and they don't make it into production. - Misc fixes, improvements and cleanups, as always. * tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Allow for copy_mc_fragile symbol checksum to be generated x86/mce: Decode a kernel instruction to determine if it is copying from user x86/mce: Recover from poison found while copying from user space x86/mce: Avoid tail copy when machine check terminated a copy from user x86/mce: Add _ASM_EXTABLE_CPY for copy user access x86/mce: Provide method to find out the type of an exception handler x86/mce: Pass pointer to saved pt_regs to severity calculation routines x86/copy_mc: Introduce copy_mc_enhanced_fast_string() x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list x86/mce: Add Skylake quirk for patrol scrub reported errors RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE() x86/mce: Annotate mce_rd/wrmsrl() with noinstr x86/mce/dev-mcelog: Do not update kflags on AMD systems x86/mce: Stop mce_reign() from re-computing severity for every CPU x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR x86/mce: Increase maximum number of banks to 64 x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check() x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap RAS/CEC: Fix cec_init() prototype	2020-10-12 10:14:38 -07:00
Song Liu	b44c018cdf	md/raid5: fix oops during stripe resizing KoWei reported crash during raid5 reshape: [ 1032.252932] Oops: 0002 [#1] SMP PTI [...] [ 1032.252943] RIP: 0010:memcpy_erms+0x6/0x10 [...] [ 1032.252947] RSP: 0018:ffffba1ac0c03b78 EFLAGS: 00010286 [ 1032.252949] RAX: 0000784ac0000000 RBX: ffff91bec3d09740 RCX: 0000000000001000 [ 1032.252951] RDX: 0000000000001000 RSI: ffff91be6781c000 RDI: 0000784ac0000000 [ 1032.252953] RBP: ffffba1ac0c03bd8 R08: 0000000000001000 R09: ffffba1ac0c03bf8 [ 1032.252954] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba1ac0c03bf8 [ 1032.252955] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000 [ 1032.252958] FS: 0000000000000000(0000) GS:ffff91becf500000(0000) knlGS:0000000000000000 [ 1032.252959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1032.252961] CR2: 0000784ac0000000 CR3: 000000031780a002 CR4: 00000000001606e0 [ 1032.252962] Call Trace: [ 1032.252969] ? async_memcpy+0x179/0x1000 [async_memcpy] [ 1032.252977] ? raid5_release_stripe+0x8e/0x110 [raid456] [ 1032.252982] handle_stripe_expansion+0x15a/0x1f0 [raid456] [ 1032.252988] handle_stripe+0x592/0x1270 [raid456] [ 1032.252993] handle_active_stripes.isra.0+0x3cb/0x5a0 [raid456] [ 1032.252999] raid5d+0x35c/0x550 [raid456] [ 1032.253002] ? schedule+0x42/0xb0 [ 1032.253006] ? schedule_timeout+0x10e/0x160 [ 1032.253011] md_thread+0x97/0x160 [ 1032.253015] ? wait_woken+0x80/0x80 [ 1032.253019] kthread+0x104/0x140 [ 1032.253022] ? md_start_sync+0x60/0x60 [ 1032.253024] ? kthread_park+0x90/0x90 [ 1032.253027] ret_from_fork+0x35/0x40 This is because cache_size_mutex was unlocked too early in resize_stripes, which races with grow_one_stripe() that grow_one_stripe() allocates a stripe with wrong pool_size. Fix this issue by unlocking cache_size_mutex after updating pool_size. Cc: <stable@vger.kernel.org> # v4.4+ Reported-by: KoWei Sung <winders@amazon.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-10-08 22:38:10 -07:00
Zhao Heming	1383b347a8	md/bitmap: fix memory leak of temporary bitmap Callers of get_bitmap_from_slot() are responsible to free the bitmap. Suggested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-10-08 22:37:39 -07:00
Guoqing Jiang	cf0b9b4821	md: fix the checking of wrong work queue It should check md_rdev_misc_wq instead of md_misc_wq. Fixes: `cc1ffe61c0` ("md: add new workqueue for delete rdev") Cc: <stable@vger.kernel.org> # v5.8+ Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-10-08 22:36:30 -07:00
Zhao Heming	d837f7277f	md/bitmap: md_bitmap_get_counter returns wrong blocks md_bitmap_get_counter() has code: ``` if (bitmap->bp[page].hijacked \|\| bitmap->bp[page].map == NULL) csize = ((sector_t)1) << (bitmap->chunkshift + PAGE_COUNTER_SHIFT - 1); ``` The minus 1 is wrong, this branch should report 2048 bits of space. With "-1" action, this only report 1024 bit of space. This bug code returns wrong blocks, but it doesn't inflence bitmap logic: 1. Most callers focus this function return value (the counter of offset), not the parameter blocks. 2. The bug is only triggered when hijacked is true or map is NULL. the hijacked true condition is very rare. the "map == null" only true when array is creating or resizing. 3. Even the caller gets wrong blocks, current code makes caller just to call md_bitmap_get_counter() one more time. Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-10-08 22:31:29 -07:00
Zhao Heming	a913096dec	md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks The patched code is used to get chunks number, should use round-up div to replace current sector_div. The same code is in md_bitmap_resize(): ``` chunks = DIV_ROUND_UP_SECTOR_T(blocks, 1 << chunkshift); ``` Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-10-08 22:31:29 -07:00
Jason Yan	d7a1c483f7	md/raid0: remove unused function is_io_in_chunk_boundary() This function is no longger needed after commit `20d0189b10` ("block: Introduce new bio_split()"). Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-10-08 22:31:29 -07:00
Mike Snitzer	681cc5e866	dm: fix request-based DM to not bounce through indirect dm_submit_bio It is unnecessary to force request-based DM to call into bio-based dm_submit_bio (via indirect disk->fops->submit_bio) only to have it then call blk_mq_submit_bio(). Fix this by establishing a request-based DM block_device_operations (dm_rq_blk_dops, which doesn't have .submit_bio) and update dm_setup_md_queue() to set md->disk->fops to it for DM_TYPE_REQUEST_BASED. Remove DM_TYPE_REQUEST_BASED conditional in dm_submit_bio and unexport blk_mq_submit_bio. Fixes: `c62b37d96b` ("block: move ->make_request_fn to struct block_device_operations") Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-10-07 18:08:51 -04:00
Mike Snitzer	9c37de297f	dm: remove special-casing of bio-based immutable singleton target on NVMe Since commit `5a6c35f9af` ("block: remove direct_make_request") there is no benefit to DM special-casing NVMe. Remove all code used to establish DM_TYPE_NVME_BIO_BASED. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-10-07 18:08:41 -04:00
Dan Williams	ec6347bb43	x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() In reaction to a proposal to introduce a memcpy_mcsafe_fast() implementation Linus points out that memcpy_mcsafe() is poorly named relative to communicating the scope of the interface. Specifically what addresses are valid to pass as source, destination, and what faults / exceptions are handled. Of particular concern is that even though x86 might be able to handle the semantics of copy_mc_to_user() with its common copy_user_generic() implementation other archs likely need / want an explicit path for this case: On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > However now I see that copy_user_generic() works for the wrong reason. > > It works because the exception on the source address due to poison > > looks no different than a write fault on the user address to the > > caller, it's still just a short copy. So it makes copy_to_user() work > > for the wrong reason relative to the name. > > Right. > > And it won't work that way on other architectures. On x86, we have a > generic function that can take faults on either side, and we use it > for both cases (and for the "in_user" case too), but that's an > artifact of the architecture oddity. > > In fact, it's probably wrong even on x86 - because it can hide bugs - > but writing those things is painful enough that everybody prefers > having just one function. Replace a single top-level memcpy_mcsafe() with either copy_mc_to_user(), or copy_mc_to_kernel(). Introduce an x86 copy_mc_fragile() name as the rename for the low-level x86 implementation formerly named memcpy_mcsafe(). It is used as the slow / careful backend that is supplanted by a fast copy_mc_generic() in a follow-on patch. One side-effect of this reorganization is that separating copy_mc_64.S to its own file means that perf no longer needs to track dependencies for its memcpy_64.S benchmarks. [ bp: Massage a bit. ] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: <stable@vger.kernel.org> Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com	2020-10-06 11:18:04 +02:00
Eric Biggers	07560151db	block: make bio_crypt_clone() able to fail bio_crypt_clone() assumes its gfp_mask argument always includes __GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed. However, bio_crypt_clone() might be called with GFP_ATOMIC via setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via kcryptd_io_read() in drivers/md/dm-crypt.c. Neither case is currently reachable with a bio that actually has an encryption context. However, it's fragile to rely on this. Just make bio_crypt_clone() able to fail, analogous to bio_integrity_clone(). Reported-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Satya Tangirala <satyat@google.com> Cc: Satya Tangirala <satyat@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-05 10:47:43 -06:00
Coly Li	4a784266c6	bcache: remove embedded struct cache_sb from struct cache_set Since bcache code was merged into mainline kerrnel, each cache set only as one single cache in it. The multiple caches framework is here but the code is far from completed. Considering the multiple copies of cached data can also be stored on e.g. md raid1 devices, it is unnecessary to support multiple caches in one cache set indeed. The previous preparation patches fix the dependencies of explicitly making a cache set only have single cache. Now we don't have to maintain an embedded partial super block in struct cache_set, the in-memory super block can be directly referenced from struct cache. This patch removes the embedded struct cache_sb from struct cache_set, and fixes all locations where the superb lock was referenced from this removed super block by referencing the in-memory super block of struct cache. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:30 -06:00
Coly Li	6f9414e0f6	bcache: check and set sync status on cache's in-memory super block Currently the cache's sync status is checked and set on cache set's in- memory partial super block. After removing the embedded struct cache_sb from cache set and reference cache's in-memory super block from struct cache_set, the sync status can set and check directly on cache's super block. This patch checks and sets the cache sync status directly on cache's in-memory super block. This is a preparation for later removing embedded struct cache_sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:30 -06:00
Coly Li	ebaa1ac12b	bcache: remove can_attach_cache() After removing the embedded struct cache_sb from struct cache_set, cache set will directly reference the in-memory super block of struct cache. It is unnecessary to compare block_size, bucket_size and nr_in_set from the identical in-memory super block in can_attach_cache(). This is a preparation patch for latter removing cache_set->sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:30 -06:00
Coly Li	08a1782881	bcache: don't check seq numbers in register_cache_set() In order to update the partial super block of cache set, the seq numbers of cache and cache set are checked in register_cache_set(). If cache's seq number is larger than cache set's seq number, cache set must update its partial super block from cache's super block. It is unncessary when the embedded struct cache_sb is removed from struct cache set. This patch removed the seq numbers checking from register_cache_set(), because later there will be no such partial super block in struct cache set, the cache set will directly reference in-memory super block from struct cache. This is a preparation patch for removing embedded struct cache_sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:30 -06:00
Coly Li	63a96c05cd	bcache: only use bucket_bytes() on struct cache Because struct cache_set and struct cache both have struct cache_sb, macro bucket_bytes() currently are used on both of them. When removing the embedded struct cache_sb from struct cache_set, this macro won't be used on struct cache_set anymore. This patch unifies all bucket_bytes() usage only on struct cache, this is one of the preparation to remove the embedded struct cache_sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:30 -06:00
Coly Li	3c4fae2982	bcache: remove useless bucket_pages() It seems alloc_bucket_pages() is the only user of bucket_pages(). Considering alloc_bucket_pages() is removed from bcache code, it is safe to remove the useless macro bucket_pages() now. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:30 -06:00
Coly Li	421cf1c573	bcache: remove useless alloc_bucket_pages() Now no one uses alloc_bucket_pages() anymore, remove it from bcache.h. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:30 -06:00
Coly Li	4e1ebae3ee	bcache: only use block_bytes() on struct cache Because struct cache_set and struct cache both have struct cache_sb, therefore macro block_bytes() can be used on both of them. When removing the embedded struct cache_sb from struct cache_set, this macro won't be used on struct cache_set anymore. This patch unifies all block_bytes() usage only on struct cache, this is one of the preparation to remove the embedded struct cache_sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:30 -06:00
Coly Li	1132e56e78	bcache: add set_uuid in struct cache_set This patch adds a separated set_uuid[16] in struct cache_set, to store the uuid of the cache set. This is the preparation to remove the embedded struct cache_sb from struct cache_set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:30 -06:00
Coly Li	08fdb2cddb	bcache: remove for_each_cache() Since now each cache_set explicitly has single cache, for_each_cache() is unnecessary. This patch removes this macro, and update all locations where it is used, and makes sure all code logic still being consistent. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:29 -06:00
Coly Li	697e23495c	bcache: explicitly make cache_set only have single cache Currently although the bcache code has a framework for multiple caches in a cache set, but indeed the multiple caches never completed and users use md raid1 for multiple copies of the cached data. This patch does the following change in struct cache_set, to explicitly make a cache_set only have single cache, - Change pointer array "cache[MAX_CACHES_PER_SET]" to a single pointer "cache". - Remove pointer array "*cache_by_alloc[MAX_CACHES_PER_SET]". - Remove "caches_loaded". Now the code looks as exactly what it does in practic: only one cache is used in the cache set. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:29 -06:00
Coly Li	17e4aed830	bcache: remove 'int n' from parameter list of bch_bucket_alloc_set() The parameter 'int n' from bch_bucket_alloc_set() is not cleared defined. From the code comments n is the number of buckets to alloc, but from the code itself 'n' is the maximum cache to iterate. Indeed all the locations where bch_bucket_alloc_set() is called, 'n' is alwasy 1. This patch removes the confused and unnecessary 'int n' from parameter list of bch_bucket_alloc_set(), and explicitly allocates only 1 bucket for its caller. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:29 -06:00
Qinglang Miao	84e5d1363c	bcache: Convert to DEFINE_SHOW_ATTRIBUTE Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. As inode->iprivate equals to third parameter of debugfs_create_file() which is NULL. So it's equivalent to original code logic. Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:29 -06:00
Dongsheng Yang	7e59c506c3	bcache: check c->root with IS_ERR_OR_NULL() in mca_reserve() In mca_reserve(c) macro, we are checking root whether is NULL or not. But that's not enough, when we read the root node in run_cache_set(), if we got an error in bch_btree_node_read_done(), we will return ERR_PTR(-EIO) to c->root. And then we will go continue to unregister, but before calling unregister_shrinker(&c->shrink), there is a possibility to call bch_mca_count(), and we would get a crash with call trace like that: [ 2149.876008] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b5 ... ... [ 2150.598931] Call trace: [ 2150.606439] bch_mca_count+0x58/0x98 [escache] [ 2150.615866] do_shrink_slab+0x54/0x310 [ 2150.624429] shrink_slab+0x248/0x2d0 [ 2150.632633] drop_slab_node+0x54/0x88 [ 2150.640746] drop_slab+0x50/0x88 [ 2150.648228] drop_caches_sysctl_handler+0xf0/0x118 [ 2150.657219] proc_sys_call_handler.isra.18+0xb8/0x110 [ 2150.666342] proc_sys_write+0x40/0x50 [ 2150.673889] __vfs_write+0x48/0x90 [ 2150.681095] vfs_write+0xac/0x1b8 [ 2150.688145] ksys_write+0x6c/0xd0 [ 2150.695127] __arm64_sys_write+0x24/0x30 [ 2150.702749] el0_svc_handler+0xa0/0x128 [ 2150.710296] el0_svc+0x8/0xc Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:29 -06:00
Coly Li	a58e88bfdc	bcache: share register sysfs with async register Previously the experimental async registration uses a separate sysfs file register_async. Now the async registration code seems working well for a while, we can do furtuher testing with it now. This patch changes the async bcache registration shares the same sysfs file /sys/fs/bcache/register (and register_quiet). Async registration will be default behavior if BCACHE_ASYNC_REGISTRATION is set in kernel configure. By default, BCACHE_ASYNC_REGISTRATION is not configured yet. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-02 14:25:29 -06:00
Mike Snitzer	61931c0ee9	dm: export dm_copy_name_and_uuid Allow DM targets to access the configured name and uuid. Also, bump DM ioctl version. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-10-01 15:03:40 -04:00
Mike Snitzer	0cede372ce	dm: fix comment in __dm_suspend() Fix stale references to functions that have been renamed and fix typo. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-10-01 15:03:39 -04:00
Mike Snitzer	b2abdb1b4b	dm: fold dm_process_bio() into dm_submit_bio() dm_process_bio() is only called by dm_submit_bio(), there is no benefit to keeping dm_process_bio() factored out, so fold it. While at it, cleanup dm_submit_bio()'s DMF_BLOCK_IO_FOR_SUSPEND related branching and expand scope of dm_get_live_table() rcu reference on map via common 'out' label to dm_put_live_table(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-10-01 15:03:38 -04:00
Mike Snitzer	0c2915b8c6	dm: fix missing imposition of queue_limits from dm_wq_work() thread If a DM device was suspended when bios were issued to it, those bios would be deferred using queue_io(). Once the DM device was resumed dm_process_bio() could be called by dm_wq_work() for original bio that still needs splitting. dm_process_bio()'s check for current->bio_list (meaning call chain is within ->submit_bio) as a prerequisite for calling blk_queue_split() for "abnormal IO" would result in dm_process_bio() never imposing corresponding queue_limits (e.g. discard_granularity, discard_max_bytes, etc). Fix this by always having dm_wq_work() resubmit deferred bios using submit_bio_noacct(). Side-effect is blk_queue_split() is always called for "abnormal IO" from ->submit_bio, be it from application thread or dm_wq_work() workqueue, so proper bio splitting and depth-first bio submission is performed. For sake of clarity, remove current->bio_list check before call to blk_queue_split(). Also, remove dm_wq_work()'s use of dm_{get,put}_live_table() -- no longer needed since IO will be reissued in terms of ->submit_bio. And rename bio variable from 'c' to 'bio'. Fixes: `cf9c378655` ("dm: fix comment in dm_process_bio()") Reported-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-30 15:34:38 -04:00
Qinglang Miao	7d837c0dd9	dm snap persistent: simplify area_io() Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:12 -04:00
Huaisheng Ye	399c9bdbd6	dm thin metadata: Remove unused local variable when create thin and snap The local variable disk details is not used during the creating of thin & snap devices. Remove them from dm-thin-metadata, and add pointer validity check for pointer value in btree_lookup_raw. Skip memory copy when the caller doesn't need the value. Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:11 -04:00
Mike Snitzer	f0e90b6c66	dm raid: remove unnecessary discard limits for raid10 Commit `bcc90d2804` ("md/raid10: improve raid10 discard request") removes raid10's inability to properly handle large discards. So eliminate associated constraint from dm-raid's raid10 support. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:10 -04:00
Mike Snitzer	e0910c8e4f	dm raid: fix discard limits for raid1 and raid10 Block core warned that discard_granularity was 0 for dm-raid with personality of raid1. Reason is that raid_io_hints() was incorrectly special-casing raid1 rather than raid0. But since commit `29efc390b9` ("md/md0: optimize raid0 discard handling") even raid0 properly handles large discards. Fix raid_io_hints() by removing discard limits settings for raid1. Also, fix limits for raid10 by properly stacking underlying limits as done in blk_stack_limits(). Depends-on: `29efc390b9` ("md/md0: optimize raid0 discard handling") Fixes: `61697a6abd` ("dm: eliminate 'split_discard_bios' flag from DM target interface") Cc: stable@vger.kernel.org Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:09 -04:00
Mikulas Patocka	cd74693870	dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY Don't use crypto drivers that have the flag CRYPTO_ALG_ALLOCATES_MEMORY set. These drivers allocate memory and thus they are unsuitable for block I/O processing. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:08 -04:00
Mike Snitzer	d4a512edcc	dm: use dm_table_get_device_name() where appropriate in targets dm_table_get_device_name() avoids calling dm_table_get_md() followed by dm_device_name() -- saves intermediate dm_table_get_md() call. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:08 -04:00
Mike Snitzer	33bd6f0693	dm table: make 'struct dm_table' definition accessible to all of DM core Move 'struct dm_table' definition from dm-table.c to dm-core.h and update DM core to access its members directly. Helps optimize max_io_len() and other methods slightly. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:07 -04:00
Mike Snitzer	7465d7ac50	dm: eliminate need for start_io_acct() forward declaration Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:06 -04:00
Mike Snitzer	9679b5a7ec	dm: simplify __process_abnormal_io() Only call bio_op() once in switch statement. Also remove the excessive factoring out to one line functions. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:05 -04:00
Mike Snitzer	828678b87e	dm: push use of on-stack flush_bio down to __send_empty_flush() Eliminates duplicate code, no functional change. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:04 -04:00
Mike Snitzer	3720281db9	dm: optimize max_io_len() by inlining max_io_len_target_boundary() Saves redundant dm_target_offset() math. Also, reverse argument order for max_io_len() to be consistent with other similar functions. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:04 -04:00
Mike Snitzer	094ee64d7d	dm: push md->immutable_target optimization down to __process_bio() Also, update associated stale comment in __bind(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:03 -04:00
Mike Snitzer	5091cdec56	dm: change max_io_len() to use blk_max_size_offset() Using blk_max_size_offset() enables DM core's splitting to impose ti->max_io_len (via q->limits.chunk_sectors) and also fallback to respecting q->limits.max_sectors if chunk_sectors isn't set. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:02 -04:00
Mike Snitzer	882ec4e609	dm table: stack 'chunk_sectors' limit to account for target-specific splitting If target set ti->max_io_len it must be used when stacking DM device's queue_limits to establish a 'chunk_sectors' that is compatible with the IO stack. By using lcm_not_zero() care is taken to avoid blindly overriding the chunk_sectors limit stacked up by blk_stack_limits(). Depends-on: `07d098e6bb` ("block: allow 'chunk_sectors' to be non-power-of-2") Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:33:01 -04:00
Mike Snitzer	1471308fb5	Merge remote-tracking branch 'jens/for-5.10/block' into dm-5.10 DM depends on these block 5.10 commits: `22ada802ed` block: use lcm_not_zero() when stacking chunk_sectors `07d098e6bb` block: allow 'chunk_sectors' to be non-power-of-2 `021a24460d` block: add QUEUE_FLAG_NOWAIT `6abc49468e` dm: add support for REQ_NOWAIT and enable it for linear target Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-29 16:31:35 -04:00
Konstantin Khlebnikov	6abc49468e	dm: add support for REQ_NOWAIT and enable it for linear target Add DM target feature flag DM_TARGET_NOWAIT which advertises that target works with REQ_NOWAIT bios. Add dm_table_supports_nowait() and update dm_table_set_restrictions() to set/clear QUEUE_FLAG_NOWAIT accordingly. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-25 08:20:03 -06:00
Christoph Hellwig	4245e52d25	md: don't detour through bd_contains for the gendisk bd_disk is set on all block devices, including those for partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-25 08:18:57 -06:00
Christoph Hellwig	61a27e1f52	md: compare bd_disk instead of bd_contains To check for partitions of the same disk bd_contains works as well, but bd_disk is way more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-25 08:18:57 -06:00
Christoph Hellwig	fa01b1e973	block: add a bdev_is_partition helper Add a littler helper to make the somewhat arcane bd_contains checks a little more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-25 08:18:57 -06:00
Xiao Ni	d3ee2d8415	md/raid10: improve discard request for far layout For far layout, the discard region is not continuous on disks. So it needs far copies r10bio to cover all regions. It needs a way to know all r10bios have finish or not. Similar with raid10_sync_request, only the first r10bio master_bio records the discard bio. Other r10bios master_bio record the first r10bio. The first r10bio can finish after other r10bios finish and then return the discard bio. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:45 -07:00
Xiao Ni	bcc90d2804	md/raid10: improve raid10 discard request Now the discard request is split by chunk size. So it takes a long time to finish mkfs on disks which support discard function. This patch improve handling raid10 discard request. It uses the similar way with patch `29efc390b` (md/md0: optimize raid0 discard handling). But it's a little complex than raid0. Because raid10 has different layout. If raid10 is offset layout and the discard request is smaller than stripe size. There are some holes when we submit discard bio to underlayer disks. For example: five disks (disk1 - disk5) D01 D02 D03 D04 D05 D05 D01 D02 D03 D04 D06 D07 D08 D09 D10 D10 D06 D07 D08 D09 The discard bio just wants to discard from D03 to D10. For disk3, there is a hole between D03 and D08. For disk4, there is a hole between D04 and D09. D03 is a chunk, raid10_write_request can handle one chunk perfectly. So the part that is not aligned with stripe size is still handled by raid10_write_request. If reshape is running when discard bio comes and the discard bio spans the reshape position, raid10_write_request is responsible to handle this discard bio. I did a test with this patch set. Without patch: time mkfs.xfs /dev/md0 real4m39.775s user0m0.000s sys0m0.298s With patch: time mkfs.xfs /dev/md0 real0m0.105s user0m0.000s sys0m0.007s nvme3n1 259:1 0 477G 0 disk └─nvme3n1p1 259:10 0 50G 0 part nvme4n1 259:2 0 477G 0 disk └─nvme4n1p1 259:11 0 50G 0 part nvme5n1 259:6 0 477G 0 disk └─nvme5n1p1 259:12 0 50G 0 part nvme2n1 259:9 0 477G 0 disk └─nvme2n1p1 259:15 0 50G 0 part nvme0n1 259:13 0 477G 0 disk └─nvme0n1p1 259:14 0 50G 0 part Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:45 -07:00
Xiao Ni	f046f5d0d7	md/raid10: pull codes that wait for blocked dev into one function The following patch will reuse these logics, so pull the same codes into one function. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:45 -07:00
Xiao Ni	8650a88901	md/raid10: extend r10bio devs to raid disks Now it allocs r10bio->devs[conf->copies]. Discard bio needs to submit to all member disks and it needs to use r10bio. So extend to r10bio->devs[geo.raid_disks]. Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:45 -07:00
Xiao Ni	2628089b74	md: add md_submit_discard_bio() for submitting discard bio Move these logic from raid0.c to md.c, so that we can also use it in raid10.c. Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:45 -07:00
Zhen Lei	e287308b83	md: Simplify code with existing definition RESYNC_SECTORS in raid10.c #define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9) "RESYNC_BLOCK_SIZE/512" is equal to "RESYNC_BLOCK_SIZE >> 9", replace it with RESYNC_SECTORS. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:45 -07:00
Yufen Yu	3891258443	md/raid5: reallocate page array after setting new stripe_size When try to resize stripe_size, we also need to free old shared page array and allocate new. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:45 -07:00
Yufen Yu	f16acaf328	md/raid5: resize stripe_head when reshape array When reshape array, we try to reuse shared pages of old stripe_head, and allocate more for the new one if needed. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:45 -07:00
Yufen Yu	046169f048	md/raid5: let multiple devices of stripe_head share page In current implementation, grow_buffers() uses alloc_page() to allocate the buffers for each stripe_head, i.e. allocate a page for each dev[i] in stripe_head. After setting stripe_size as a configurable value by writing sysfs entry, it means that we always allocate 64K buffers, but just use 4K of them when stripe_size is 4K in 64KB arm64. To avoid wasting memory, we try to let multiple sh->dev share one real page. That means, multiple sh->dev[i].page will point to the only page with different offset. Example of 64K PAGE_SIZE and 4K stripe_size as following: 64K PAGE_SIZE +---+---+---+---+------------------------------+ \| \| \| \| \| \| \| \| \| \| +-+-+-+-+-+-+-+-+------------------------------+ ^ ^ ^ ^ \| \| \| +----------------------------+ \| \| \| \| \| \| +-------------------+ \| \| \| \| \| \| +----------+ \| \| \| \| \| \| +-+ \| \| \| \| \| \| \| +-----+-----+------+-----+------+-----+------+------+ sh \| offset(0) \| offset(4K) \| offset(8K) \| offset(12K) \| + +-----------+------------+------------+-------------+ +----> dev[0].page dev[1].page dev[2].page dev[3].page A new 'pages' array will be added into stripe_head to record shared page used by this stripe_head. Allocate them when grow_buffers() and free them when shrink_buffers(). After trying to share page, the users of sh->dev[i].page need to take care of the related page offset: page of issued bio and page passed to xor compution functions. But thanks for previous different page offset supported. Here, we just need to set correct dev[i].offset. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:44 -07:00
Yufen Yu	4f86ff5580	md/raid6: let async recovery function support different page offset For now, asynchronous raid6 recovery calculate functions are require common offset for pages. But, we expect them to support different page offset after introducing stripe shared page. Do that by simplily adding page offset where each page address are referred. Then, replace the old interface with the new ones in raid6 and raid6test. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:44 -07:00
Yufen Yu	d69454bc9f	md/raid6: let syndrome computor support different page offset For now, syndrome compute functions require common offset in the pages array. However, we expect them to support different offset when try to use shared page in the following. Simplily covert them by adding page offset where each page address are referred. Since the only caller of async_gen_syndrome() and async_syndrome_val() are in raid6, we don't want to reserve the old interface but modify the interface directly. After that, replacing old interfaces with new ones for raid6 and raid6test. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:44 -07:00
Yufen Yu	a7c224a820	md/raid5: convert to new xor compution interface We try to replace async_xor() and async_xor_val() with the new introduced interface async_xor_offs() and async_xor_val_offs() for raid456. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:44 -07:00
Yufen Yu	248728dd04	md/raid5: make async_copy_data() to support different page offset ops_run_biofill() and ops_run_biodrain() will call async_copy_data() to copy sh->dev[i].page from or to bio page. For now, it implies the offset of dev[i].page is 0. But we want to support different page offset in the following. Thus, pass page offset to these functions and replace 'page_offset' with 'page_offset + poff'. No functional change. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:44 -07:00
Yufen Yu	7aba13b715	md/raid5: add a new member of offset into r5dev Add a new member of offset into struct r5dev. It indicates the offset of related dev[i].page. For now, since each device have a privated page, the value is always 0. Thus, we set offset as 0 when allcate page in grow_buffers() and resize_stripes(). To support following different page offset, we try to use the page offset rather than '0' directly for async_memcpy() and ops_run_io(). We try to support different page offset for xor compution functions in the following. To avoid repeatly allocate a new array each time, we add a memory region into scribble buffer to record offset. No functional change. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:44 -07:00
Xianting Tian	313b825fa2	md: only calculate blocksize once and use i_blocksize() We alreday has the interface i_blocksize(), which can be used to get blocksize, so use it. Only calculate blocksize once and use it within read_page(). Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Song Liu <songliubraving@fb.com>	2020-09-24 16:44:44 -07:00
Jens Axboe	ac8f7a0264	Merge branch 'for-5.10/block' into for-5.10/drivers * for-5.10/block: (140 commits) bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag bdi: invert BDI_CAP_NO_ACCT_WB bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag mm: use SWP_SYNCHRONOUS_IO more intelligently bdi: remove BDI_CAP_SYNCHRONOUS_IO bdi: remove BDI_CAP_CGROUP_WRITEBACK block: lift setting the readahead size into the block layer md: update the optimal I/O size on reshape bdi: initialize ->ra_pages and ->io_pages in bdi_init aoe: set an optimal I/O size bcache: inherit the optimal I/O size drbd: remove dead code in device_to_statistics fs: remove the unused SB_I_MULTIROOT flag block: mark blkdev_get static PM: mm: cleanup swsusp_swap_check mm: split swap_type_of PM: rewrite is_hibernate_resume_dev to not require an inode mm: cleanup claim_swapfile ocfs2: cleanup o2hb_region_dev_store dasd: cleanup dasd_scan_partitions ...	2020-09-24 13:44:39 -06:00
Christoph Hellwig	1cb039f3dc	bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag The BDI_CAP_STABLE_WRITES is one of the few bits of information in the backing_dev_info shared between the block drivers and the writeback code. To help untangling the dependency replace it with a queue flag and a superblock flag derived from it. This also helps with the case of e.g. a file system requiring stable writes due to its own checksumming, but not forcing it on other users of the block device like the swap code. One downside is that we an't support the stable_pages_required bdi attribute in sysfs anymore. It is replaced with a queue attribute which also is writable for easier testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-24 13:43:39 -06:00
Christoph Hellwig	c2e4cd57cf	block: lift setting the readahead size into the block layer Drivers shouldn't really mess with the readahead size, as that is a VM concept. Instead set it based on the optimal I/O size by lifting the algorithm from the md driver when registering the disk. Also set bdi->io_pages there as well by applying the same scheme based on max_sectors. To ensure the limits work well for stacking drivers a new helper is added to update the readahead limits from the block limits, which is also called from disk_stack_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-24 13:43:39 -06:00
Christoph Hellwig	16ef510139	md: update the optimal I/O size on reshape The raid5 and raid10 drivers currently update the read-ahead size, but not the optimal I/O size on reshape. To prepare for deriving the read-ahead size from the optimal I/O size make sure it is updated as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-24 13:43:39 -06:00
Christoph Hellwig	5d4ce78b25	bcache: inherit the optimal I/O size Inherit the optimal I/O size setting just like the readahead window, as any reason to do larger I/O does not apply to just readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-24 13:43:38 -06:00
Mike Snitzer	cf9c378655	dm: fix comment in dm_process_bio() Refer to the correct function (->submit_bio instead of ->queue_bio). Also, add details about why using blk_queue_split() isn't needed for dm_wq_work()'s call to dm_process_bio(). Fixes: `c62b37d96b` ("block: move ->make_request_fn to struct block_device_operations") Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-21 19:49:15 -04:00
Mike Snitzer	ee1dfad532	dm: fix bio splitting and its bio completion order for regular IO dm_queue_split() is removed because __split_and_process_bio() _must_ handle splitting bios to ensure proper bio submission and completion ordering as a bio is split. Otherwise, multiple recursive calls to ->submit_bio will cause multiple split bios to be allocated from the same ->bio_split mempool at the same time. This would result in deadlock in low memory conditions because no progress could be made (only one bio is available in ->bio_split mempool). This fix has been verified to still fix the loss of performance, due to excess splitting, that commit `120c9257f5` provided. Fixes: `120c9257f5` ("Revert "dm: always call blk_queue_split() in dm_process_bio()"") Cc: stable@vger.kernel.org # 5.0+, requires custom backport due to 5.9 changes Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-21 19:42:20 -04:00
Jan Kara	e2ec512825	dm: Call proper helper to determine dax support DM was calling generic_fsdax_supported() to determine whether a device referenced in the DM table supports DAX. However this is a helper for "leaf" device drivers so that they don't have to duplicate common generic checks. High level code should call dax_supported() helper which that calls into appropriate helper for the particular device. This problem manifested itself as kernel messages: dm-3: error: dax access failed (-95) when lvm2-testsuite run in cases where a DM device was stacked on top of another DM device. Fixes: `7bf7eac8d6` ("dax: Arrange for dax_supported check to span multiple devices") Cc: <stable@vger.kernel.org> Tested-by: Adrian Huang <ahuang12@lenovo.com> Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mike Snitzer <snitzer@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/r/160061715195.13131.5503173247632041975.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2020-09-20 08:55:09 -07:00
Dan Williams	02186d8897	dm/dax: Fix table reference counts A recent fix to the dm_dax_supported() flow uncovered a latent bug. When dm_get_live_table() fails it is still required to drop the srcu_read_lock(). Without this change the lvm2 test-suite triggers this warning: # lvm2-testsuite --only pvmove-abort-all.sh WARNING: lock held when returning to user space! 5.9.0-rc5+ #251 Tainted: G OE ------------------------------------------------ lvm/1318 is leaving the kernel with locks still held! 1 lock held by lvm/1318: #0: ffff9372abb5a340 (&md->io_barrier){....}-{0:0}, at: dm_get_live_table+0x5/0xb0 [dm_mod] ...and later on this hang signature: INFO: task lvm:1344 blocked for more than 122 seconds. Tainted: G OE 5.9.0-rc5+ #251 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:lvm state:D stack: 0 pid: 1344 ppid: 1 flags:0x00004000 Call Trace: __schedule+0x45f/0xa80 ? finish_task_switch+0x249/0x2c0 ? wait_for_completion+0x86/0x110 schedule+0x5f/0xd0 schedule_timeout+0x212/0x2a0 ? __schedule+0x467/0xa80 ? wait_for_completion+0x86/0x110 wait_for_completion+0xb0/0x110 __synchronize_srcu+0xd1/0x160 ? __bpf_trace_rcu_utilization+0x10/0x10 __dm_suspend+0x6d/0x210 [dm_mod] dm_suspend+0xf6/0x140 [dm_mod] Fixes: `7bf7eac8d6` ("dax: Arrange for dax_supported check to span multiple devices") Cc: <stable@vger.kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Reported-by: Adrian Huang <ahuang12@lenovo.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Adrian Huang <ahuang12@lenovo.com> Link: https://lore.kernel.org/r/160045867590.25663.7548541079217827340.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2020-09-20 08:33:56 -07:00
Song Liu	0806e60f31	bcache: use part_[begin\|end]_io_acct instead of disk_[begin\|end]_io_acct This enables proper statistics in /proc/diskstats for bcache partitions. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-11 16:41:30 -06:00
Song Liu	00fe60eae9	md: use part_[begin\|end]_io_acct instead of disk_[begin\|end]_io_acct This enables proper statistics in /proc/diskstats for md partitions. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-11 16:41:30 -06:00
Christoph Hellwig	818077d6e0	md: use bdev_check_media_change The md driver does not have a ->revalidate_disk method, so it can just use bdev_check_media_change without any additional changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-10 09:32:31 -06:00
Ye Bin	3a653b205f	dm thin metadata: Fix use-after-free in dm_bm_set_read_only The following error ocurred when testing disk online/offline: [ 301.798344] device-mapper: thin: 253:5: aborting current metadata transaction [ 301.848441] device-mapper: thin: 253:5: failed to abort metadata transaction [ 301.849206] Aborting journal on device dm-26-8. [ 301.850489] EXT4-fs error (device dm-26) in __ext4_new_inode:943: Journal has aborted [ 301.851095] EXT4-fs (dm-26): Delayed block allocation failed for inode 398742 at logical offset 181 with max blocks 19 with error 30 [ 301.854476] BUG: KASAN: use-after-free in dm_bm_set_read_only+0x3a/0x40 [dm_persistent_data] Reason is: metadata_operation_failed abort_transaction dm_pool_abort_metadata __create_persistent_data_objects r = __open_or_format_metadata if (r) --> If failed will free pmd->bm but pmd->bm not set NULL dm_block_manager_destroy(pmd->bm); set_pool_mode dm_pool_metadata_read_only(pool->pmd); dm_bm_set_read_only(pmd->bm); --> use-after-free Add checks to see if pmd->bm is NULL in dm_bm_set_read_only and dm_bm_set_read_write functions. If bm is NULL it means creating the bm failed and so dm_bm_is_read_only must return true. Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-02 13:38:40 -04:00
Ye Bin	219403d7e5	dm thin metadata: Avoid returning cmd->bm wild pointer on error Maybe __create_persistent_data_objects() caller will use PTR_ERR as a pointer, it will lead to some strange things. Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-02 13:38:31 -04:00
Ye Bin	d16ff19e69	dm cache metadata: Avoid returning cmd->bm wild pointer on error Maybe __create_persistent_data_objects() caller will use PTR_ERR as a pointer, it will lead to some strange things. Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-02 13:38:24 -04:00
Christoph Hellwig	de09077c89	block: remove revalidate_disk() Remove the now unused helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-02 08:00:26 -06:00
Christoph Hellwig	659e56ba86	block: add a new revalidate_disk_size helper revalidate_disk is a relative awkward helper for driver use, as it first calls an optional driver method and then updates the block device size, while most callers either don't need the method call at all, or want to keep state between the caller and the called method. Add a revalidate_disk_size helper that just performs the update of the block device size from the gendisk one, and switch all drivers that do not implement ->revalidate_disk to use the new helper instead of revalidate_disk() Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-02 08:00:07 -06:00
Christoph Hellwig	c2b4bb8cb3	block: fix locking for struct block_device size updates Two different callers use two different mutexes for updating the block device size, which obviously doesn't help to actually protect against concurrent updates from the different callers. In addition one of the locks, bd_mutex is rather prone to deadlocks with other parts of the block stack that use it for high level synchronization. Switch to using a new spinlock protecting just the size updates, as that is all we need, and make sure everyone does the update through the proper helper. This fixes a bug reported with the nvme revalidating disks during a hot removal operation, which can currently deadlock on bd_mutex. Reported-by: Xianting Tian <xianting_tian@126.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-01 16:49:25 -06:00
Mikulas Patocka	e27fec66f0	dm integrity: fix error reporting in bitmap mode after creation The dm-integrity target did not report errors in bitmap mode just after creation. The reason is that the function integrity_recalc didn't clean up ic->recalc_bitmap as it proceeded with recalculation. Fix this by updating the bitmap accordingly -- the double shift serves to rounddown. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `468dfca38b` ("dm integrity: add a bitmap mode") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2020-09-01 16:41:57 -04:00

1 2 3 4 5 ...

6541 Коммитов