WSL2-Linux-Kernel

Commit graph

Author	SHA1	Message	Date
Jens Axboe	cbab6ae0d0	nvme updates for Linux 5.16 - fix a multipath partition scanning deadlock (Hannes Reinecke) - generate uevent once a multipath namespace is operational again (Hannes Reinecke) - support unique discovery controller NQNs (Hannes Reinecke) - fix use-after-free when a port is removed (Israel Rukshin) - clear shadow doorbell memory on resets (Keith Busch) - use struct_size (Len Baker) - add error handling support for add_disk (Luis Chamberlain) - limit the maximal queue size for RDMA controllers (Max Gurtovoy) - use a few more symbolic names (Max Gurtovoy) - fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy) - add support for ->map_queues on FC (Saurav Kashyap) Merge tag 'nvme-5.16-2021-10-21' of git://git.infradead.org/nvme into for-5.16/drivers Pull NVMe updates from Christoph: "nvme updates for Linux 5.16 - fix a multipath partition scanning deadlock (Hannes Reinecke) - generate uevent once a multipath namespace is operational again (Hannes Reinecke) - support unique discovery controller NQNs (Hannes Reinecke) - fix use-after-free when a port is removed (Israel Rukshin) - clear shadow doorbell memory on resets (Keith Busch) - use struct_size (Len Baker) - add error handling support for add_disk (Luis Chamberlain) - limit the maximal queue size for RDMA controllers (Max Gurtovoy) - use a few more symbolic names (Max Gurtovoy) - fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy) - add support for ->map_queues on FC (Saurav Kashyap)" * tag 'nvme-5.16-2021-10-21' of git://git.infradead.org/nvme: (23 commits) nvmet: use struct_size over open coded arithmetic nvme: drop scan_lock and always kick requeue list when removing namespaces nvme-pci: clear shadow doorbell memory on resets nvme-rdma: fix error code in nvme_rdma_setup_ctrl nvme-multipath: add error handling support for add_disk() nvmet: use macro definitions for setting cmic value nvmet: use macro definition for setting nmic value nvme: display correct subsystem NQN nvme: Add connect option 'discovery' nvme: expose subsystem type in sysfs attribute 'subsystype' nvmet: set 'CNTRLTYPE' in the identify controller data nvmet: add nvmet_is_disc_subsys() helper nvme: add CNTRLTYPE definitions for 'identify controller' nvmet: make discovery NQN configurable nvmet-rdma: implement get_max_queue_size controller op nvmet: add get_max_queue_size op for controllers nvme-rdma: limit the maximal queue size for RDMA controllers nvmet-tcp: fix use-after-free when a port is removed nvmet-rdma: fix use-after-free when a port is removed nvmet: fix use-after-free when a port is removed ...	2021-10-21 08:25:54 -06:00
Len Baker	117d5b6d00	nvmet: use struct_size over open coded arithmetic As noted in the "Deprecated Interfaces, Language Features, Attributes, and Conventions" documentation [1], size calculations (especially multiplication) should not be performed in memory allocator (or similar) function arguments due to the risk of them overflowing. This could lead to values wrapping around and a smaller allocation being made than the caller was expecting. Using those allocations could lead to linear overflows of heap memory and other misbehaviors. In this case this is not actually dynamic size: all the operands involved in the calculation are constant values. However it is better to refactor this anyway, just to keep the open-coded math idiom out of code. So, use the struct_size() helper to do the arithmetic instead of the argument "size + count * size" in the kmalloc() function. This code was detected with the help of Coccinelle and audited and fixed manually. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments Signed-off-by: Len Baker <len.baker@gmx.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:23:30 +02:00
Hannes Reinecke	2b81a5f015	nvme: drop scan_lock and always kick requeue list when removing namespaces When reading the partition table on initial scan hits an I/O error the I/O will hang with the scan_mutex held: [<0>] do_read_cache_page+0x49b/0x790 [<0>] read_part_sector+0x39/0xe0 [<0>] read_lba+0xf9/0x1d0 [<0>] efi_partition+0xf1/0x7f0 [<0>] bdev_disk_changed+0x1ee/0x550 [<0>] blkdev_get_whole+0x81/0x90 [<0>] blkdev_get_by_dev+0x128/0x2e0 [<0>] device_add_disk+0x377/0x3c0 [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core] [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core] [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core] [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core] [<0>] nvme_scan_work+0x168/0x310 [nvme_core] [<0>] process_one_work+0x231/0x420 and trying to delete the controller will deadlock as it tries to grab the scan mutex: [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core] [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core] [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core] As we're now properly ordering the namespace list there is no need to hold the scan_mutex in nvme_mpath_clear_ctrl_paths() anymore. And we always need to kick the requeue list as the path will be marked as unusable and I/O will be requeued _without_ a current path. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:23:30 +02:00
Keith Busch	58847f12fe	nvme-pci: clear shadow doorbell memory on resets The host memory doorbell and event buffers need to be initialized on each reset so the driver doesn't observe stale values from the previous instantiation. Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: John Levon <john.levon@nutanix.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:23:29 +02:00
Max Gurtovoy	0974812200	nvme-rdma: fix error code in nvme_rdma_setup_ctrl In case that icdoff is not zero or mandatory keyed sgls are not supported by the NVMe/RDMA target, we'll go to error flow but we'll return 0 to the caller. Fix it by returning an appropriate error code. Fixes: `c66e2998c8` ("nvme-rdma: centralize controller setup sequence") Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:03 +02:00
Luis Chamberlain	11384580e3	nvme-multipath: add error handling support for add_disk() We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Since we now can tell for sure when a disk was added, move setting the bit NVME_NSHEAD_DISK_LIVE only when we did add the disk successfully. Nothing to do here as the cleanup is done elsewhere. We take care and use test_and_set_bit() because it protects against two nvme paths simultaneously calling device_add_disk() on the same namespace head. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:03 +02:00
Max Gurtovoy	d56ae18f06	nvmet: use macro definitions for setting cmic value This makes the code more readable. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:03 +02:00
Max Gurtovoy	571b5444d1	nvmet: use macro definition for setting nmic value This makes the code more readable. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:02 +02:00
Hannes Reinecke	e5ea42faa7	nvme: display correct subsystem NQN With discovery controllers supporting unique subsystem NQNs the actual subsystem NQN might be different from that one passed in via the connect args. So add a helper to display the resulting subsystem NQN. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:02 +02:00
Hannes Reinecke	20e8b689c9	nvme: Add connect option 'discovery' Add a connect option 'discovery' to specify that the connection should be made to a discovery controller, not a normal I/O controller. With discovery controllers supporting unique subsystem NQNs we cannot easily distinguish by the subsystem NQN if this should be a discovery connection, but we need this information to blank out options not supported by discovery controllers. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:02 +02:00
Hannes Reinecke	954ae16681	nvme: expose subsystem type in sysfs attribute 'subsystype' With unique discovery controller NQNs we cannot distinguish the subsystem type by the NQN alone, but need to check the subsystem type, too. So expose the subsystem type in a new sysfs attribute 'subsystype'. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:02 +02:00
Hannes Reinecke	d3aef70124	nvmet: set 'CNTRLTYPE' in the identify controller data Set the correct 'CNTRLTYPE' field in the identify controller data. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:02 +02:00
Hannes Reinecke	a294711ed5	nvmet: add nvmet_is_disc_subsys() helper Add a helper function to determine if a given subsystem is a discovery subsystem. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:02 +02:00
Hannes Reinecke	e15a8a9755	nvme: add CNTRLTYPE definitions for 'identify controller' Update the 'identify controller' structure to define the newly added CNTRLTYPE field. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:01 +02:00
Hannes Reinecke	626851e922	nvmet: make discovery NQN configurable TPAR8013 allows for unique discovery NQNs, so make the discovery controller NQN configurable by exposing a subsys attribute 'discovery_nqn'. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:01 +02:00
Max Gurtovoy	c7d792f9b8	nvmet-rdma: implement get_max_queue_size controller op Limit the maximal queue size for RDMA controllers. Today, the target reports a limit of 1024 and this limit isn't valid for some of the RDMA based controllers. For now, limit RDMA transport to 128 entries (the max queue depth configured for Linux NVMe/RDMA host). Future general solution should use RDMA/core API to calculate this size according to device capabilities and number of WRs needed per NVMe IO request. Reported-by: Mark Ruijter <mruijter@primelogic.nl> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:01 +02:00
Max Gurtovoy	6d1555cc41	nvmet: add get_max_queue_size op for controllers Some transports, such as RDMA, would like to set the queue size according to device/port/ctrl characteristics. Add a new nvmet transport op that is called during ctrl initialization. This will not affect transports that don't implement this option. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:01 +02:00
Max Gurtovoy	44c3c6257e	nvme-rdma: limit the maximal queue size for RDMA controllers The current limit of 1024 isn't valid for some of the RDMA based ctrls. In case the target exposes a cap with a larger number of entries (e.g. 1024), the initiator may fail to create a QP with this size. Thus limit to a value that works for all RDMA adapters. Future general solution should use RDMA/core API to calculate this size according to device capabilities and number of WRs needed per NVMe IO request. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:01 +02:00
Israel Rukshin	2351ead99c	nvmet-tcp: fix use-after-free when a port is removed When removing a port, all its controllers are being removed, but there are queues on the port that don't belong to any controller (during connection time). This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_alloc_ctrl). Those queues should be destroyed before freeing the port via configfs. Destroying the remaining queues after the accept_work was cancelled guarantees that no new queue will be created. Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:00 +02:00
Israel Rukshin	fcf73a804c	nvmet-rdma: fix use-after-free when a port is removed When removing a port, all its controllers are being removed, but there are queues on the port that don't belong to any controller (during connection time). This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_alloc_ctrl). Those queues should be destroyed before freeing the port via configfs. Destroying the remaining queues after the RDMA-CM was destroyed guarantees that no new queue will be created. Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:00 +02:00
Israel Rukshin	e3e19dcc4c	nvmet: fix use-after-free when a port is removed When a port is removed through configfs, any connected controllers are starting teardown flow asynchronously and can still send commands. This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_parse_io_cmd). To fix this, wait for all the teardown scheduled works to complete (like release_work at rdma/tcp drivers). This ensures there are no active controllers when the port is eventually removed. Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:00 +02:00
Saurav Kashyap	2b2af50ae8	qla2xxx: add ->map_queues support for nvme Implement ->map_queues and use the block layer blk_mq_pci_map_queues helper for mapping queues to CPUs. With this mapping, a performance increase of at least 10% is observed. Signed-off-by: Saurav Kashyap <skashyap@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:00 +02:00
Saurav Kashyap	01d838164b	nvme-fc: add support for ->map_queues NVMe FC doesn't have support for map queues, unlike the PCI, RDMA and TCP transports. Add a ->map_queues callout for the LLDDs to provide such functionality. Signed-off-by: Saurav Kashyap <skashyap@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:00 +02:00
Hannes Reinecke	f6f09c15a7	nvme: generate uevent once a multipath namespace is operational again When fast_io_fail_tmo is set I/O will be aborted while recovery is still ongoing. This causes MD to set the namespace to failed, and no further I/O will be submitted to that namespace. However, once the recovery succeeds and the namespace becomes operational again the NVMe subsystem doesn't send a notification, so MD cannot automatically reinstate operation and requires manual interaction. This patch will send a KOBJ_CHANGE uevent per multipathed namespace once the underlying controller transitions to LIVE, allowing an automatic MD reassembly with these udev rules: /etc/udev/rules.d/65-md-auto-re-add.rules: SUBSYSTEM!="block", GOTO="md_end" ACTION!="change", GOTO="md_end" ENV{ID_FS_TYPE}!="linux_raid_member", GOTO="md_end" PROGRAM="/sbin/md_raid_auto_readd.sh $devnode" LABEL="md_end" /sbin/md_raid_auto_readd.sh: MDADM=/sbin/mdadm DEVNAME=$1 export $(${MDADM} --examine --export ${DEVNAME}) if [ -z "${MD_UUID}" ]; then exit 1 fi UUID_LINK=$(readlink /dev/disk/by-id/md-uuid-${MD_UUID}) MD_DEVNAME=${UUID_LINK##/} export $(${MDADM} --detail --export /dev/${MD_DEVNAME}) if [ -z "${MD_METADATA}" ] ; then exit 1 fi if [ $(cat /sys/block/${MD_DEVNAME}/md/degraded) != 1 ]; then echo "${MD_DEVNAME}: array not degraded, nothing to do" exit 0 fi MD_STATE=$(cat /sys/block/${MD_DEVNAME}/md/array_state) if [ ${MD_STATE} != "clean" ] ; then echo "${MD_DEVNAME}: array state ${MD_STATE}, cannot re-add" exit 1 fi MD_VARNAME="MD_DEVICE_dev_${DEVNAME##/}_ROLE" if [ ${!MD_VARNAME} = "spare" ] ; then ${MDADM} --manage /dev/${MD_DEVNAME} --re-add ${DEVNAME} fi Changes to v2: - Add udev rules example to description Changes to v1: - use disk_uevent() as suggested by hch Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-10-20 19:16:00 +02:00
Christoph Hellwig	39fa7a9555	bcache: remove bch_crc64_update bch_crc64_update is an entirely pointless wrapper around crc64_be. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20211020143812.6403-9-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:40:54 -06:00
Christoph Hellwig	00387bd21d	bcache: use bvec_kmap_local in bch_data_verify Using local kmaps slightly reduces the chances to stray writes, and the bvec interface cleans up the code a little bit. Also switch from page_address to bvec_kmap_local for cbv to be on the safe side and to avoid pointlessly poking into bvec internals. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20211020143812.6403-8-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:40:54 -06:00
Christoph Hellwig	0f5cd7815f	bcache: remove the backing_dev_name field from struct cached_dev Just use the %pg format specifier to print the name directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20211020143812.6403-7-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:40:54 -06:00
Christoph Hellwig	7e84c21507	bcache: remove the cache_dev_name field from struct cache Just use the %pg format specifier to print the name directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20211020143812.6403-6-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:40:54 -06:00
Lin Feng	0259d4498b	bcache: move calc_cached_dev_sectors to proper place on backing device detach Calculation of cache_set's cached sectors is done by travelling cached_devs list as shown below: static void calc_cached_dev_sectors(struct cache_set *c) { ... list_for_each_entry(dc, &c->cached_devs, list) sectors += bdev_sectors(dc->bdev); c->cached_dev_sectors = sectors; } But cached_dev won't be unlinked from c->cached_devs list until we call following list_move(&dc->list, &uncached_devices), so previous fix in 'commit `46010141da` ("bcache: recal cached_dev_sectors on detach")' is wrong, now we move it to its right place. Signed-off-by: Lin Feng <linf@wangsu.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20211020143812.6403-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:40:54 -06:00
Chao Yu	d55f7cb2e5	bcache: fix error info in register_bcache() In register_bcache(), there are several cases we didn't set correct error info (return value and/or error message): - if kzalloc() fails, it needs to return ENOMEM and print "cannot allocate memory"; - if register_cache() fails, it's better to propagate its return value rather than using default EINVAL. Signed-off-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20211020143812.6403-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:40:54 -06:00
Coly Li	0a2b3e3635	bcache: reserve never used bits from bkey.high There are 3 bits in member high of struct bkey that are never used, and there is no plan to support them in future: - HEADER_SIZE, start at bit 58, length 2 bits - KEY_PINNED, start at bit 55, length 1 bit No kernel code or user space tool references or accesses the three bits. Therefore it is possible and feasible to reserve the valuable bits from bkey.high. They can be used in future for other purposes. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20211020143812.6403-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:40:54 -06:00
Ding Senjie	a307e2abfc	md: bcache: Fix spelling of 'acquire' acqurie -> acquire Signed-off-by: Ding Senjie <dingsenjie@yulong.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20211020143812.6403-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:40:54 -06:00
Stefan Haberland	a8e5d491df	s390/dasd: fix possibly missed path verification __dasd_device_check_path_events() calls the discipline path event handler. This handler can leave the 'to be verified pathmask' populated for an additional verification. There is a race window where the worker has finished before dasd_path_clear_all_verify() is called which resets the tbvpm. Due to this there could be outstanding path verifications missed. Fix by clearing the pathmasks before calling the handler and add them again in case of an error. Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Link: https://lore.kernel.org/r/20211020115124.1735254-8-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:10:42 -06:00
Stefan Haberland	9dffede011	s390/dasd: fix missing path conf_data after failed allocation dasd_eckd_path_available_action() does a memory allocation to store the per path configuration data permanently. In the unlikely case that this allocation fails there is no conf_data stored for the corresponding path. This is OK since this is not necessary for an operational path but some features like control unit initiated reconfiguration (CUIR) do not work. To fix this add the path to the 'to be verified pathmask' again and schedule the handler again. Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Link: https://lore.kernel.org/r/20211020115124.1735254-7-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:10:42 -06:00
Stefan Haberland	542e30ce8e	s390/dasd: summarize dasd configuration data in a separate structure Summarize the dasd configuration data in a separate structure so that functions that need temporary config data do not need to allocate the whole eckd_private structure. Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Link: https://lore.kernel.org/r/20211020115124.1735254-6-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:10:42 -06:00
Stefan Haberland	74e2f21102	s390/dasd: move dasd_eckd_read_fc_security dasd_eckd_read_conf is called multiple times during device setup but the fc_security feature needs to be read only once. So move it into the calling function. Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Link: https://lore.kernel.org/r/20211020115124.1735254-5-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:10:42 -06:00
Stefan Haberland	23596961b4	s390/dasd: split up dasd_eckd_read_conf Move the cabling check out of dasd_eckd_read_conf and split it up into separate functions to improve readability and re-use functions. Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Link: https://lore.kernel.org/r/20211020115124.1735254-4-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:10:41 -06:00
Heiko Carstens	10c78e53ee	s390/dasd: fix kernel doc comment Fix this: drivers/s390/block/dasd_ioctl.c:666: warning: Function parameter or member 'disk' not described in 'dasd_biodasdinfo' drivers/s390/block/dasd_ioctl.c:666: warning: Function parameter or member 'info' not described in 'dasd_biodasdinfo' Acked-by: Jan Höppner <hoeppner@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20211020115124.1735254-3-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:10:41 -06:00
Heiko Carstens	169bbdacaa	s390/dasd: handle request magic consistently as unsigned int Get rid of the rather odd casts to character pointer of the dasd_ccw_req magic member and simply use the unsigned int value unmodified everywhere. Acked-by: Jan Höppner <hoeppner@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20211020115124.1735254-2-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:10:41 -06:00
Ye Bin	0c98057be9	nbd: Fix use-after-free in pid_show I got the following issue: [ 263.886511] BUG: KASAN: use-after-free in pid_show+0x11f/0x13f [ 263.888359] Read of size 4 at addr ffff8880bf0648c0 by task cat/746 [ 263.890479] CPU: 0 PID: 746 Comm: cat Not tainted 4.19.90-dirty #140 [ 263.893162] Call Trace: [ 263.893509] dump_stack+0x108/0x15f [ 263.893999] print_address_description+0xa5/0x372 [ 263.894641] kasan_report.cold+0x236/0x2a8 [ 263.895696] __asan_report_load4_noabort+0x25/0x30 [ 263.896365] pid_show+0x11f/0x13f [ 263.897422] dev_attr_show+0x48/0x90 [ 263.898361] sysfs_kf_seq_show+0x24d/0x4b0 [ 263.899479] kernfs_seq_show+0x14e/0x1b0 [ 263.900029] seq_read+0x43f/0x1150 [ 263.900499] kernfs_fop_read+0xc7/0x5a0 [ 263.903764] vfs_read+0x113/0x350 [ 263.904231] ksys_read+0x103/0x270 [ 263.905230] __x64_sys_read+0x77/0xc0 [ 263.906284] do_syscall_64+0x106/0x360 [ 263.906797] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reproduce this issue as follows: 1. nbd-server 8000 /tmp/disk 2. nbd-client localhost 8000 /dev/nbd1 3. cat /sys/block/nbd1/pid This triggers the use-after-free in pid_show. The reason is that after step '2' the nbd-client process has already exited, so its task_struct has already been freed. To solve this issue, revert part of 6521d39a64b3's modification and remove the useless 'recv_task' member of nbd_device. Fixes: `6521d39a64` ("nbd: Remove variable 'pid'") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20211020073959.2679255-1-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-20 08:09:56 -06:00
Jens Axboe	a9a7e30fd9	nvme: don't memset() the normal read/write command This memset in the fast path costs a lot of cycles on my setup. Here's a top-of-profile of doing ~6.7M IOPS: + 5.90% io_uring [nvme] [k] nvme_queue_rq + 5.32% io_uring [nvme_core] [k] nvme_setup_cmd + 5.17% io_uring [kernel.vmlinux] [k] io_submit_sqes + 4.97% io_uring [kernel.vmlinux] [k] blkdev_direct_IO and a perf diff with this patch: 0.92% +4.40% [nvme_core] [k] nvme_setup_cmd reducing it from 5.3% to only 0.9%. This takes it from the 2nd most cycle consumer to something that's mostly irrelevant. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-19 12:41:09 -06:00
Jens Axboe	9c3d29296f	nvme: move command clear into the various setup helpers We don't have to worry about doing extra memsets by moving it outside the protection of RQF_DONTPREP, as nvme doesn't do partial completions. This is in preparation for making the read/write fast path not do a full memset of the command. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-19 12:40:51 -06:00
Michael Schmitz	86d46fdaa1	block: ataflop: fix breakage introduced at blk-mq refactoring Refactoring of the Atari floppy driver when converting to blk-mq has broken the state machine in not-so-subtle ways: finish_fdc() must be called when operations on the floppy device have completed. This is crucial in order to release the ST-DMA lock, which protects against concurrent access to the ST-DMA controller by other drivers (some DMA related, most just related to device register access - broken beyond compare, I know). When rewriting the driver's old do_request() function, the fact that finish_fdc() was called only when all queued requests had completed appears to have been overlooked. Instead, the new request function calls finish_fdc() immediately after the last request has been queued. finish_fdc() executes a dummy seek after most requests, and this overwrites the state machine's interrupt handler that was set up to wait for completion of the read/write request just prior. To make matters worse, finish_fdc() is called before device interrupts are re-enabled, making certain that the read/write interrupt is missed. Shifting the finish_fdc() call into the read/write request completion handler ensures the driver waits for the request to actually complete. With a queue depth of 2, we won't see long request sequences, so calling finish_fdc() unconditionally just adds a little overhead for the dummy seeks, and keeps the code simple. While we're at it, kill ataflop_commit_rqs() which does nothing but run finish_fdc() unconditionally, again likely wiping out an in-flight request. Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Fixes: `6ec3938cff` ("ataflop: convert to blk-mq") CC: linux-block@vger.kernel.org CC: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Link: https://lore.kernel.org/r/20211019061321.26425-1-schmitzmic@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-19 06:11:44 -06:00
Yu Kuai	8663b210f8	nbd: fix uaf in nbd_handle_reply() The problem is that nbd_handle_reply() might access a freed request: 1) At first, a normal io is submitted and completed with a scheduler: internal_tag = blk_mq_get_tag -> get tag from sched_tags blk_mq_rq_ctx_init sched_tags->rq[internal_tag] = sched_tag->static_rq[internal_tag] ... blk_mq_get_driver_tag __blk_mq_get_driver_tag -> get tag from tags tags->rq[tag] = sched_tag->static_rq[internal_tag] So both tags->rq[tag] and sched_tags->rq[internal_tag] point to the request sched_tags->static_rq[internal_tag], even after the io has finished. 2) The nbd server sends a reply with a random tag directly: recv_work nbd_handle_reply blk_mq_tag_to_rq(tags, tag) rq = tags->rq[tag] 3) If the sched_tags->static_rq is freed: blk_mq_sched_free_requests blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i) -> step 2) accesses rq before the rq mapping is cleared blk_mq_clear_rq_mapping(set, tags, hctx_idx); __free_pages() -> rq is freed here 4) Then nbd continues to use the freed request in nbd_handle_reply. Fix the problem by getting 'q_usage_counter' before blk_mq_tag_to_rq(), so the request is guaranteed not to be freed because 'q_usage_counter' is not zero. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210916141810.2325276-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-18 14:50:37 -06:00
Yu Kuai	3fe1db626a	nbd: partition nbd_read_stat() into nbd_read_reply() and nbd_handle_reply() Prepare to fix uaf in nbd_read_stat(), no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-7-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-18 14:50:37 -06:00
Yu Kuai	f52c0e0823	nbd: clean up return value checking of sock_xmit() Checking whether sock_xmit() returns 0 is useless because it never returns 0; document this in a comment and remove such checks. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-6-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-18 14:50:37 -06:00
Yu Kuai	0de2b7a4dd	nbd: don't start request if nbd_queue_rq() failed commit `6a468d5990` ("nbd: don't start req until after the dead connection logic") moved blk_mq_start_request() from nbd_queue_rq() to nbd_handle_cmd() to skip starting the request if the connection is dead. However, the request is still started in other error paths. Currently, blk_mq_end_request() is called immediately if nbd_queue_rq() fails, so starting the request in that situation is useless. So remove blk_mq_start_request() from the error paths in nbd_handle_cmd(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-5-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-18 14:50:37 -06:00
Yu Kuai	fcf3d633d8	nbd: check sock index in nbd_read_stat() The sock on which the client sends a request in nbd_send_cmd() and the sock on which it receives the reply in nbd_read_stat() should be the same. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-4-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-18 14:50:37 -06:00
Yu Kuai	07175cb1ba	nbd: make sure request completion won't concurrent commit `cddce01160` ("nbd: Aovid double completion of a request") tried to fix the case where nbd_clear_que() and recv_work() can complete a request concurrently. However, the problem still exists: t1 t2 t3 nbd_disconnect_and_put flush_workqueue recv_work blk_mq_complete_request blk_mq_complete_request_remote -> this is true WRITE_ONCE(rq->state, MQ_RQ_COMPLETE) blk_mq_raise_softirq blk_done_softirq blk_complete_reqs nbd_complete_rq blk_mq_end_request blk_mq_free_request WRITE_ONCE(rq->state, MQ_RQ_IDLE) nbd_clear_que blk_mq_tagset_busy_iter nbd_clear_req __blk_mq_free_request blk_mq_put_tag blk_mq_complete_request -> complete again There are three places where a request can be completed in nbd: recv_work(), nbd_clear_que() and nbd_xmit_timeout(). Since they all hold cmd->lock before completing the request, it's easy to avoid the problem by setting and checking a cmd flag. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-3-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-18 14:50:37 -06:00
Yu Kuai	4e6eef5dc2	nbd: don't handle response without a corresponding request message While handling a response message from the server, nbd_read_stat() will try to get the request by tag, and then complete the request. However, this is problematic if nbd hasn't sent a corresponding request message: t1 t2 submit_bio nbd_queue_rq blk_mq_start_request recv_work nbd_read_stat blk_mq_tag_to_rq blk_mq_complete_request nbd_send_cmd Thus add a new cmd flag 'NBD_CMD_INFLIGHT'; it will be set in nbd_send_cmd() and checked in nbd_read_stat(). Note that this patch can't fix the case where blk_mq_tag_to_rq() might return a freed request; that will be fixed in following patches. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-2-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-18 14:50:37 -06:00
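
The two nbd commits above (`4e6eef5dc2` and `07175cb1ba`) both rely on a per-command flag that is set on send and checked under cmd->lock before completion. The following is only a userspace sketch of that idea, not the driver code: `struct cmd`, `CMD_INFLIGHT`, `cmd_init()`, `cmd_send()` and `cmd_try_complete()` are hypothetical names, a pthread mutex stands in for cmd->lock, and a counter stands in for the real request completion.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical flag bit, modelled on 'NBD_CMD_INFLIGHT' from the commit message. */
#define CMD_INFLIGHT (1u << 0)

/* Hypothetical stand-in for the driver's per-request command structure. */
struct cmd {
    pthread_mutex_t lock;   /* plays the role of cmd->lock */
    unsigned int flags;     /* holds CMD_INFLIGHT */
    int completions;        /* how many times the request was completed */
};

static void cmd_init(struct cmd *c)
{
    pthread_mutex_init(&c->lock, NULL);
    c->flags = 0;
    c->completions = 0;
}

/* Send path: mark the command as in flight once the request has gone out. */
static void cmd_send(struct cmd *c)
{
    pthread_mutex_lock(&c->lock);
    c->flags |= CMD_INFLIGHT;
    pthread_mutex_unlock(&c->lock);
}

/*
 * Any completion path (reply received, queue cleared, timeout):
 * complete only if the command is still in flight, and clear the flag
 * under the lock so that a second caller sees it unset and backs off.
 * Returns true if this caller performed the completion.
 */
static bool cmd_try_complete(struct cmd *c)
{
    bool completed = false;

    pthread_mutex_lock(&c->lock);
    if (c->flags & CMD_INFLIGHT) {
        c->flags &= ~CMD_INFLIGHT;
        c->completions++;   /* stands in for completing the request */
        completed = true;
    }
    pthread_mutex_unlock(&c->lock);
    return completed;
}
```

In this sketch, a reply that arrives before `cmd_send()` is rejected because the flag is not yet set, and of several racing completers only the first one to take the lock completes the request. The use-after-free described in `8663b210f8` is a separate problem that this flag does not address.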


1044679 commits, all branches