Граф коммитов

2226 Коммитов

Автор SHA1 Сообщение Дата
Max Gurtovoy da3340e77e nvme: remove unneeded include from constants file
No usage of blkdev.h elements.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-05-16 08:06:59 +02:00
Max Gurtovoy ca2d89925a nvme: add missing status values to verbose logging
Log a few more path related status codes.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-05-16 08:06:58 +02:00
Keith Busch 52fde2c07d nvme: set dma alignment to dword
The nvme specification only requires qword alignment for segment
descriptors, and the driver already guarantees that. The spec has always
allowed user data to be dword aligned, which is what the queue's
attribute is for, so relax the alignment requirement to that value.

While we could allow byte alignment for some controllers when using
SGLs, we still need to support PRP, and that only allows dword.

Fixes: 3b2a1ebceb ("nvme: set dma alignment to qword")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-05-16 08:06:58 +02:00
Tom Yan 1a86924e4f nvme: fix interpretation of DMRSL
DMRSLl is in the unit of logical blocks, while max_discard_sectors is
in the unit of "linux sector".

Signed-off-by: Tom Yan <tom.ty89@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-05-16 08:06:54 +02:00
Anuj Gupta f569add471 nvme: add vectored-io support for uring-cmd
wire up support for async passthru that takes an array of buffers (using
iovec). Exposed via a new op NVME_URING_CMD_IO_VEC. Same 'struct
nvme_uring_cmd' is to be used with -

1. cmd.addr as base address of user iovec array
2. cmd.data_len as count of iovec array elements

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-6-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-11 07:41:19 -06:00
Kanchan Joshi 456cba386e nvme: wire-up uring-cmd support for io-passthru on char-device.
Introduce handler for fops->uring_cmd(), implementing async passthru
on char device (/dev/ngX). The handler supports newly introduced
operation NVME_URING_CMD_IO. This operates on a new structure
nvme_uring_cmd, which is similar to struct nvme_passthru_cmd64 but
without the embedded 8b result field. This field is not needed since
uring-cmd allows to return additional result via big-CQE.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-5-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-11 07:41:13 -06:00
Christoph Hellwig bcad2565b5 nvme: refactor nvme_submit_user_cmd()
Divide the work into two helpers, namely nvme_alloc_user_request and
nvme_execute_user_rq. This is a prep patch, to help wiring up
uring-cmd support in nvme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: fold in fix for assuming bio is non-NULL]
Link: https://lore.kernel.org/r/20220511054750.20432-4-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-11 07:41:05 -06:00
Arnd Bergmann 7ad7ab9036 nvme-apple: fix sparse endianess warnings
The new nvme-apple driver is missing a few conversions to and
from little-endian data:

drivers/nvme/host/apple.c:291:19: sparse: sparse: incorrect type in assignment (different base types) @@     expected unsigned long long [usertype] prp1 @@     got restricted __le64 [usertype] prp1 @@
drivers/nvme/host/apple.c:291:19: sparse:     expected unsigned long long [usertype] prp1
drivers/nvme/host/apple.c:291:19: sparse:     got restricted __le64 [usertype] prp1
drivers/nvme/host/apple.c:292:19: sparse: sparse: incorrect type in assignment (different base types) @@     expected unsigned long long [usertype] prp2 @@     got restricted __le64 [usertype] prp2 @@
drivers/nvme/host/apple.c:293:21: sparse: sparse: incorrect type in assignment (different base types) @@     expected unsigned int [usertype] length @@     got restricted __le16 [usertype] length @@
drivers/nvme/host/apple.c:351:52: sparse: sparse: incorrect type in initializer (different base types) @@     expected unsigned int [usertype] next_dma_addr @@     got restricted __le64 [usertype] @@
drivers/nvme/host/apple.c:456:45: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted __le64 [usertype] @@     got unsigned int [addressable] [usertype] prp_dma @@
drivers/nvme/host/apple.c:459:31: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted __le64 [usertype] @@     got unsigned long long [assigned] [usertype] dma_addr @@
drivers/nvme/host/apple.c:474:25: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted __le64 [usertype] prp1 @@     got unsigned int [usertype] dma_address @@
drivers/nvme/host/apple.c:475:25: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted __le64 [usertype] prp2 @@     got unsigned int [usertype] first_dma @@

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-05-06 21:06:25 +02:00
Arnd Bergmann bdc33dad34 Apple SoC NVMe driver and dependencies:
- RTKit IPC library required to boot and communicate with
     co-processors embedded inside Apple SoCs
   - SART DMA address filter required to allow some DMA transactions for
     the NVMe co-processor
   - NVMe platform driver
 
 The following minor changes since v3 on the mailing list have been
 folded in:
 
   - sart: %llx -> %pa for a phys_addr_t
   - rtkit:/sart: Drop IS_ENABLED inside headers
   - rtkit: Use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL
   - nvme: Set NVME_REQ_CANCELLED in the timeout handler
   - nvme: Use DEFINE_SIMPLE_DEV_PM_OPS instead of #ifdef CONFIG_PM_SLEEP
 -----BEGIN PGP SIGNATURE-----
 
 iIkEABYKADEWIQRUAfN3OZ14tapALxsd6alOrecRcQUCYm/5lhMcc3ZlbkBzdmVu
 cGV0ZXIuZGV2AAoJEB3pqU6t5xFxITsBAI02bdBvHQAFouu1WzhLzlAPNqjGv3Th
 YFOJL4y2oyigAP9zkIGP0e1JcpApPxvlmQnM340ZY+TCth+ahoCWlS4VAg==
 =b0QX
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmJ0IIIACgkQmmx57+YA
 GNnxlg/+M0d7awpSyiN2g2aoPN9PzaDiWgjSDJ30THfY+JodFmK/Uw1L2dO32+md
 0UMuMH9Vtiy3NXf/dDoLU+kiNQgT1ZY7Iw2220GZyHxRUC/7RZ/zGNvAb1FclDhK
 YEYtlA2fK1MvlhEcriD8ytP78R80UA7Cx4o5T8CkgfN2OtjLip72aMCFtzRfxMk6
 W0GQvkTpDtp3qn59Wlz/JEJKMT+q0txxq1QkPW93Su190S/7Lk/nd+CkmFDHx54n
 2LFa33JuZZfNbzPMN9emTl7G5Z7Y241zYBMHvtAzd3WdaDZ6I1apeVnDXrNRhysC
 4B8si4qbo9NORhWR96okll0MCiBNZ1UfvS4U+/8A1b0CHwX5rEUKLNOXIfO8xApi
 L3HvVUpYPJw8w3aK2SKGVt2aQcnbluy9+LJQgHByPFbeIV80OjWoR4xSSZi2XErl
 tp4vmyUBTR8g85EnCqcc8OWCdl+yMjX+NmftLk/XB7/xD+HtqKocqizOVho/9u+K
 rv6Ivx6JHk33LFG/jJkN6q8OncgfSRirg0oy6IF/dy2xS2U3OtfaIXHIy9NosC36
 B3bm8mb3CVQJxazMYl3rstBGQoN1YpuwOsHRUWiWG1jDnXFwCSck9b5OH+y64PME
 45V27Rx1ftTgrOWgsllIqsviWADO4Qj7n5YXYvC+hGrBeDnM8pQ=
 =gi38
 -----END PGP SIGNATURE-----

Merge tag 'asahi-soc-rtkit-sart-nvme-for-5.19' of https://github.com/AsahiLinux/linux into arm/drivers

Apple SoC NVMe driver and dependencies:

  - RTKit IPC library required to boot and communicate with
    co-processors embedded inside Apple SoCs
  - SART DMA address filter required to allow some DMA transactions for
    the NVMe co-processor
  - NVMe platform driver

The following minor changes since v3 on the mailing list have been
folded in:

  - sart: %llx -> %pa for a phys_addr_t
  - rtkit:/sart: Drop IS_ENABLED inside headers
  - rtkit: Use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL
  - nvme: Set NVME_REQ_CANCELLED in the timeout handler
  - nvme: Use DEFINE_SIMPLE_DEV_PM_OPS instead of #ifdef CONFIG_PM_SLEEP

* tag 'asahi-soc-rtkit-sart-nvme-for-5.19' of https://github.com/AsahiLinux/linux:
  nvme-apple: Add initial Apple SoC NVMe driver
  dt-bindings: nvme: Add Apple ANS NVMe
  soc: apple: Add SART driver
  dt-bindings: iommu: Add Apple SART DMA address filter
  soc: apple: Add RTKit IPC library
  soc: apple: Always include Makefile

Link: https://lore.kernel.org/r/20220505154020.84638-1-sven@svenpeter.dev
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-05-05 21:07:46 +02:00
Christoph Hellwig 4e7f0ece41 nvme: remove a spurious clear of discard_alignment
The nvme driver never sets a discard_alignment, so it also doens't need
to clear it to zero.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220418045314.360785-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-03 10:38:50 -06:00
Christoph Hellwig c814153c83 nvme-fc: fold t fc_update_appid into fc_appid_store
No need for this wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Christoph Hellwig 55d7baa371 nvme-fc: don't support the appid attribute without CONFIG_BLK_CGROUP_FC_APPID
nvme-fc appid support needs CONFIG_BLK_CGROUP_FC_APPID to work, so disable
the whole code if the option is not set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 14:06:20 -06:00
Sven Peter 5bd2927ace nvme-apple: Add initial Apple SoC NVMe driver
Apple SoCs such as the M1 come with an embedded NVMe controller that
is not attached to any PCIe bus. Additionally, it doesn't conform
to the NVMe specification and requires a bunch of changes to command
submission and IOMMU configuration to work.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Sven Peter <sven@svenpeter.dev>
2022-05-02 17:24:45 +02:00
Christoph Hellwig 70200574cc block: remove QUEUE_FLAG_DISCARD
Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.

The only places where needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Christoph Hellwig 66dd346b84 nvme-pci: disable namespace identifiers for Qemu controllers
Qemu unconditionally reports a UUID, which depending on the qemu version
is either all-null (which is incorrect but harmless) or contains a single
bit set for all controllers.  In addition it can also optionally report
a eui64 which needs to be manually set.  Disable namespace identifiers
for Qemu controlles entirely even if in some cases they could be set
correctly through manual intervention.

Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-04-15 06:56:17 +02:00
Christoph Hellwig a98a945b80 nvme-pci: disable namespace identifiers for the MAXIO MAP1002/1202
The MAXIO MAP1002/1202 controllers reports completely bogus Namespace
identifiers that even change after suspend cycles.  Disable using
the Identifiers entirely.

Reported-by: 金韬 <me@kingtous.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: 金韬 <me@kingtous.cn>
2022-04-15 06:56:17 +02:00
Christoph Hellwig 00ff400e6d nvme: add a quirk to disable namespace identifiers
Add a quirk to disable using and exporting namespace identifiers for
controllers where they are broken beyond repair.

The most directly visible problem with non-unique namespace identifiers
is that they break the /dev/disk/by-id/ links, with the link for a
supposedly unique identifier now pointing to one of multiple possible
namespaces that share the same ID, and a somewhat random selection of
which one actually shows up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-04-15 06:56:17 +02:00
Chaitanya Kulkarni b42b6f4485 nvme: don't print verbose errors for internal passthrough requests
Use the RQF_QUIET flag to skip the newly added verbose error reporting,
and set the flag in __nvme_submit_sync_cmd, which is used for most
internal passthrough requests where we do expect errors (e.g. due to
probing for optional functionality).  This is similar to what the SCSI
verbose error logging does.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Alan Adamson <alan.adamson@oracle.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-04-15 06:56:14 +02:00
Jason Gunthorpe e945c653c8 RDMA: Split kernel-only global device caps from uverbs device caps
Split out flags from ib_device::device_cap_flags that are only used
internally to the kernel into kernel_cap_flags that is not part of the
uapi. This limits the device_cap_flags to being the same bitmap that will
be copied to userspace.

This cleanly splits out the uverbs flags from the kernel flags to avoid
confusion in the flags bitmap.

Add some short comments describing which each of the kernel flags is
connected to. Remove unused kernel flags.

Link: https://lore.kernel.org/r/0-v2-22c19e565eef+139a-kern_caps_jgg@nvidia.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-04-06 15:02:13 -03:00
Linus Torvalds 8467b0ed6c for-5.18/drivers-2022-04-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmJHUgMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgMOD/9J9mO8CQ3THnyJeZb8hy+k7fPHw+P3OrAR
 umOea3ujoMqzJsd/aRenMMAHsr7Phnb5PmljvryOo59nvwxOZ5MIBzSf2H4qJ8U2
 B4jGwESVW4OFNS6Mu+lgYH7XMyDHvqCSVdIhcnqkseoFyndpnTfsu4cCphqajVaP
 gOmXLBSQAetULxMfbqm7ofKk8F7zA80LFbwVs1VWVnCMeLVDccmJJbfn97jDZaJc
 rl8xmcvmarYLOTxDoOSdmfp4ek7QzRKQuKDlfvn1Xi+lDtkKAnYygMHhqQ0WYYc1
 /jJEd3iLCeV0jYfsDpVq6n2KRAGPCtrP0HkujifMGtuL5N2MAn/Aq3Fqoztas1yK
 p3T3SBIBVeznTtOXxl4Fm8tvim3i3rxn2vPdYnm/8uuNxqCQy78gVf5bPlLY+bzT
 4ytrP7AUpyzFj5E+8F33mZd0Vj2AL1kvgjbfEWqyQdXu7zs98UJL3xWicLbrvt/E
 nmdlZjOOBbEgV3vGQD5wTRvlsBJswtl4mHBpYYLzZtmDr8wZxHD3DCUuM1i4xyHG
 qDxNTKME2KfCXA8DIlcPxOAnNeXtwW3J7KTDuPDwC6XW84hFiC2tkM5M8aWqRos+
 GiRczSArhaomaGq8W+vRqgCnPQgFAIt+oyJ9aqen0xHG8qCOTv3N6mMTJk/uBnMI
 qcfpguHp+g==
 =0Axq
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/drivers-2022-04-01' of git://git.kernel.dk/linux-block

Pull block driver fixes from Jens Axboe:
 "Followup block driver updates and fixes for the 5.18-rc1 merge window.
  In detail:

   - NVMe pull request
       - Fix multipath hang when disk goes live over reconnect (Anton
         Eidelman)
       - fix RCU hole that allowed for endless looping in multipath
         round robin (Chris Leech)
       - remove redundant assignment after left shift (Colin Ian King)
       - add quirks for Samsung X5 SSDs (Monish Kumar R)
       - fix the read-only state for zoned namespaces with unsupposed
         features (Pankaj Raghav)
       - use a private workqueue instead of the system workqueue in
         nvmet (Sagi Grimberg)
       - allow duplicate NSIDs for private namespaces (Sungup Moon)
       - expose use_threaded_interrupts read-only in sysfs (Xin Hao)"

   - nbd minor allocation fix (Zhang)

   - drbd fixes and maintainer addition (Lars, Jakob, Christoph)

   - n64cart build fix (Jackie)

   - loop compat ioctl fix (Carlos)

   - misc fixes (Colin, Dongli)"

* tag 'for-5.18/drivers-2022-04-01' of git://git.kernel.dk/linux-block:
  drbd: remove check of list iterator against head past the loop body
  drbd: remove usage of list iterator variable after loop
  nbd: fix possible overflow on 'first_minor' in nbd_dev_add()
  MAINTAINERS: add drbd co-maintainer
  drbd: fix potential silent data corruption
  loop: fix ioctl calls using compat_loop_info
  nvme-multipath: fix hang when disk goes live over reconnect
  nvme: fix RCU hole that allowed for endless looping in multipath round robin
  nvme: allow duplicate NSIDs for private namespaces
  nvmet: remove redundant assignment after left shift
  nvmet: use a private workqueue instead of the system workqueue
  nvme-pci: add quirks for Samsung X5 SSDs
  nvme-pci: expose use_threaded_interrupts read-only in sysfs
  nvme: fix the read-only state for zoned namespaces with unsupposed features
  n64cart: convert bi_disk to bi_bdev->bd_disk fix build
  xen/blkfront: fix comment for need_copy
  xen-blkback: remove redundant assignment to variable i
2022-04-01 16:26:57 -07:00
Anton Eidelman a4a6f3c8f6 nvme-multipath: fix hang when disk goes live over reconnect
nvme_mpath_init_identify() invoked from nvme_init_identify() fetches a
fresh ANA log from the ctrl.  This is essential to have an up to date
path states for both existing namespaces and for those scan_work may
discover once the ctrl is up.

This happens in the following cases:
  1) A new ctrl is being connected.
  2) An existing ctrl is successfully reconnected.
  3) An existing ctrl is being reset.

While in (1) ctrl->namespaces is empty, (2 & 3) may have namespaces, and
nvme_read_ana_log() may call nvme_update_ns_ana_state().

This result in a hang when the ANA state of an existing namespace changes
and makes the disk live: nvme_mpath_set_live() issues IO to the namespace
through the ctrl, which does NOT have IO queues yet.

See sample hang below.

Solution:
- nvme_update_ns_ana_state() to call set_live only if ctrl is live
- nvme_read_ana_log() call from nvme_mpath_init_identify()
  therefore only fetches and parses the ANA log;
  any erros in this process will fail the ctrl setup as appropriate;
- a separate function nvme_mpath_update()
  is called in nvme_start_ctrl();
  this parses the ANA log without fetching it.
  At this point the ctrl is live,
  therefore, disks can be set live normally.

Sample failure:
    nvme nvme0: starting error recovery
    nvme nvme0: Reconnecting in 10 seconds...
    block nvme0n6: no usable path - requeuing I/O
    INFO: task kworker/u8:3:312 blocked for more than 122 seconds.
          Tainted: G            E     5.14.5-1.el7.elrepo.x86_64 #1
    Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
    Call Trace:
     __schedule+0x2a2/0x7e0
     schedule+0x4e/0xb0
     io_schedule+0x16/0x40
     wait_on_page_bit_common+0x15c/0x3e0
     do_read_cache_page+0x1e0/0x410
     read_cache_page+0x12/0x20
     read_part_sector+0x46/0x100
     read_lba+0x121/0x240
     efi_partition+0x1d2/0x6a0
     bdev_disk_changed.part.0+0x1df/0x430
     bdev_disk_changed+0x18/0x20
     blkdev_get_whole+0x77/0xe0
     blkdev_get_by_dev+0xd2/0x3a0
     __device_add_disk+0x1ed/0x310
     device_add_disk+0x13/0x20
     nvme_mpath_set_live+0x138/0x1b0 [nvme_core]
     nvme_update_ns_ana_state+0x2b/0x30 [nvme_core]
     nvme_update_ana_state+0xca/0xe0 [nvme_core]
     nvme_parse_ana_log+0xac/0x170 [nvme_core]
     nvme_read_ana_log+0x7d/0xe0 [nvme_core]
     nvme_mpath_init_identify+0x105/0x150 [nvme_core]
     nvme_init_identify+0x2df/0x4d0 [nvme_core]
     nvme_init_ctrl_finish+0x8d/0x3b0 [nvme_core]
     nvme_tcp_setup_ctrl+0x337/0x390 [nvme_tcp]
     nvme_tcp_reconnect_ctrl_work+0x24/0x40 [nvme_tcp]
     process_one_work+0x1bd/0x360
     worker_thread+0x50/0x3d0

Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-29 09:29:06 +02:00
Chris Leech d6d6742772 nvme: fix RCU hole that allowed for endless looping in multipath round robin
Make nvme_ns_remove match the assumptions elsewhere.

1) !NVME_NS_READY needs to be srcu synchronized to make sure nothing is
   running in __nvme_find_path or nvme_round_robin_path that will
   re-assign this ns to current_path.

2) Any matching current_path entries need to be cleared before removing
   from the siblings list, to prevent calling nvme_round_robin_path with
   an "old" ns that's off list.

3) Finally the list_del_rcu can happen, and then synchronize again
   before releasing any reference counts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-29 09:29:06 +02:00
Sungup Moon 5974ea7ce0 nvme: allow duplicate NSIDs for private namespaces
A NVMe subsystem with multiple controller can have private namespaces
that use the same NSID under some conditions:

 "If Namespace Management, ANA Reporting, or NVM Sets are supported, the
  NSIDs shall be unique within the NVM subsystem. If the Namespace
  Management, ANA Reporting, and NVM Sets are not supported, then NSIDs:
   a) for shared namespace shall be unique; and
   b) for private namespace are not required to be unique."

Reference: Section 6.1.6 NSID and Namespace Usage; NVM Express 1.4c spec.

Make sure this specific setup is supported in Linux.

Fixes: 9ad1927a3b ("nvme: always search for namespace head")
Signed-off-by: Sungup Moon <sungup.moon@samsung.com>
[hch: refactored and fixed the controller vs subsystem based naming
      conflict]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-03-29 09:29:06 +02:00
Linus Torvalds 3f7282139f for-5.18/64bit-pi-2022-03-25
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI92rYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkAJD/9PvRN61YnNRjjAiHgslwMc2fy9lkxwYF4j
 +DYqFwnhHgiADO/3Y3wsqHxmDJrhq7vxHM3btxUzkKxg2mVoOI/Bm6rhqEPhNkok
 nlpMWHXR+9Jvl85IO5jHg9GHZ/PZfaDMn9naVXVpHVgycdJ06tr7T1tMtoAtsEzA
 atEkwpc+r8E2NlxkcTPAQhJzmkrHVdxgtWxlKL/RkmivmBXu3/fj2pLHYyPcvqm1
 8LxDn1DIoUHlpce10Qf7r+hf1sXiKNv+nltl9aWxdoSOM8OYHjQcp4K1qe+VYVzC
 XbXqg3ZWaGKSnieyawN2yXtFkZSzgyCy+TCTHnf8NwGfgYYk86twh2clP5t6lE58
 /TC8CmrBHIy8+79BvpSlTh7LlGip0snY3IVbZhR5EHJV3nDVtg/vdDwiSSQ6VdCM
 FM3tkY7KvZDb42IvKzD/NKmAzKv/XMri1MmQB2f/VvbwN3OK5EQOJT1DYFdiohUQ
 1YIb81HiGvlogB783HFXXAcHu/qQNZGDK4EDjNFHThPtmYqtLuOixIo0KG6BJnuV
 sl/YhtDSe3FRnvcDZ4xki9CpBqHFG7vK85H05NXXdC1ddBdQ+N+yLS1/jONUlkGc
 vJphI6FPr+DcPX8o/QuapQpNfg+HXY/h4u83jFJ8VRAyraxSarZ/19at0DM2wdvR
 IhKlNfOHlA==
 =RAVX
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/64bit-pi-2022-03-25' of git://git.kernel.dk/linux-block

Pull block layer 64-bit data integrity support from Jens Axboe:
 "This adds support for 64-bit data integrity in the block layer and in
  NVMe"

* tag 'for-5.18/64bit-pi-2022-03-25' of git://git.kernel.dk/linux-block:
  crypto: fix crc64 testmgr digest byte order
  nvme: add support for enhanced metadata
  block: add pi for extended integrity
  crypto: add rocksoft 64b crc guard tag framework
  lib: add rocksoft model crc64
  linux/kernel: introduce lower_48_bits function
  asm-generic: introduce be48 unaligned accessors
  nvme: allow integrity on extended metadata formats
  block: support pi with extended metadata
2022-03-26 12:01:35 -07:00
Linus Torvalds 561593a048 for-5.18/write-streams-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI1AHwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplPjEACVJzKg5NkxpdkDThvq5tejws9KxB/4mHJg
 NoDMcv1TF+Orsd/HNW6XrgYnbU0ObHom3568/xb8BNegRVFe7V4ME/4IYNRyGOmV
 qbfciu04L1UkJhI52CIidkOioFABL3r1zgLCIz5vk0Cv9X7Le9x0UabHxJf7u9S+
 Z3lNdyxezN0SGx8VT86l/7lSoHtG3VHO9IsQCuNGF02SB+6uGpXBlptbEoQ4nTxd
 T7/H9FNOe2Wf7eKvcOOds8UlvZYAfYcY0GcRrIOXdHIy25mKFWwn5cDgFTMOH5ID
 xXpm+JFkDkrfSW1o4FFPxbN9Z6RbVXbGCsrXlIragLO2MJQdXiIUxS1OPT5oAado
 H9MlX6QtkwziLW9zUWa/N/jmRjc2vzHAxD6JFg/wXxNdtY0kd8TQpaxwTB8mVDPN
 VCGutt7lJS1CQInQ+ppzbdqzzuLHC1RHAyWSmfUE9rb8cbjxtJBnSIorYRLUesMT
 GRwqVTXW0osxSgCb1iDiBCJANrX1yPZcemv4Wh1gzbT6IE9sWxWXsE5sy9KvswNc
 M+E4nu/TYYTfkynItJjLgmDLOoi+V0FBY6ba0mRPBjkriSP4AVlwsZLGVsAHQzuA
 o5paW1GjRCCwhIQ6+AzZIoOz6wqvprBlUgUkUneyYAQ2ZKC3pZi8zPnpoVdFucVa
 VaTzP71C1Q==
 =efaq
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block

Pull NVMe write streams removal from Jens Axboe:
 "This removes the write streams support in NVMe. No vendor ever really
  shipped working support for this, and they are not interested in
  supporting it.

  With the NVMe support gone, we have nothing in the tree that supports
  this. Remove passing around of the hints.

  The only discussion point in this patchset imho is the fact that the
  file specific write hint setting/getting fcntl helpers will now return
  -1/EINVAL like they did before we supported write hints. No known
  applications use these functions, I only know of one prototype that I
  help do for RocksDB, and that's not used. That said, with a change
  like this, it's always a bit controversial. Alternatively, we could
  just make them return 0 and pretend it worked. It's placement based
  hints after all"

* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
  fs: remove fs.f_write_hint
  fs: remove kiocb.ki_hint
  block: remove the per-bio/request write hint
  nvme: remove support or stream based temperature hint
2022-03-26 11:51:46 -07:00
Monish Kumar R bc360b0b16 nvme-pci: add quirks for Samsung X5 SSDs
Add quirks to not fail the initialization and to have quick resume
latency after cold/warm reboot.

Signed-off-by: Monish Kumar R <monish.kumar.r@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-23 09:19:32 +01:00
Xin Hao 2e21e4454b nvme-pci: expose use_threaded_interrupts read-only in sysfs
Allow reading /sys/module/nvme/parameters/use_threaded_interrupts to see
if the use_threaded_interrupts module parameter is in use.

Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-23 09:19:17 +01:00
Pankaj Raghav 726be2c72e nvme: fix the read-only state for zoned namespaces with unsupposed features
commit 2f4c9ba23b ("nvme: export zoned namespaces without Zone Append
support read-only") marks zoned namespaces without append support
read-only.  It does iso by setting NVME_NS_FORCE_RO in ns->flags in
nvme_update_zone_info and checking for that flag later in
nvme_update_disk_info to mark the disk as read-only.

But commit 73d90386b5 ("nvme: cleanup zone information initialization")
rearranged nvme_update_disk_info to be called before
nvme_update_zone_info and thus not marking the disk as read-only.
The call order cannot be just reverted because nvme_update_zone_info sets
certain queue parameters such as zone_write_granularity that depend on the
prior call to nvme_update_disk_info.

Remove the call to set_disk_ro in nvme_update_disk_info. and call
set_disk_ro after nvme_update_zone_info and nvme_update_disk_info to set
the permission for ZNS drives correctly. The same applies to the
multipath disk path.

Fixes: 73d90386b5 ("nvme: cleanup zone information initialization")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-23 09:19:17 +01:00
Linus Torvalds 9030fb0bb9 Folio changes for 5.18
- Rewrite how munlock works to massively reduce the contention
    on i_mmap_rwsem (Hugh Dickins):
    https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
  - Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig):
    https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
  - Convert GUP to use folios and make pincount available for order-1
    pages. (Matthew Wilcox)
  - Convert a few more truncation functions to use folios (Matthew Wilcox)
  - Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox)
  - Convert rmap_walk to use folios (Matthew Wilcox)
  - Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
  - Add support for creating large folios in readahead (Matthew Wilcox)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4ucgACgkQDpNsjXcp
 gj69Wgf6AwqwmO5Tmy+fLScDPqWxmXJofbocae1kyoGHf7Ui91OK4U2j6IpvAr+g
 P/vLIK+JAAcTQcrSCjymuEkf4HkGZOR03QQn7maPIEe4eLrZRQDEsmHC1L9gpeJp
 s/GMvDWiGE0Tnxu0EOzfVi/yT+qjIl/S8VvqtCoJv1HdzxitZ7+1RDuqImaMC5MM
 Qi3uHag78vLmCltLXpIOdpgZhdZexCdL2Y/1npf+b6FVkAJRRNUnA0gRbS7YpoVp
 CbxEJcmAl9cpJLuj5i5kIfS9trr+/QcvbUlzRxh4ggC58iqnmF2V09l2MJ7YU3XL
 v1O/Elq4lRhXninZFQEm9zjrri7LDQ==
 =n9Ad
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache

Pull folio updates from Matthew Wilcox:

 - Rewrite how munlock works to massively reduce the contention on
   i_mmap_rwsem (Hugh Dickins):

     https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/

 - Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
   Hellwig):

     https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/

 - Convert GUP to use folios and make pincount available for order-1
   pages. (Matthew Wilcox)

 - Convert a few more truncation functions to use folios (Matthew
   Wilcox)

 - Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
   Wilcox)

 - Convert rmap_walk to use folios (Matthew Wilcox)

 - Convert most of shrink_page_list() to use a folio (Matthew Wilcox)

 - Add support for creating large folios in readahead (Matthew Wilcox)

* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
  mm/damon: minor cleanup for damon_pa_young
  selftests/vm/transhuge-stress: Support file-backed PMD folios
  mm/filemap: Support VM_HUGEPAGE for file mappings
  mm/readahead: Switch to page_cache_ra_order
  mm/readahead: Align file mappings for non-DAX
  mm/readahead: Add large folio readahead
  mm: Support arbitrary THP sizes
  mm: Make large folios depend on THP
  mm: Fix READ_ONLY_THP warning
  mm/filemap: Allow large folios to be added to the page cache
  mm: Turn can_split_huge_page() into can_split_folio()
  mm/vmscan: Convert pageout() to take a folio
  mm/vmscan: Turn page_check_references() into folio_check_references()
  mm/vmscan: Account large folios correctly
  mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
  mm/vmscan: Free non-shmem folios without splitting them
  mm/rmap: Constify the rmap_walk_control argument
  mm/rmap: Convert rmap_walk() to take a folio
  mm: Turn page_anon_vma() into folio_anon_vma()
  mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
  ...
2022-03-22 17:03:12 -07:00
Linus Torvalds 69d1dea852 for-5.18/drivers-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI0/QUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpn8GEACRVxJaJV5qjZfoFAQKoAWJEtquwjeARyB+
 0V8ROWHDWHSacdug9wBytayiS1lz2zmUHJ6YXyts2dn0v6CrK4s8yGzk5G/RgH6+
 6M3GmBKjj+r1DfE8L3OoQWkDR1JFPuFxXTG/uBd7fBY2Excih1Z0D2lpspMleIRf
 w8zBrlWrWH8lZlm6HF3fadjEoiWhOM5F4Ofz3eg/PAQrHuD06z8hjQgMeR0jQVzw
 bWF9jrdNIplxRjNWIwCTsQRM+z5KQhUGwDODJjIwdQtVaKSt9D99ZbeKTudlslQ2
 zrizsCq8P1RjBPcrA45FV6QnT9DIRRGrYzHD63qC6fDae34rbzdSHUwRMP2XSxo8
 +hT1AzGypiBauODTPzHFtTskaQ0KibLznEanChh/ThySmNYcEVAljSx3Z5Vo81J+
 IqJYK2m3RESCFruy9w3U/P7qiXZmqYldPfjxAKq8ucg6x1PU3XRAVm7SI/i4l75D
 Crk1ujj2LJgsyxL6qMrK3XUavl1SJdzWeFSarcCt3m4m11EWWfYzmG8Yn8OE2CEZ
 a2CAyDsRi8CZ3hvkaMwigL4wBJjrrig8vyIgok3VrfCmYlNNqMQqM5Rw7vzjR3v1
 cKewI3rQjkFXEaveIXyGPTI/0Da4cT0DOfn/Mws9MDUXNPlFMNEDUZkPuzMywiTB
 2SWDLTe77g==
 =993h
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/drivers-2022-03-18' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - NVMe updates via Christoph:
      - add vectored-io support for user-passthrough (Kanchan Joshi)
      - add verbose error logging (Alan Adamson)
      - support buffered I/O on block devices in nvmet (Chaitanya
        Kulkarni)
      - central discovery controller support (Martin Belanger)
      - fix and extended the globally unique idenfier validation
        (Christoph)
      - move away from the deprecated IDA APIs (Sagi Grimberg)
      - misc code cleanup (Keith Busch, Max Gurtovoy, Qinghua Jin,
        Chaitanya Kulkarni)
      - add lockdep annotations for in-kernel sockets (Chris Leech)
      - use vmalloc for ANA log buffer (Hannes Reinecke)
      - kerneldoc fixes (Chaitanya Kulkarni)
      - cleanups (Guoqing Jiang, Chaitanya Kulkarni, Christoph)
      - warn about shared namespaces without multipathing (Christoph)

 - MD updates via Song with a set of cleanups (Christoph, Mariusz, Paul,
   Erik, Dirk)

 - loop cleanups and queue depth configuration (Chaitanya)

 - null_blk cleanups and fixes (Chaitanya)

 - Use descriptive init/exit names in virtio_blk (Randy)

 - Use bvec_kmap_local() in drivers (Christoph)

 - bcache fixes (Mingzhe)

 - xen blk-front persistent grant speedups (Juergen)

 - rnbd fix and cleanup (Gioh)

 - Misc fixes (Christophe, Colin)

* tag 'for-5.18/drivers-2022-03-18' of git://git.kernel.dk/linux-block: (76 commits)
  virtio_blk: eliminate anonymous module_init & module_exit
  nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH
  nvme: remove nvme_alloc_request and nvme_alloc_request_qid
  nvme: cleanup how disk->disk_name is assigned
  nvmet: move the call to nvmet_ns_changed out of nvmet_ns_revalidate
  nvmet: use snprintf() with PAGE_SIZE in configfs
  nvmet: don't fold lines
  nvmet-rdma: fix kernel-doc warning for nvmet_rdma_device_removal
  nvmet-fc: fix kernel-doc warning for nvmet_fc_unregister_targetport
  nvmet-fc: fix kernel-doc warning for nvmet_fc_register_targetport
  nvme-tcp: lockdep: annotate in-kernel sockets
  nvme-tcp: don't fold the line
  nvme-tcp: don't initialize ret variable
  nvme-multipath: call bio_io_error in nvme_ns_head_submit_bio
  nvme-multipath: use vmalloc for ANA log buffer
  xen/blkfront: speed up purge_persistent_grants()
  raid5: initialize the stripe_head embeeded bios as needed
  raid5-cache: statically allocate the recovery ra bio
  raid5-cache: fully initialize flush_bio when needed
  raid5-ppl: fully initialize the bio in ppl_new_iounit
  ...
2022-03-21 17:16:01 -07:00
Christoph Hellwig ce8d78616a nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH
Start warning about exposing a namespace as multiple block devices,
and set a fixed deprecation release.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2022-03-16 16:48:00 +01:00
Christoph Hellwig e559398f47 nvme: remove nvme_alloc_request and nvme_alloc_request_qid
Just open code the allocation + initialization in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-03-16 09:47:05 +01:00
Christoph Hellwig b739e13705 nvme: cleanup how disk->disk_name is assigned
They way how assigning the disk name and commenting on why it is done
is split over core.c and multipath.c seems to be rather confusing.

Now that ns_head->disk always exists we can do all the work in core.c
and have a single big comment explaining the issues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-03-16 09:47:01 +01:00
Chris Leech 841aee4d75 nvme-tcp: lockdep: annotate in-kernel sockets
Put NVMe/TCP sockets in their own class to avoid some lockdep warnings.
Sockets created by nvme-tcp are not exposed to user-space, and will not
trigger certain code paths that the general socket API exposes.

Lockdep complains about a circular dependency between the socket and
filesystem locks, because setsockopt can trigger a page fault with a
socket lock held, but nvme-tcp sends requests on the socket while file
system locks are held.

  ======================================================
  WARNING: possible circular locking dependency detected
  5.15.0-rc3 #1 Not tainted
  ------------------------------------------------------
  fio/1496 is trying to acquire lock:
  (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendpage+0x23/0x80

  but task is already holding lock:
  (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]

  which lock already depends on the new lock.

  other info that might help us debug this:

  chain exists of:
   sk_lock-AF_INET --> sb_internal --> &xfs_dir_ilock_class/5

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&xfs_dir_ilock_class/5);
                                lock(sb_internal);
                                lock(&xfs_dir_ilock_class/5);
   lock(sk_lock-AF_INET);

  *** DEADLOCK ***

  6 locks held by fio/1496:
   #0: (sb_writers#13){.+.+}-{0:0}, at: path_openat+0x9fc/0xa20
   #1: (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{3:3}, at: path_openat+0x296/0xa20
   #2: (sb_internal){.+.+}-{0:0}, at: xfs_trans_alloc_icreate+0x41/0xd0 [xfs]
   #3: (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
   #4: (hctx->srcu){....}-{0:0}, at: hctx_lock+0x51/0xd0
   #5: (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0x33e/0x380 [nvme_tcp]

This annotation lets lockdep analyze nvme-tcp controlled sockets
independently of what the user-space sockets API does.

Link: https://lore.kernel.org/linux-nvme/CAHj4cs9MDYLJ+q+2_GXUK9HxFizv2pxUryUR0toX974M040z7g@mail.gmail.com/

Signed-off-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-14 08:43:06 +01:00
Chaitanya Kulkarni a387935c24 nvme-tcp: don't fold the line
The call to nvme_tcp_alloc_queue() fits perfectly in one line without
exceeding 80 char limit for the line.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-14 08:43:05 +01:00
Chaitanya Kulkarni 462b8b2d84 nvme-tcp: don't initialize ret variable
No point in initializing ret variable to 0 in nvme_tcp_start_io_queue()
since it gets overwritten by a call to nvme_tcp_start_queue().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-14 08:43:05 +01:00
Guoqing Jiang 8f31dded50 nvme-multipath: call bio_io_error in nvme_ns_head_submit_bio
Use bio_io_error() here since bio_io_error does the same thing.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-14 08:43:05 +01:00
Hannes Reinecke 5e6a7d1eb8 nvme-multipath: use vmalloc for ANA log buffer
The ANA log buffer can get really large, as it depends on the
controller configuration. So to avoid an out-of-memory issue
during scanning use kvmalloc() instead of the kmalloc().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-14 08:43:05 +01:00
Keith Busch 4020aad85c nvme: add support for enhanced metadata
NVM Express ratified TP 4068 defines new protection information formats.
Implement support for the CRC64 guard tags.

Since the block layer doesn't support variable length reference tags,
driver support for the Storage Tag space is not supported at this time.

Cc: Hannes Reinecke <hare@suse.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Klaus Jensen <its@irrelevant.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220303201312.3255347-9-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07 12:49:13 -07:00
Keith Busch 84b735429f nvme: allow integrity on extended metadata formats
The block integrity subsystem knows how to construct protection
information buffers with metadata beyond the protection information
fields. Remove the driver restriction.

Note, this can only work if the PI field appears first in the metadata,
as the integrity subsystem doesn't calculate guard tags on preceding
metadata.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220303201312.3255347-3-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07 12:48:35 -07:00
Christoph Hellwig 85e6c77576 nvme: remove support or stream based temperature hint
This support was added for RocksDB, but RocksDB ended up not using it.
At the same time drives on the open marked (vs those build for OEMs
for non-Linux support) that actually support streams are extremly
rare.  Don't bloat the nvme driver for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20220304175556.407719-1-hch@lst.de
[axboe: fold in ctrl->nr_streams removal from Keith]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07 12:45:56 -07:00
Jens Axboe b46bebaf2a Merge branch 'for-5.18/drivers' into for-5.18/write-streams
* for-5.18/drivers: (51 commits)
  bcache: fixup multiple threads crash
  bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing
  floppy: use memcpy_{to,from}_bvec
  drbd: use bvec_kmap_local in recv_dless_read
  drbd: use bvec_kmap_local in drbd_csum_bio
  bcache: use bvec_kmap_local in bio_csum
  nvdimm-btt: use bvec_kmap_local in btt_rw_integrity
  nvdimm-blk: use bvec_kmap_local in nd_blk_rw_integrity
  zram: use memcpy_from_bvec in zram_bvec_write
  zram: use memcpy_to_bvec in zram_bvec_read
  aoe: use bvec_kmap_local in bvcpy
  iss-simdisk: use bvec_kmap_local in simdisk_submit_bio
  nvme: check that EUI/GUID/UUID are globally unique
  nvme: check for duplicate identifiers earlier
  nvme: fix the check for duplicate unique identifiers
  nvme: cleanup __nvme_check_ids
  nvme: remove nssa from struct nvme_ctrl
  nvme: explicitly set non-error for directives
  nvme: expose cntrltype and dctype through sysfs
  nvme: send uevent on connection up
  ...
2022-03-07 12:44:39 -07:00
Christoph Hellwig dc90f0846d mm: don't include <linux/memremap.h> in <linux/mm.h>
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.

Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03 12:47:33 -05:00
Christoph Hellwig 2079f41ec6 nvme: check that EUI/GUID/UUID are globally unique
Add a check to verify that the unique identifiers are unique globally
in addition to the existing check that verifies that they are unique
inside a single subsystem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-02-28 13:45:07 +02:00
Christoph Hellwig e2d77d2e11 nvme: check for duplicate identifiers earlier
Lift the check for duplicate identifiers into nvme_init_ns_head, which
avoids pointless error unwinding in case they don't match, and also
matches where we check identifier validity for the multipath case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-02-28 13:45:07 +02:00
Christoph Hellwig e2724cb9f0 nvme: fix the check for duplicate unique identifiers
nvme_subsys_check_duplicate_ids should needs to return an error if any of
the identifiers matches, not just if all of them match.  But it does not
need to and should not look at the CSI value for this sanity check.

Rewrite the logic to be separate from nvme_ns_ids_equal and optimize it
by reducing duplicate checks for non-present identifiers.

Fixes: ed754e5dee ("nvme: track shared namespaces")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-02-28 13:45:07 +02:00
Christoph Hellwig fd8099e791 nvme: cleanup __nvme_check_ids
Pass the actual nvme_ns_ids used for the comparison instead of the
ns_head that isn't needed and use a more descriptive function name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-02-28 13:45:06 +02:00
Keith Busch 0a9f850061 nvme: remove nssa from struct nvme_ctrl
The reported number of streams is not used outside the function that
gets it, so no need to stash it in the controller structure. Use a local
variable instead.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:06 +02:00
Keith Busch 1c3adf0de1 nvme: explicitly set non-error for directives
Stream directives is an optional feature. It is not an error if a
controller doesn't support as many as the kernel can optionally use.
Explicitly set the non-error return value on this condition with a
comment explaining why.

Note, the return value was already 0 in this condition, so the setting
is redundant. This patch should just silence bots that falsely believe
the condition contains an error omission.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:06 +02:00
Martin Belanger 86c2457a8e nvme: expose cntrltype and dctype through sysfs
TP8010 introduces the Discovery Controller Type attribute (dctype).
The dctype is returned in the response to the Identify command. This
patch exposes the dctype through the sysfs. Since the dctype depends on
the Controller Type (cntrltype), another attribute of the Identify
response, the patch also exposes the cntrltype as well. The dctype will
only be displayed for discovery controllers.

A note about the naming of this attribute:
Although TP8010 calls this attribute the Discovery Controller Type,
note that the dctype is now part of the response to the Identify
command for all controller types. I/O, Discovery, and Admin controllers
all share the same Identify response PDU structure. Non-discovery
controllers as well as pre-TP8010 discovery controllers will continue
to set this field to 0 (which has always been the default for reserved
bytes). Per TP8010, the value 0 now means "Discovery controller type is
not reported" instead of "Reserved". One could argue that this
definition is correct even for non-discovery controllers, and by
extension, exposing it in the sysfs for non-discovery controllers is
appropriate.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:06 +02:00
Martin Belanger 20d64911e7 nvme: send uevent on connection up
When connectivity with a controller is lost, the driver will keep
trying to reconnect once every 10 sec. When connection is restored,
user-space apps need to be informed so that they can take proper
action. For example, TP8010 introduces the DIM PDU, which is used to
register with a discovery controller (DC). The DIM PDU is sent from
user-space.  The DIM PDU must be sent every time a connection is
established with a DC. Therefore, the kernel must tell user-space apps
when connection is restored so that registration can happen.

The uevent sent is a "change" uevent with environmental data
set to: "NVME_EVENT=connected".

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:06 +02:00
Kanchan Joshi 89377bc197 nvme: add vectored-io support for user-passthrough
Add a new NVME_IOCTL_IO64_CMD_VEC ioctl that works like the existing
NVME_IOCTL_IO64_CMD ioctl except that it takes and array of iovecs
and thus supports vectored I/O.

  - cmd.addr is base address of user iovec array
  - cmd.vec_cnt is count of iovec array elements

This patch does not include vectored-variant for admin-commands as most
of them are light on buffers and likely to have low invocation frequency.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:06 +02:00
Alan Adamson bd83fe6f2c nvme: add verbose error logging
Improves logging of NVMe errors.  If NVME_VERBOSE_ERRORS is configured,
a verbose description of the error is logged, otherwise only status
codes/bits is logged.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
[kch]: fix several nits, cosmetics, and trim down code.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:06 +02:00
Chaitanya Kulkarni 72e8b5cd7d nvme: add a helper to initialize connect_q
Add and use helper to remove duplicate code for fabrics connect_q
initialization and error handling for all the transports.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:06 +02:00
Max Gurtovoy 4686af885a nvme-rdma: add helpers for mapping/unmapping request
Introduce nvme_rdma_dma_map_req/nvme_rdma_dma_unmap_req helper functions
to improve code readability and ease on the error flow.

Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:05 +02:00
Sagi Grimberg 3dd83f4013 nvme-fc: replace ida_simple[get|remove] with the simler ida_[alloc|free]
ida_simple_[get|remove] are wrappers anyways.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:05 +02:00
Sagi Grimberg 8b850475c0 nvme: replace ida_simple[get|remove] with the simler ida_[alloc|free]
ida_simple_[get|remove] are wrappers anyways.

Also, use ida_alloc_min with the ns_ida as namespace
enumeration starts with 1.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:05 +02:00
Chaitanya Kulkarni 581f19dd72 nvme-fabrics: remove unnecessary braces for case
Braces are not required for enum value NVME_SC_CONNECT_INVALID_PARAM
when used on the switch-case statement, remove the braces.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:04 +02:00
Chaitanya Kulkarni 72b3eab456 nvme-fabrics: use consistent zeroout pattern
Remove zeroout memeset call & zeroout local variable cmd at the time
of declaration in nvmf_ref_read32() similar to what we have done in
nvmf_reg_read64(), nvmf_reg_write32(), nvmf_connect_admin_queue(), and
nvmf_connect_io_queue().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:04 +02:00
Chaitanya Kulkarni 0801a4b630 nvme-fabrics: use unsigned int type
Loop variable i will never have a negative value, so use
unsigned int type instaed of int.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:04 +02:00
Chaitanya Kulkarni 572c97355b nvme-fabrics: use unsigned int type
Loop variable i will never have a negative value, so use
unsigned int type instaed of int.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:04 +02:00
Chaitanya Kulkarni ba3266434d nvme-core: remove unnecessary function parameter
In function nvme_execute_rq() we don't use gendisk parameter at all.
Remove the unsed parameter and adjust the calls.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:04 +02:00
Chaitanya Kulkarni 50ab19d89f nvme-core: remove unnecessary semicolon
It is not a good practice to have a semicolon at the end of the
function definition. Remove it from nvme_pr_type().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-28 13:45:04 +02:00
Varun Prakash c2700d2886 nvme-tcp: send H2CData PDUs based on MAXH2CDATA
As per NVMe/TCP specification (revision 1.0a, section 3.6.2.3)
Maximum Host to Controller Data length (MAXH2CDATA): Specifies the
maximum number of PDU-Data bytes per H2CData PDU in bytes. This value
is a multiple of dwords and should be no less than 4,096.

Current code sets H2CData PDU data_length to r2t_length,
it does not check MAXH2CDATA value. Fix this by setting H2CData PDU
data_length to min(req->h2cdata_left, queue->maxh2cdata).

Also validate MAXH2CDATA value returned by target in ICResp PDU,
if it is not a multiple of dword or if it is less than 4096 return
-EINVAL from nvme_tcp_init_connection().

Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-23 14:43:11 +01:00
Christoph Hellwig 602e57c979 nvme: also mark passthrough-only namespaces ready in nvme_update_ns_info
Commit e7d65803e2 ("nvme-multipath: revalidate paths during rescan")
introduced the NVME_NS_READY flag, which nvme_path_is_disabled() uses
to check if a path can be used or not.  We also need to set this flag
for devices that fail the ZNS feature validation and which are available
through passthrough devices only to that they can be used in multipathing
setups.

Fixes: e7d65803e2 ("nvme-multipath: revalidate paths during rescan")
Reported-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Kanchan Joshi <joshi.k@samsung.com>
2022-02-23 14:42:58 +01:00
Christoph Hellwig 363f636860 nvme: don't return an error from nvme_configure_metadata
When a fabrics controller claims to support an invalidate metadata
configuration we already warn and disable metadata support.  No need to
also return an error during revalidation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Kanchan Joshi <joshi.k@samsung.com>
2022-02-23 14:42:51 +01:00
Christoph Hellwig 7a5428dcb7 block: fix surprise removal for drivers calling blk_set_queue_dying
Various block drivers call blk_set_queue_dying to mark a disk as dead due
to surprise removal events, but since commit 8e141f9eb8 that doesn't
work given that the GD_DEAD flag needs to be set to stop I/O.

Replace the driver calls to blk_set_queue_dying with a new (and properly
documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the
only remaining caller.

Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-17 07:54:03 -07:00
Sagi Grimberg 63573807b2 nvme-tcp: fix bogus request completion when failing to send AER
AER is not backed by a real request, hence we should not incorrectly
assume that when failing to send a nvme command, it is a normal request
but rather check if this is an aer and if so complete the aer (similar
to the normal completion path).

Cc: stable@vger.kernel.org
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-09 14:50:42 +01:00
Bean Huo 00e757b648 nvme: add nvme_complete_req tracepoint for batched completion
Add NVMe request completion trace in nvme_complete_batch_req() because
nvme:nvme_complete_req tracepoint is missing in case of request batched
completion.

Signed-off-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-09 14:50:42 +01:00
Uday Shankar 6a51abdeb2 nvme-fabrics: fix state check in nvmf_ctlr_matches_baseopts()
Controller deletion/reset, immediately followed by or concurrent with
a reconnect, is hard failing the connect attempt resulting in a
complete loss of connectivity to the controller.

In the connect request, fabrics looks for an existing controller with
the same address components and aborts the connect if a controller
already exists and the duplicate connect option isn't set. The match
routine filters out controllers that are dead or dying, so they don't
interfere with the new connect request.

When NVME_CTRL_DELETING_NOIO was added, it missed updating the state
filters in the nvmf_ctlr_matches_baseopts() routine. Thus, when in this
new state, it's seen as a live controller and fails the connect request.

Correct by adding the DELETING_NIO state to the match checks.

Fixes: ecca390e80 ("nvme: fix deadlock in disconnect during scan_work and/or ana_work")
Cc: <stable@vger.kernel.org> # v5.7+
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-03 07:30:57 +01:00
Sagi Grimberg b6bb1722f3 nvme-rdma: fix possible use-after-free in transport error_recovery work
While nvme_rdma_submit_async_event_work is checking the ctrl and queue
state before preparing the AER command and scheduling io_work, in order
to fully prevent a race where this check is not reliable the error
recovery work must flush async_event_work before continuing to destroy
the admin queue after setting the ctrl state to RESETTING such that
there is no race .submit_async_event and the error recovery handler
itself changing the ctrl state.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2022-02-02 09:19:07 +01:00
Sagi Grimberg ff9fc7ebf5 nvme-tcp: fix possible use-after-free in transport error_recovery work
While nvme_tcp_submit_async_event_work is checking the ctrl and queue
state before preparing the AER command and scheduling io_work, in order
to fully prevent a race where this check is not reliable the error
recovery work must flush async_event_work before continuing to destroy
the admin queue after setting the ctrl state to RESETTING such that
there is no race .submit_async_event and the error recovery handler
itself changing the ctrl state.

Tested-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2022-02-02 09:19:07 +01:00
Sagi Grimberg 0fa0f99fc8 nvme: fix a possible use-after-free in controller reset during load
Unlike .queue_rq, in .submit_async_event drivers may not check the ctrl
readiness for AER submission. This may lead to a use-after-free
condition that was observed with nvme-tcp.

The race condition may happen in the following scenario:
1. driver executes its reset_ctrl_work
2. -> nvme_stop_ctrl - flushes ctrl async_event_work
3. ctrl sends AEN which is received by the host, which in turn
   schedules AEN handling
4. teardown admin queue (which releases the queue socket)
5. AEN processed, submits another AER, calling the driver to submit
6. driver attempts to send the cmd
==> use-after-free

In order to fix that, add ctrl state check to validate the ctrl
is actually able to accept the AER submission.

This addresses the above race in controller resets because the driver
during teardown should:
1. change ctrl state to RESETTING
2. flush async_event_work (as well as other async work elements)

So after 1,2, any other AER command will find the
ctrl state to be RESETTING and bail out without submitting the AER.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2022-02-02 09:19:05 +01:00
Changcheng Deng a5f3851b7f nvme-fabrics: remove the unneeded ret variable in nvmf_dev_show
Remove unneeded variable and directly return 0.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-01-27 08:17:17 +01:00
Wu Zheng 25e58af4be nvme-pci: add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDs
The Intel P4500/P4600 SSDs do not report a subsystem NQN despite claiming
compliance to a standards version where reporting one is required.

Add the IGNORE_DEV_SUBNQN quirk to not fail the initialization of a
second such SSDs in a system.

Signed-off-by: Zheng Wu <wu.zheng@intel.com>
Signed-off-by: Ye Jinhe <jinhe.ye@intel.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-01-27 08:17:14 +01:00
Linus Torvalds c9193f48e9 for-5.17/drivers-2022-01-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHd8EIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnOKEADGpxp+Vntbm8nZI/PFP5fA2gUTZWgSVB4l
 axVTYW21pjSrsrAhGg2FIgBgL0tNkgxQnIPRn50YL8jT3pTkCEcR7kLbhEU7W/Ln
 7hrsBgFnsCBoCs38LvzXHZD69jtEtNRk1ijPMLo5iCcHkAyUVKa1glfeMwefuI5/
 Rl8SoueRXppvCfwNPptaAKiDsYVN8KCJPvvhlMNoKP5n1iTsNYJ/HVsLqfRnP0oc
 CR6eHaYceWGLER8tWtBlG2Qp40+cd/A320thkIlEpEKJPWE/ce5AUp0PYxVJbwjU
 qvO1tMYSya7gPiaVWRJcUeAgRFiivM/kTdDrGwiY9hpv/BQG7EAW5D9Xecz/M4UG
 BgNLfhe0aR9QssjPxITgyiy9sRpwwpnpoVONTu3slgXVTUVlOq0QT6LOTPR1B9A4
 ZjbHVCuI3eyrAOqD4IjYSqjHa6GjFLiKTh8Q0ZB/KJGX1eItLVLVdJfcfV4RkBIf
 6RZg9+7/mXaDxU74DZ2tfUhHT0sC5RS+5VFxpkhThVk9qRbVdZGGWAHcVOkMjk9B
 L4PCpJeuaR+rzXvCDOCOI5sHraa5F/IRhMaTu5sHj/MIuEpq1fqjaB7tWRvfm6HO
 4tepUtb++rS3/zFFQlZCLyjVk2o0p2b0viwPLjvsRqsBp1bVoO9mJIiyp6POmM3G
 UjxQS0vEDw==
 =k0IZ
 -----END PGP SIGNATURE-----

Merge tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - mtip32xx pci cleanups (Bjorn)

 - mtip32xx conversion to generic power management (Vaibhav)

 - rsxx pci powermanagement cleanups (Bjorn)

 - Remove the rsxx driver. This hardware never saw much adoption, and
   it's been end of lifed for a while. (Christoph)

 - MD pull request from Song:
      - REQ_NOWAIT support (Vishal Verma)
      - raid6 benchmark optimization (Dirk Müller)
      - Fix for acct bioset (Xiao Ni)
      - Clean up max_queued_requests (Mariusz Tkaczyk)
      - PREEMPT_RT optimization (Davidlohr Bueso)
      - Use default_groups in kobj_type (Greg Kroah-Hartman)

 - Use attribute groups in pktcdvd and rnbd (Greg)

 - NVMe pull request from Christoph:
      - increment request genctr on completion (Keith Busch, Geliang
        Tang)
      - add a 'iopolicy' module parameter (Hannes Reinecke)
      - print out valid arguments when reading from /dev/nvme-fabrics
        (Hannes Reinecke)

 - Use struct_group() in drbd (Kees)

 - null_blk fixes (Ming)

 - Get rid of congestion logic in pktcdvd (Neil)

 - Floppy ejection hang fix (Tasos)

 - Floppy max user request size fix (Xiongwei)

 - Loop locking fix (Tetsuo)

* tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-block: (32 commits)
  md: use default_groups in kobj_type
  md: Move alloc/free acct bioset in to personality
  lib/raid6: Use strict priority ranking for pq gen() benchmarking
  lib/raid6: skip benchmark of non-chosen xor_syndrome functions
  md: fix spelling of "its"
  md: raid456 add nowait support
  md: raid10 add nowait support
  md: raid1 add nowait support
  md: add support for REQ_NOWAIT
  md: drop queue limitation for RAID1 and RAID10
  md/raid5: play nice with PREEMPT_RT
  block/rnbd-clt-sysfs: use default_groups in kobj_type
  pktcdvd: convert to use attribute groups
  block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0
  nvme: add 'iopolicy' module parameter
  nvme: drop unused variable ctrl in nvme_setup_cmd
  nvme: increment request genctr on completion
  nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
  block: remove the rsxx driver
  rsxx: Drop PCI legacy power management
  ...
2022-01-12 10:35:23 -08:00
Linus Torvalds d3c8108035 for-5.17/block-2022-01-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHd8DAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnhRD/wMAjsNO65PCA+o/bPpVi4ulx9EejAzrJnB
 5vHFvREAoOOGKvRpYGe4w3TcKyW+zPb+GtlXFjPfK+wuVzWhrQtW/+vkjKlBt8wK
 o7rzeMwTKJ9ZGvYaaQpp1yC0WURBB3qnCRQhb8dOQzhJgEXinhIOznZsut4mniLv
 fTqcDmKAb/+G6K6CQCCqnH0I/+OJZyUeSFo1kk2i4ZqCBepQpBkOL6H2rBOtGxUg
 bt1jiGHbbhCRYEE3u2kV0HP10qAChNaMQC705jV4Qpf4+3EntSxs+6nSb74dvMkX
 3+Wmp8Ctq6lpPnDL1nrAFGz3jZnB0Y+GdgOclQn3ViQd1FCXZzuYWQ3fTaBfURCZ
 /RE5nc047SqpwCFLOynM++OkaeQZ1zSxeyoFTtzDaPF4tLuaX3JHswvTzNGPw8SN
 BnexseNnNBCjJliZSEE7fOkjJDcev2dvRxPtI8/wkF4lHUgETc5IW563C53xo/Tx
 32yFjZwCVIpNWk21su/0H3iEq80wZ7PnriiN/E3JA6XbnevlRPu0NPMb0D258GCm
 yCcdPVDNZsQCB8hluqZcu0g6LSgZRo90Yg1oqKqEpAllJJMBaEAPPPuUIJh998mo
 iKGxZzgr7d9jrbGJTInp0F8b3B3/oV/hxgzy0Hu/mHP3AsnaAk9o/oEQZ7rX4Khr
 6biloqkIMA==
 =RWnJ
 -----END PGP SIGNATURE-----

Merge tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Unify where the struct request handling code is located in the blk-mq
   code (Christoph)

 - Header cleanups (Christoph)

 - Clean up the io_context handling code (Christoph, me)

 - Get rid of ->rq_disk in struct request (Christoph)

 - Error handling fix for add_disk() (Christoph)

 - request allocation cleanusp (Christoph)

 - Documentation updates (Eric, Matthew)

 - Remove trivial crypto unregister helper (Eric)

 - Reduce shared tag overhead (John)

 - Reduce poll_stats memory overhead (me)

 - Known indirect function call for dio (me)

 - Use atomic references for struct request (me)

 - Support request list issue for block and NVMe (me)

 - Improve queue dispatch pinning (Ming)

 - Improve the direct list issue code (Keith)

 - BFQ improvements (Jan)

 - Direct completion helper and use it in mmc block (Sebastian)

 - Use raw spinlock for the blktrace code (Wander)

 - fsync error handling fix (Ye)

 - Various fixes and cleanups (Lukas, Randy, Yang, Tetsuo, Ming, me)

* tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block: (132 commits)
  MAINTAINERS: add entries for block layer documentation
  docs: block: remove queue-sysfs.rst
  docs: sysfs-block: document virt_boundary_mask
  docs: sysfs-block: document stable_writes
  docs: sysfs-block: fill in missing documentation from queue-sysfs.rst
  docs: sysfs-block: add contact for nomerges
  docs: sysfs-block: sort alphabetically
  docs: sysfs-block: move to stable directory
  block: don't protect submit_bio_checks by q_usage_counter
  block: fix old-style declaration
  nvme-pci: fix queue_rqs list splitting
  block: introduce rq_list_move
  block: introduce rq_list_for_each_safe macro
  block: move rq_list macros to blk-mq.h
  block: drop needless assignment in set_task_ioprio()
  block: remove unnecessary trailing '\'
  bio.h: fix kernel-doc warnings
  block: check minor range in device_add_disk()
  block: use "unsigned long" for blk_validate_block_size().
  block: fix error unwinding in device_add_disk
  ...
2022-01-12 10:26:52 -08:00
Keith Busch 6bfec7992e nvme-pci: fix queue_rqs list splitting
If command prep fails, current handling will orphan subsequent requests
in the list. Consider a simple example:

  rqlist = [ 1 -> 2 ]

When prep for request '1' fails, it will be appended to the
'requeue_list', leaving request '2' disconnected from the original
rqlist and no longer tracked. Meanwhile, rqlist is still pointing to the
failed request '1' and will attempt to submit the unprepped command.

Fix this by updating the rqlist accordingly using the request list
helper functions.

Fixes: d62cbcf62f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220105170518.3181469-5-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-05 12:25:42 -07:00
Hannes Reinecke e3d3479439 nvme: add 'iopolicy' module parameter
While the 'iopolicy' sysfs attribute can be set at runtime, most
storage arrays prefer to use the 'round-robin' iopolicy per default.
We can use udev rules to set this, but is getting rather unwieldy
for rebranded arrays as we would have to update the udev rules
anytime a new array shows up, leading to the same mess we currently
have in multipathd for configuring the RDAC arrays.

Hence this patch adds a module parameter 'iopolicy' to allow the
admin to switch the default, and to do away with the need for a
udev rule here.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-23 11:22:46 +01:00
Geliang Tang 3a605e32a7 nvme: drop unused variable ctrl in nvme_setup_cmd
The variable 'ctrl' became useless since the code using it was dropped
from nvme_setup_cmd() in the commit 292ddf67bbd5 ("nvme: increment
request genctr on completion"). Fix it to get rid of this compilation
warning in the nvme-5.17 branch:

 drivers/nvme/host/core.c: In function ‘nvme_setup_cmd’:
 drivers/nvme/host/core.c:993:20: warning: unused variable ‘ctrl’ [-Wunused-variable]
   struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
                     ^~~~

Fixes: 292ddf67bbd5 ("nvme: increment request genctr on completion")
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-23 11:22:46 +01:00
Keith Busch e4fdb2b167 nvme: increment request genctr on completion
The nvme request generation counter is intended to catch duplicate
completions. Incrementing the counter on submission means duplicates can
only be caught if the request tag is reallocated and dispatched prior to
the driver observing the corrupted CQE. Incrementing on completion
removes this window, making it possible to detect duplicate completions
in consecutive entries.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-23 11:22:45 +01:00
Hannes Reinecke f18ee3d988 nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
Currently applications have a hard time figuring out which
nvme-over-fabrics arguments are supported for any given kernel;
the ioctl will return an error code on failure, and the application
has to guess whether this was due to an invalid argument or due
to a connection or controller error.
With this patch applications can read a list of supported
arguments by simply reading from /dev/nvme-fabrics, allowing
them to validate the connection string.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-23 11:22:45 +01:00
Jens Axboe d62cbcf62f nvme: add support for mq_ops->queue_rqs()
This enables the block layer to send us a full plug list of requests
that need submitting. The block layer guarantees that they all belong
to the same queue, but we do have to check the hardware queue mapping
for each request.

If errors are encountered, leave them in the passed in list. Then the
block layer will handle them individually.

This is good for about a 4% improvement in peak performance, taking us
from 9.6M to 10M IOPS/core.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:54:36 -07:00
Jens Axboe 62451a2b2e nvme: separate command prep and issue
Add a nvme_prep_rq() helper to setup a command, and nvme_queue_rq() is
adapted to use this helper.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:54:35 -07:00
Jens Axboe 3233b94cf8 nvme: split command copy into a helper
We'll need it for batched submit as well. Since we now have a copy
helper, get rid of the nvme_submit_cmd() wrapper.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:54:30 -07:00
Ruozhu Li 8b77fa6fdc nvme: fix use after free when disconnecting a reconnecting ctrl
A crash happens when trying to disconnect a reconnecting ctrl:

 1) The network was cut off when the connection was just established,
    scan work hang there waiting for some IOs complete.  Those I/Os were
    retried because we return BLK_STS_RESOURCE to blk in reconnecting.
 2) After a while, I tried to disconnect this connection.  This
    procedure also hangs because it tried to obtain ctrl->scan_lock.
    It should be noted that now we have switched the controller state
    to NVME_CTRL_DELETING.
 3) In nvme_check_ready(), we always return true when ctrl->state is
    NVME_CTRL_DELETING, so those retrying I/Os were issued to the bottom
    device which was already freed.

To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to bottom
device only when queue state is live.  If not, return host path error to
the block layer

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-07 18:21:16 +01:00
Hou Tao c7c15ae3dc nvme-multipath: set ana_log_size to 0 after free ana_log_buf
Set ana_log_size to 0 when ana_log_buf is freed to make sure
nvme_mpath_init_identify will do the right thing when retrying
after an earlier failure.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-07 18:19:28 +01:00
Niklas Cassel 793fcab83f nvme: report write pointer for a full zone as zone start + zone len
The write pointer in NVMe ZNS is invalid for a zone in zone state full.
The same also holds true for ZAC/ZBC.

The current behavior for NVMe is to simply propagate the wp reported by
the drive, even for full zones. Since the wp is invalid for a full zone,
the wp reported by the drive may be any value.

The way that the sd_zbc driver handles a full zone is to always report
the wp as zone start + zone len, regardless of what the drive reported.
null_blk also follows this convention.

Do the same for NVMe, so that a BLKREPORTZONE ioctl reports the write
pointer for a full zone in a consistent way, regardless of the interface
of the underlying zoned block device.

blkzone report before patch:
start: 0x000040000, len 0x040000, cap 0x03e000, wptr 0xfffffffffffbfff8
reset:0 non-seq:0, zcond:14(fu) [type: 2(SEQ_WRITE_REQUIRED)]

blkzone report after patch:
start: 0x000040000, len 0x040000, cap 0x03e000, wptr 0x040000 reset:0
non-seq:0, zcond:14(fu) [type: 2(SEQ_WRITE_REQUIRED)]

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-06 08:52:08 +01:00
Keith Busch d39ad2a45c nvme: disable namespace access for unsupported metadata
The only fabrics target that supports metadata handling through the
separate integrity buffer is RDMA. It is currently usable only if the
size is 8B per block and formatted for protection information. If an
rdma target were to export a namespace with a different format (ex:
4k+64B), the driver will not be able to submit valid read/write commands
for that namespace.

Suppress setting the metadata feature in the namespace so that the
gendisk capacity will be set to 0. This will prevent read/write access
through the block stack, but will continue to allow ioctl passthrough
commands.

Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-06 08:52:08 +01:00
Keith Busch 16cc33b237 nvme: show subsys nqn for duplicate cntlids
The driver assigned nvme handle isn't persistent across reboots, so is
not enough information to match up where the collisions are occuring.
Add the subsys nqn string to the output so that it can more easily be
identified later.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215099
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-06 08:52:08 +01:00
Christoph Hellwig b84ba30b6c block: remove the gendisk argument to blk_execute_rq
Remove the gendisk aregument to blk_execute_rq and blk_execute_rq_nowait
given that it is unused now.  Also convert the boolean at_head parameter
to actually use the bool type while touching the prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig f3fa33acca block: remove the ->rq_disk field in struct request
Just use the disk attached to the request_queue instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Klaus Jensen 00b33cf3da nvme: fix write zeroes pi
Write Zeroes sets PRACT when block integrity is enabled (as it should),
but neglects to also set the reftag which is expected by reads. This
causes protection errors on reads.

Fix this by setting the reftag for type 1 and 2 (for type 3, reads will
not check the reftag).

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-11-23 17:22:41 +01:00
Maurizio Lombardi 8e8aaf512a nvme-fabrics: ignore invalid fast_io_fail_tmo values
Valid fast_io_fail_tmo values are integers >= 0 or -1 (disabled).
Prevent userspace from setting arbitrary negative values.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-11-23 17:22:41 +01:00
Enzo Matsumiya 5a6254d55e nvme-pci: add NO APST quirk for Kioxia device
This particular Kioxia device times out and aborts I/O during any load,
but it's more easily observable with discards (fstrim).

The device gets to a state that is also not possible to use
"nvme set-feature" to disable APST.
Booting with nvme_core.default_ps_max_latency=0 solves the issue.

We had a dozen or so of these devices behaving this same way in
customer environments.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-11-23 17:22:41 +01:00
Maurizio Lombardi a5053c92b3 nvme-tcp: fix memory leak when freeing a queue
Release the page frag cache when tearing down the io queues

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-11-23 17:22:41 +01:00
Varun Prakash 1d3ef9c3a3 nvme-tcp: validate R2T PDU in nvme_tcp_handle_r2t()
If maxh2cdata < r2t_length then driver will form multiple
H2CData PDUs, validate R2T PDU in nvme_tcp_handle_r2t() to
reuse nvme_tcp_setup_h2c_data_pdu().

Also set req->state to NVME_TCP_SEND_H2C_PDU in
nvme_tcp_setup_h2c_data_pdu().

Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-11-23 17:22:40 +01:00
Linus Torvalds 3e28850cbd for-5.16/block-2021-11-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKqAcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpojrD/4yA+GgV+jWeIepYWvU81TQFpt9AJmzWbrY
 uryj4dy7EdMjun+JkAP8k4qreqvTZRsJMkr9dhmS4qaM8/Vt8K/RU/0n/lxNVmqc
 1//ZaTS6DURVAc52GHIXD3q4cv8pHofTZZlrj1Hgz35shlOayStGJtktH5f8uQl4
 5Yxjh+HKr15Chym+fKlbR6T7BgVxxNyhT9q89BgUwMAJX+1KRVtwtkyVK5IbObFy
 zOeiC+n9niQ6iJHcLoqb7LjfBOs/VjdNOQYGSCAnrBxuQ8GnEP2xDw2nvFlOPE12
 5tWEwTgAX7381ilbL6VvNTlTafIs/Axt8mI0cY/OMW7ApiHwO3rXjQSqA4yrnKCJ
 h6M1QavqThd2DtMnOi0U5wwgtD2UjS+CMpK5XFxeIyl6GqTgZcaWm3VqRnG68KZD
 r5+o99GKWCHy0cckxq2WiWJouReeNZ9u9R6HNDw0Vb8UNyWgBR+v2MkX+SHS/c85
 2gXm10hwBH7BFnC4X8ceiuT/bm7xm9S6D/3LCVitlUTBRfqobsQEQjSciPeoOtL0
 rRSTKob7jtokiB2q01wx3q1jnUMpxE1fqJkpLjUvebTzw+a+xfPwy0nNTGq0XXIv
 WMVRRpSWCZm04Ru0q/K8cj0GOyur5x+ilefZ1V+/sRU5dVmGuJgbJUxei1HPC6eV
 z9Rn0aFv4g==
 =1GPi
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Set of fixes for the batched tag allocation (Ming, me)

 - add_disk() error handling fix (Luis)

 - Nested queue quiesce fixes (Ming)

 - Shared tags init error handling fix (Ye)

 - Misc cleanups (Jean, Ming, me)

* tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block:
  nvme: wait until quiesce is done
  scsi: make sure that request queue queiesce and unquiesce balanced
  scsi: avoid to quiesce sdev->request_queue two times
  blk-mq: add one API for waiting until quiesce is done
  blk-mq: don't free tags if the tag_set is used by other device in queue initialztion
  block: fix device_add_disk() kobject_create_and_add() error handling
  block: ensure cached plug request matches the current queue
  block: move queue enter logic into blk_mq_submit_bio()
  block: make bio_queue_enter() fast-path available inline
  block: split request allocation components into helpers
  block: have plug stored requests hold references to the queue
  blk-mq: update hctx->nr_active in blk_mq_end_request_batch()
  blk-mq: add RQF_ELV debug entry
  blk-mq: only try to run plug merge if request has same queue with incoming bio
  block: move RQF_ELV setting into allocators
  dm: don't stop request queue after the dm device is suspended
  block: replace always false argument with 'false'
  block: assign correct tag before doing prefetch of request
  blk-mq: fix redundant check of !e expression
2021-11-09 11:20:07 -08:00
Ming Lei 26af1cd003 nvme: wait until quiesce is done
NVMe uses one atomic flag to check if quiesce is needed. If quiesce is
started, the helper returns immediately. This way is wrong, since we
have to wait until quiesce is done.

Fixes: e70feb8b3e ("blk-mq: support concurrent queue quiesce/unquiesce")
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211109071144.181581-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 08:14:27 -07:00
Linus Torvalds 643a7234e0 for-5.16/drivers-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KFsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgph1ZEACwNuHkAZcIgNzKhzuLP9OjMhv9vV+q254G
 /EcM31e+qgRioMd0ihbVsgW76jOwLEmb3ldKGcN+0Wo5+Sv9Im8+wAWYY1REOZO5
 ZTUBfAzhEh63/EtqTFiU8U+7dmXqy4z7NaICnhlynjwkd3IT+I561os6kcqwJMMr
 G+Q1Cnk9rgCMIoLOCoVThIpjmjyZzF33qJb2VEIkHfkot62iNdpABWaSASF+CCba
 z8LfbvLAYz3YLl4thXlLJFU282T5y7gzgSomGvX4F0rMJSbqFbgoNEPxaYw9CvzC
 uC6MnYCYdCdvVkWVm1b8I8LYzPd5GrpVOSh3JQGvuA4Ppv2IyJCDSruYGgVUlhao
 cVPzuHCqNCfKk0ykYVRZy9oKiBk5wmFeKM/lSHu408y8VNraPNIAEpB6sA9qGr22
 AYr8lNh3JDr0g8dtFsDOq+7u3MANW0KQozfzwTPZo6NjzEE1D2jIg39Ljiijo9+Y
 3pU8pitIAhsKd2KhW1H6LmtJbF4dX756VKYDXOhzgORU0NZYgvGhBIj9tAdpQR0S
 xeae5Kj0/wBGcqR/owf/n1EY/q7rWgNDETnsBhbmzMZyhwH3L6zhT+bfD8YoQCHY
 ueyqhyIUe4YBxTrIpICqwDlqaMYAmQ0jRaci+bK9ovVlQ89FQ9o/BE2COPlI/DGX
 w+rUmmoX4g==
 =HiWU
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - paride driver cleanups (Christoph)

 - Remove cryptoloop support (Christoph)

 - null_blk poll support (me)

 - Now that add_disk() supports proper error handling, add it to various
   drivers (Luis)

 - Make ataflop actually work again (Michael)

 - s390 dasd fixes (Stefan, Heiko)

 - nbd fixes (Yu, Ye)

 - Remove redundant wq flush in mtip32xx (Christophe)

 - NVMe updates
      - fix a multipath partition scanning deadlock (Hannes Reinecke)
      - generate uevent once a multipath namespace is operational again
        (Hannes Reinecke)
      - support unique discovery controller NQNs (Hannes Reinecke)
      - fix use-after-free when a port is removed (Israel Rukshin)
      - clear shadow doorbell memory on resets (Keith Busch)
      - use struct_size (Len Baker)
      - add error handling support for add_disk (Luis Chamberlain)
      - limit the maximal queue size for RDMA controllers (Max Gurtovoy)
      - use a few more symbolic names (Max Gurtovoy)
      - fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy)
      - add support for ->map_queues on FC (Saurav Kashyap)
      - support the current discovery subsystem entry (Hannes Reinecke)
      - use flex_array_size and struct_size (Len Baker)

 - bcache fixes (Christoph, Coly, Chao, Lin, Qing)

 - MD updates (Christoph, Guoqing, Xiao)

 - Misc fixes (Dan, Ding, Jiapeng, Shin'ichiro, Ye)

* tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block: (117 commits)
  null_blk: Fix handling of submit_queues and poll_queues attributes
  block: ataflop: Fix warning comparing pointer to 0
  bcache: replace snprintf in show functions with sysfs_emit
  bcache: move uapi header bcache.h to bcache code directory
  nvmet: use flex_array_size and struct_size
  nvmet: register discovery subsystem as 'current'
  nvmet: switch check for subsystem type
  nvme: add new discovery log page entry definitions
  block: ataflop: more blk-mq refactoring fixes
  block: remove support for cryptoloop and the xor transfer
  mtd: add add_disk() error handling
  rnbd: add error handling support for add_disk()
  um/drivers/ubd_kern: add error handling support for add_disk()
  m68k/emu/nfblock: add error handling support for add_disk()
  xen-blkfront: add error handling support for add_disk()
  bcache: add error handling support for add_disk()
  dm: add add_disk() error handling
  block: aoe: fixup coccinelle warnings
  nvmet: use struct_size over open coded arithmetic
  nvme: drop scan_lock and always kick requeue list when removing namespaces
  ...
2021-11-01 09:27:38 -07:00
Linus Torvalds 33c8846c81 for-5.16/block-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KDgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmQ2D/wO0nH3U+3+OZChi3XUwYck9Dev3o6BANCF
 ClATiK/kivZY0xY1r8J4ixirZo2gcjIMpWSC3JGYZ5LdspfmYGLUbMjfZsaeU23i
 lAKaX1IqfArmHN76k3IU1bKCg7B0/LFwC0q9QTFWTSwNSs8RK/EZLJ61U1hEXUb3
 OfIpaMmvPiMaU7yuPqhcZK14m1cg1srrLM4rFB/PqsWWStF07pHq32WeArGDAU0e
 Fe0YSnYD7qqA5Qc37KwqjCTmmxKX5YZf7etIcA6p3DNmwcuQrVNzKoCH/ZEDijaD
 E2bS/BWbN1x96+rtoEZfBYEaNIrkmJzmW6+fJ53OITbJF3KqP6V66erhqNcFYCzC
 mhFlRe7voXb/8AP7zQqSIhK529BUBM36sQ6nF7EiQcDrfLc1z39mq6eblUxbknIA
 DDPISD5Tseik9N9x0bc7vINseKyHI1E90VAU/XKADcuGbzLvehPx+2p+Iq5ch5Ah
 oa1G3RdlWWQOZxphJHWJhu1qMfo5+FP9dFZj1aoo7b8Kbc/CedyoQe71cpIE5wNh
 Jj/EpWJnuyKXwuTic2VYGC+6ezM9O5DSdqCfP3YuZky95VESyvRCKJYMMgBYRVdC
 /LuxhnBXIY2G8An7ZTnX0kLCCvLbapIwa0NyA98/xeOngO843coJ6wn8ZmE9LJNH
 kMmpCygUrA==
 =QWC+
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - mq-deadline accounting improvements (Bart)

 - blk-wbt timer fix (Andrea)

 - Untangle the block layer includes (Christoph)

 - Rework the poll support to be bio based, which will enable adding
   support for polling for bio based drivers (Christoph)

 - Block layer core support for multi-actuator drives (Damien)

 - blk-crypto improvements (Eric)

 - Batched tag allocation support (me)

 - Request completion batching support (me)

 - Plugging improvements (me)

 - Shared tag set improvements (John)

 - Concurrent queue quiesce support (Ming)

 - Cache bdev in ->private_data for block devices (Pavel)

 - bdev dio improvements (Pavel)

 - Block device invalidation and block size improvements (Xie)

 - Various cleanups, fixes, and improvements (Christoph, Jackie,
   Masahira, Tejun, Yu, Pavel, Zheng, me)

* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
  blk-mq-debugfs: Show active requests per queue for shared tags
  block: improve readability of blk_mq_end_request_batch()
  virtio-blk: Use blk_validate_block_size() to validate block size
  loop: Use blk_validate_block_size() to validate block size
  nbd: Use blk_validate_block_size() to validate block size
  block: Add a helper to validate the block size
  block: re-flow blk_mq_rq_ctx_init()
  block: prefetch request to be initialized
  block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
  block: add rq_flags to struct blk_mq_alloc_data
  block: add async version of bio_set_polled
  block: kill DIO_MULTI_BIO
  block: kill unused polling bits in __blkdev_direct_IO()
  block: avoid extra iter advance with async iocb
  block: Add independent access ranges support
  blk-mq: don't issue request directly in case that current is to be blocked
  sbitmap: silence data race warning
  blk-cgroup: synchronize blkg creation against policy deactivation
  block: refactor bio_iov_bvec_set()
  block: add single bio async direct IO helper
  ...
2021-11-01 09:19:50 -07:00
Len Baker d156cfcafb nvmet: use flex_array_size and struct_size
In an effort to avoid open-coded arithmetic in the kernel [1], use the
flex_array_size() and struct_size() helpers instead of an open-coded
calculation.

[1] https://github.com/KSPP/linux/issues/160

Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-27 08:06:41 +02:00
Varun Prakash d89b9f3bbb nvme-tcp: fix data digest pointer calculation
ddgst is of type __le32, &req->ddgst + req->offset
increases &req->ddgst by 4 * req->offset, fix this by
type casting &req->ddgst to u8 *.

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-27 07:58:26 +02:00
Varun Prakash ce7723e9cd nvme-tcp: fix possible req->offset corruption
With commit db5ad6b7f8 ("nvme-tcp: try to send request in queue_rq
context") r2t and response PDU can get processed while send function
is executing.

Current data digest send code uses req->offset after kernel_sendmsg(),
this creates a race condition where req->offset gets reset before it
is used in send function.

This can happen in two cases -
1. Target sends r2t PDU which resets req->offset.
2. Target send response PDU which completes the req and then req is
   used for a new command, nvme_tcp_setup_cmd_pdu() resets req->offset.

Fix this by storing req->offset in a local variable and using
this local variable after kernel_sendmsg().

Fixes: db5ad6b7f8 ("nvme-tcp: try to send request in queue_rq context")
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-27 07:58:26 +02:00
Sagi Grimberg 25e1f67eda nvme-tcp: fix H2CData PDU send accounting (again)
We should not access request members after the last send, even to
determine if indeed it was the last data payload send. The reason is
that a completion could have arrived and trigger a new execution of the
request which overridden these members. This was fixed by commit
825619b09a ("nvme-tcp: fix possible use-after-completion").

Commit e371af033c broke that assumption again to address cases where
multiple r2t pdus are sent per request. To fix it, we need to record the
request data_sent and data_len and after the payload network send we
reference these counters to determine weather we should advance the
request iterator.

Fixes: e371af033c ("nvme-tcp: fix incorrect h2cdata pdu offset accounting")
Reported-by: Keith Busch <kbusch@kernel.org>
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-26 10:41:29 +02:00
Hannes Reinecke 2b81a5f015 nvme: drop scan_lock and always kick requeue list when removing namespaces
When reading the partition table on initial scan hits an I/O error the
I/O will hang with the scan_mutex held:

[<0>] do_read_cache_page+0x49b/0x790
[<0>] read_part_sector+0x39/0xe0
[<0>] read_lba+0xf9/0x1d0
[<0>] efi_partition+0xf1/0x7f0
[<0>] bdev_disk_changed+0x1ee/0x550
[<0>] blkdev_get_whole+0x81/0x90
[<0>] blkdev_get_by_dev+0x128/0x2e0
[<0>] device_add_disk+0x377/0x3c0
[<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
[<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
[<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
[<0>] nvme_scan_work+0x168/0x310 [nvme_core]
[<0>] process_one_work+0x231/0x420

and trying to delete the controller will deadlock as it tries to grab
the scan mutex:

[<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
[<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
[<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]

As we're now properly ordering the namespace list there is no need to
hold the scan_mutex in nvme_mpath_clear_ctrl_paths() anymore.
And we always need to kick the requeue list as the path will be marked
as unusable and I/O will be requeued _without_ a current path.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:23:30 +02:00
Keith Busch 58847f12fe nvme-pci: clear shadow doorbell memory on resets
The host memory doorbell and event buffers need to be initialized on
each reset so the driver doesn't observe stale values from the previous
instantiation.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: John Levon <john.levon@nutanix.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:23:29 +02:00
Max Gurtovoy 0974812200 nvme-rdma: fix error code in nvme_rdma_setup_ctrl
In case that icdoff is not zero or mandatory keyed sgls are not
supported by the NVMe/RDMA target, we'll go to error flow but we'll
return 0 to the caller. Fix it by returning an appropriate error code.

Fixes: c66e2998c8 ("nvme-rdma: centralize controller setup sequence")
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:03 +02:00
Luis Chamberlain 11384580e3 nvme-multipath: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Since we now can tell for sure when a disk was added, move
setting the bit NVME_NSHEAD_DISK_LIVE only when we did
add the disk successfully.

Nothing to do here as the cleanup is done elsewhere. We take
care and use test_and_set_bit() because it is protects against
two nvme paths simultaneously calling device_add_disk() on the
same namespace head.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:03 +02:00
Hannes Reinecke e5ea42faa7 nvme: display correct subsystem NQN
With discovery controllers supporting unique subsystem NQNs the
actual subsystem NQN might be different from that one passed in
via the connect args. So add a helper to display the resulting
subsystem NQN.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Hannes Reinecke 20e8b689c9 nvme: Add connect option 'discovery'
Add a connect option 'discovery' to specify that the connection
should be made to a discovery controller, not a normal I/O controller.
With discovery controllers supporting unique subsystem NQNs we
cannot easily distinguish by the subsystem NQN if this should be
a discovery connection, but we need this information to blank out
options not supported by discovery controllers.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Hannes Reinecke 954ae16681 nvme: expose subsystem type in sysfs attribute 'subsystype'
With unique discovery controller NQNs we cannot distinguish the
subsystem type by the NQN alone, but need to check the subsystem
type, too.
So expose the subsystem type in a new sysfs attribute 'subsystype'.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Max Gurtovoy 44c3c6257e nvme-rdma: limit the maximal queue size for RDMA controllers
Corrent limit of 1024 isn't valid for some of the RDMA based ctrls. In
case the target expose a cap of larger amount of entries (e.g. 1024),
the initiator may fail to create a QP with this size. Thus limit to a
value that works for all RDMA adapters.

Future general solution should use RDMA/core API to calculate this size
according to device capabilities and number of WRs needed per NVMe IO
request.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:01 +02:00
Saurav Kashyap 01d838164b nvme-fc: add support for ->map_queues
NVMe FC don't have support for map queues, unlike the PCI, RDMA and TCP
transports.  Add a ->map_queues callout for the LLDDs to provide such
functionality.

Signed-off-by: Saurav Kashyap <skashyap@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:00 +02:00
Hannes Reinecke f6f09c15a7 nvme: generate uevent once a multipath namespace is operational again
When fast_io_fail_tmo is set I/O will be aborted while recovery is
still ongoing. This causes MD to set the namespace to failed, and
no futher I/O will be submitted to that namespace.

However, once the recovery succeeds and the namespace becomes
operational again the NVMe subsystem doesn't send a notification,
so MD cannot automatically reinstate operation and requires
manual interaction.

This patch will send a KOBJ_CHANGE uevent per multipathed namespace
once the underlying controller transitions to LIVE, allowing an automatic
MD reassembly with these udev rules:

/etc/udev/rules.d/65-md-auto-re-add.rules:
SUBSYSTEM!="block", GOTO="md_end"

ACTION!="change", GOTO="md_end"
ENV{ID_FS_TYPE}!="linux_raid_member", GOTO="md_end"
PROGRAM="/sbin/md_raid_auto_readd.sh $devnode"
LABEL="md_end"

/sbin/md_raid_auto_readd.sh:

MDADM=/sbin/mdadm
DEVNAME=$1

export $(${MDADM} --examine --export ${DEVNAME})

if [ -z "${MD_UUID}" ]; then
    exit 1
fi

UUID_LINK=$(readlink /dev/disk/by-id/md-uuid-${MD_UUID})
MD_DEVNAME=${UUID_LINK##*/}
export $(${MDADM} --detail --export /dev/${MD_DEVNAME})
if [ -z "${MD_METADATA}" ] ; then
    exit 1
fi
if [ $(cat /sys/block/${MD_DEVNAME}/md/degraded) != 1 ]; then
    echo "${MD_DEVNAME}: array not degraded, nothing to do"
    exit 0
fi
MD_STATE=$(cat /sys/block/${MD_DEVNAME}/md/array_state)
if [ ${MD_STATE} != "clean" ] ; then
    echo "${MD_DEVNAME}: array state ${MD_STATE}, cannot re-add"
    exit 1
fi
MD_VARNAME="MD_DEVICE_dev_${DEVNAME##*/}_ROLE"
if [ ${!MD_VARNAME} = "spare" ] ; then
    ${MDADM} --manage /dev/${MD_DEVNAME} --re-add ${DEVNAME}
fi

Changes to v2:
- Add udev rules example to description
Changes to v1:
- use disk_uevent() as suggested by hch

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:00 +02:00
Ming Lei 9e6a6b1212 nvme: paring quiesce/unquiesce
The current blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() always
stops and starts the queue unconditionally. And there can be concurrent
quiesce/unquiesce coming from different unrelated code paths, so
unquiesce may come unexpectedly and start queue too early.

Prepare for supporting concurrent quiesce/unquiesce from multiple
contexts, so that we can address the above issue.

NVMe has very complicated quiesce/unquiesce use pattern, add one atomic
bit for makeiing sure that blk-mq quiece/unquiesce is always called in
pair.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 18:27:58 -06:00
Ming Lei ebc9b95260 nvme: prepare for pairing quiescing and unquiescing
Add two helpers so that we can prepare for pairing quiescing and
unquiescing which will be done in next patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 18:27:58 -06:00
Ming Lei 6ca1d9027e nvme: apply nvme API to quiesce/unquiesce admin queue
Apply the added two APIs to quiesce/unquiesce admin queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 18:27:58 -06:00
Ming Lei a277654baf nvme: add APIs for stopping/starting admin queue
Add two APIs for stopping and starting admin queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 18:27:58 -06:00
Jens Axboe a9a7e30fd9 nvme: don't memset() the normal read/write command
This memset in the fast path costs a lot of cycles on my setup. Here's a
top-of-profile of doing ~6.7M IOPS:

+    5.90%  io_uring  [nvme]            [k] nvme_queue_rq
+    5.32%  io_uring  [nvme_core]       [k] nvme_setup_cmd
+    5.17%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
+    4.97%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO

and a perf diff with this patch:

     0.92%     +4.40%  [nvme_core]       [k] nvme_setup_cmd

reducing it from 5.3% to only 0.9%. This takes it from the 2nd most
cycle consumer to something that's mostly irrelevant.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 12:41:09 -06:00
Jens Axboe 9c3d29296f nvme: move command clear into the various setup helpers
We don't have to worry about doing extra memsets by moving it outside
the protection of RQF_DONTPREP, as nvme doesn't do partial completions.

This is in preparation for making the read/write fast path not do a full
memset of the command.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 12:40:51 -06:00
Jens Axboe 4f5022453a nvme: wire up completion batching for the IRQ path
Trivial to do now, just need our own io_comp_batch on the stack and pass
that in to the usual command completion handling.

I pondered making this dependent on how many entries we had to process,
but even for a single entry there's no discernable difference in
performance or latency. Running a sync workload over io_uring:

t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1

yields the below performance before the patch:

IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

and the following after:

IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

which definitely isn't slower, about the same if you factor in a bit of
variance. For peak performance workloads, benchmarking shows a 2%
improvement.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:40:47 -06:00
Jens Axboe c234a65392 nvme: add support for batched completion of polled IO
Take advantage of struct io_comp_batch, if passed in to the nvme poll
handler. If it's set, rather than complete each request individually
inline, store them in the io_comp_batch list. We only do so for requests
that will complete successfully, anything else will be completed inline as
before.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:40:45 -06:00
Jens Axboe 5a72e899ce block: add a struct io_comp_batch argument to fops->iopoll()
struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:40:40 -06:00
Christoph Hellwig c712dccc64 nvme-multipath: enable polled I/O
Set the poll queue flag to enable polling, given that the multipath
node just dispatches the bios to a lower queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig 3e08773c38 block: switch polling to be bio based
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

 - the cookie construction can made entirely private in blk-mq.c
 - the caller does not need to remember the request_queue and cookie
   separately and thus sidesteps their lifetime issues
 - keeping the device and the cookie inside the bio allows to trivially
   support polling BIOs remapping by stacking drivers
 - a lot of code to propagate the cookie back up the submission path can
   be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig 6ce913fe3e block: rename REQ_HIPRI to REQ_POLLED
Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig fe45e630a1 block: move integrity handling out of <linux/blkdev.h>
Split the integrity/metadata handling definitions out into a new header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Jens Axboe baa0ab2ba2 nvme fixes for Linux 5.15:
- fix the abort command id (Keith Busch)
  - nvme: fix per-namespace chardev deletion (Adam Manzanares)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmFoKrwLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOCGw//fXeDv+hBmuhfmMX5ac889bAT9iWQcg86jb/2UBEF
 KmtFJI6vQO7J4amCImSEoYD55WZ85ZZo+i+Q61abLep493MXINnrm/ZkNtVp0kbT
 h+pokfxbwo4IKPOCmWqVhXxspaIQL++J6TDMvAzuv4S4YAxeD/W+4bXshMu8E4aD
 XZZo2uj1EVYDeHY1rasCElMAw4DZDdaIL7Zn11qqQJDWK+XwP3YJ773/rXoG3i/j
 Kq0jtB/wWRc3VrherMXtEOz1Eu1MVOhUhr6YF2vitv+qBzmJKMResO1bVP619vqQ
 MoEGZmqUfUxLY4CcWbtV4iIIBK+H3ySaEoHeNLu31aiucRcAWCLIwHLvwrc2cSpP
 BozCuA03iRyGtcLDftfQfD3jyzeB8qmIrOaXd/blBlPAuQXzilF/WyB+IzRXorYs
 kxtwdWLTbh3nWMQrEG9gh6Fx02Hbf+dNdXb/GN0NFDJDD6BfPyAut38lMxdRKMAz
 9q4HfPTszyP78sU54N6qULVDVTJZ8F7sjY90OueZCBT3VjwsyCd7eO4sti1Avu4Y
 KmWxKfKQyGCmlls0oIrCFm7A0tjHFJhCdoNN0w3DmpMy6GN0cBE3aI3WAb1MAWan
 Qp6eI9DXuHwsLW8fqLv9+lWP3BI59qj2ClSy3OBGqAWiW1ifyq4cPUOR80vh22c2
 EHA=
 =eQZ/
 -----END PGP SIGNATURE-----

Merge tag 'nvme-5.15-2021-10-14' of git://git.infradead.org/nvme into block-5.15

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.15:

 - fix the abort command id (Keith Busch)
 - nvme: fix per-namespace chardev deletion (Adam Manzanares)"

* tag 'nvme-5.15-2021-10-14' of git://git.infradead.org/nvme:
  nvme: fix per-namespace chardev deletion
  nvme-pci: Fix abort command id
2021-10-14 09:07:14 -06:00
Adam Manzanares be5eb93354 nvme: fix per-namespace chardev deletion
Decrease reference count of chardevice during char device deletion in
order to fix a memory leak.  Add a release callabck for the device
associated chardev and move ida_simple_remove into the release function.

Fixes: 2637baed78 ("nvme: introduce generic per-namespace chardev")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Javier González <javier@javigon.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-14 08:07:47 +02:00
Keith Busch 85f74acf09 nvme-pci: Fix abort command id
The request tag is no longer the only component of the command id.

Fixes: e7006de6c2 ("nvme: code command_id with a genctr for use-after-free validation")
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2021-10-07 08:02:06 -07:00
Keith Busch a2941f6aa7 nvme: add command id quirk for apple controllers
Some apple controllers use the command id as an index to implementation
specific data structures and will fail if the value is out of bounds.
The nvme driver's recently introduced command sequence number breaks
this controller.

Provide a quirk so these spec incompliant controllers can function as
before. The driver will not have the ability to detect bad completions
when this quirk is used, but we weren't previously checking this anyway.

The quirk bit was selected so that it can readily apply to stable.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214509
Cc: Sven Peter <sven@svenpeter.dev>
Reported-by: Orlando Chamberlain <redecorating@protonmail.com>
Reported-by: Aditya Garg <gargaditya08@live.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/20210927154306.387437-1-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-27 10:02:07 -06:00
Christoph Hellwig 298ba0e3d4 nvme: keep ctrl->namespaces ordered
Various places in the nvme code that rely on ctrl->namespace to be
ordered.  Ensure that the namespae is inserted into the list at the
right position from the start instead of sorting it after the fact.

Fixes: 540c801c65 ("NVMe: Implement namespace list scanning")
Reported-by: Anton Eidelman <anton.eidelman@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-09-21 09:17:15 +02:00
Sagi Grimberg e371af033c nvme-tcp: fix incorrect h2cdata pdu offset accounting
When the controller sends us multiple r2t PDUs in a single
request we need to account for it correctly as our send/recv
context run concurrently (i.e. we get a new r2t with r2t_offset
before we updated our iterator and req->data_sent marker). This
can cause wrong offsets to be sent to the controller.

To fix that, we will first know that this may happen only in
the send sequence of the last page, hence we will take
the r2t_offset to the h2c PDU data_offset, and in
nvme_tcp_try_send_data loop, we make sure to increment
the request markers also when we completed a PDU but
we are expecting more r2t PDUs as we still did not send
the entire data of the request.

Fixes: 825619b09a ("nvme-tcp: fix possible use-after-completion")
Reported-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
Tested-by: Nowak, Lukasz <Lukasz.Nowak@Dell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-21 09:17:15 +02:00
James Smart bdaa136566 nvme-fc: remove freeze/unfreeze around update_nr_hw_queues
Remove the freeze/unfreeze around changes to the number of hardware
queues. Study and retest has indicated there are no ios that can be
active at this point so there is nothing to freeze.

nvme-fc is draining the queues in the shutdown and error recovery path
in __nvme_fc_abort_outstanding_ios.

This patch primarily reverts 88e837ed0f "nvme-fc: wait for queues to
freeze before calling update_hr_hw_queues". It's not an exact revert as
it leaves the adjusting of hw queues only if the count changes.

Signed-off-by: James Smart <jsmart2021@gmail.com>
[dwagner: added explanation why no IO is pending]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-21 09:17:12 +02:00
James Smart e5445dae29 nvme-fc: avoid race between time out and tear down
To avoid race between time out and tear down, in tear down process,
first we quiesce the queue, and then delete the timer and cancel
the time out work for the queue.

This patch merges the admin and io sync ops into the queue teardown logic
as shown in the RDMA patch 3017013dcc "nvme-rdma: avoid race between time
out and tear down". There is no teardown_lock in nvme-fc.

Signed-off-by: James Smart <jsmart2021@gmail.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-21 09:17:12 +02:00
Daniel Wagner 555f66d0f8 nvme-fc: update hardware queues before using them
In case the number of hardware queues changes, we need to update the
tagset and the mapping of ctx to hctx first.

If we try to create and connect the I/O queues first, this operation
will fail (target will reject the connect call due to the wrong number
of queues) and hence we bail out of the recreate function. Then we
will to try the very same operation again, thus we don't make any
progress.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-21 09:17:09 +02:00
Jens Axboe 65ed1e692f nvme fixes for Linux 5.15
- fix ANA state updates when a namespace is not present (Anton Eidelman)
  - nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show
    (Dan Carpenter)
  - avoid race in shutdown namespace removal (Daniel Wagner)
  - fix io_work priority inversion in nvme-tcp (Keith Busch)
  - destroy cm id before destroy qp to avoid use after free (Ruozhu Li)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmFB+aQLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOfPg//fREqsvY1eMkexP/fOyrpVpLH1lLLM2SHUYwyQOtg
 4CVxl2KU8YGyGLGpAMO6ToctrNV83oUQUQlgy7PyQQpUzr6m4sJm7Lbxzvqhyh96
 MKeHRcWo173twiCLICx4GADSmlwofmwUSzBXdhEtt1gNmg26T+nnDewHpSqLY5Ax
 BNAtFkAGTZodupVKUNsI+tqAyXY3Bgk069YqDiZfBCqU62JzLj6QX0aVXcpS2kcv
 bVpJ0kbnuoUkHG/eIrLZeiyvikxQWx1KaLaL3Y24wrdkVWmZJpCYbg/gXq5v9GUQ
 U8yq/iuOSTtfE0+7hNCyNz8tNLfzxejd0jDCfQcPwXOuzIfNkdzdqhkhste2HYB5
 uriDj1eTQq1siH6kKu8oYQBMSxMSw9wM5q71d5QX2XJzw3hl9bqw5GqiT8Gjk1pX
 K759Tj+IMFNa1SqUGI5IBfkHz4vvuwOr+pO60fwd3CzajOI8mNnNAuJ+1I2UJaEP
 cgq+coD950BsWCyhL+XO/NhYPUSugI331JQqlIkVPPYxqBCJK22c7rZusM0Sm4Gq
 aweWKIxtI9cbsrlf6WfJdyHf5clZ9AjMmWEzHFbDeGqUhb3pTFoT5H/UqSw9/6xe
 pi7iKS6lBoZAoWKfyt9bOKfmaDlcrylAxPQ/LfyJGSTy5Buj5bcc4FO5rRJ+BIay
 t88=
 =wad8
 -----END PGP SIGNATURE-----

Merge tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme into block-5.15

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.15

 - fix ANA state updates when a namespace is not present (Anton Eidelman)
 - nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show
   (Dan Carpenter)
 - avoid race in shutdown namespace removal (Daniel Wagner)
 - fix io_work priority inversion in nvme-tcp (Keith Busch)
 - destroy cm id before destroy qp to avoid use after free (Ruozhu Li)"

* tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme:
  nvme-tcp: fix io_work priority inversion
  nvme-rdma: destroy cm id before destroy qp to avoid use after free
  nvme-multipath: fix ANA state updates when a namespace is not present
  nvme: avoid race in shutdown namespace removal
  nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show()
2021-09-15 07:53:32 -06:00
Christoph Hellwig 9da4c7276e nvme: remove the call to nvme_update_disk_info in nvme_ns_remove
There is no need to explicitly unregister the integrity profile when
deleting the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14 20:03:30 -06:00
Keith Busch 70f437fb43 nvme-tcp: fix io_work priority inversion
Dispatching requests inline with the .queue_rq() call may block while
holding the send_mutex. If the tcp io_work also happens to schedule, it
may see the req_list is non-empty, leaving "pending" true and remaining
in TASK_RUNNING. Since io_work is of higher scheduling priority, the
.queue_rq task may not get a chance to run, blocking forward progress
and leading to io timeouts.

Instead of checking for pending requests within io_work, let the queueing
restart io_work outside the send_mutex lock if there is more work to be
done.

Fixes: a0fdd14180 ("nvme-tcp: rerun io_work if req_list is not empty")
Reported-by: Samuel Jones <sjones@kalrayinc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-14 10:32:05 +02:00
Ruozhu Li 9817d763db nvme-rdma: destroy cm id before destroy qp to avoid use after free
We should always destroy cm_id before destroy qp to avoid to get cma
event after qp was destroyed, which may lead to use after free.
In RDMA connection establishment error flow, don't destroy qp in cm
event handler.Just report cm_error to upper level, qp will be destroy
in nvme_rdma_alloc_queue() after destroy cm id.

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-14 10:32:04 +02:00
Anton Eidelman 79f528afa9 nvme-multipath: fix ANA state updates when a namespace is not present
nvme_update_ana_state() has a deficiency that results in a failure to
properly update the ana state for a namespace in the following case:

  NSIDs in ctrl->namespaces:	1, 3,    4
  NSIDs in desc->nsids:		1, 2, 3, 4

Loop iteration 0:
    ns index = 0, n = 0, ns->head->ns_id = 1, nsid = 1, MATCH.
Loop iteration 1:
    ns index = 1, n = 1, ns->head->ns_id = 3, nsid = 2, NO MATCH.
Loop iteration 2:
    ns index = 2, n = 2, ns->head->ns_id = 4, nsid = 4, MATCH.

Where the update to the ANA state of NSID 3 is missed.  To fix this
increment n and retry the update with the same ns when ns->head->ns_id is
higher than nsid,

Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-09-14 10:32:04 +02:00
Daniel Wagner 9edceaf430 nvme: avoid race in shutdown namespace removal
When we remove the siblings entry, we update ns->head->list, hence we
can't separate the removal and test for being empty. They have to be
in the same critical section to avoid a race.

To avoid breaking the refcounting imbalance again, add a list empty
check to nvme_find_ns_head.

Fixes: 5396fdac56 ("nvme: fix refcounting imbalance when all paths are down")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-13 17:04:04 +02:00
Luis Chamberlain ab3994f6ef nvme: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 10:08:09 +02:00
Daniel Wagner 041bd1a1fc nvme: only call synchronize_srcu when clearing current path
The function nmve_mpath_clear_current_path returns true if the current
path has changed. In this case we have to wait for all concurrent
submissions to finish. But if we didn't change the current path, there
is no point in waiting for another RCU period to finish.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 10:03:11 +02:00
Tatsuya Sasaki b58da2d270 nvme: update keep alive interval when kato is modified
Currently the connection between host and NVMe-oF target gets
disconnected by keep-alive timeout when a user connects to a target
with a relatively large kato value and then sets the smaller kato
with a set features command (e.g. connects with 60 seconds kato value
and then sets 10 seconds kato value).

The cause is that keep alive command interval on the host, which is
defined as unsigned int kato in nvme_ctrl structure, does not follow
the kato value changes.

This patch updates the keep alive interval in the following steps when
the kato is modified by a set features command: stops the keep alive
work queue, then sets the kato as new timer value and re-start the queue.

Signed-off-by: Tatsuya Sasaki <tatsuya6.sasaki@kioxia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 10:03:11 +02:00
Daniel Wagner 1ba2e507f5 nvme-tcp: Do not reset transport on data digest errors
The spec says

  7.4.6.1 Digest Error handling

  When a host detects a data digest error in a C2HData PDU, that host
  shall continue processing C2HData PDUs associated with the command and
  when the command processing has completed, if a successful status was
  returned by the controller, the host shall fail the command with a
  non-fatal transport error.

Currently the transport is reseted when a data digest error is
detected. Instead, when a digest error is detected, mark the final
status as NVME_SC_DATA_XFER_ERROR and let the upper layer handle
the error.

In order to keep track of the final result maintain a status field in
nvme_tcp_request object and use it to overwrite the completion queue
status (which might be successful even though a digest error has been
detected) when completing the request.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 10:03:11 +02:00
Adam Manzanares 43dc987828 nvme: move nvme_multi_css into nvme.h
Preparatory patch in order to reuse nvme_multi_css in the nvme target
code.

Signed-off-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 09:59:02 +02:00
Hannes Reinecke e7d65803e2 nvme-multipath: revalidate paths during rescan
When triggering a rescan due to a namespace resize we will be
receiving AENs on every controller, triggering a rescan of all
attached namespaces. If multipath is active only the current path and
the ns_head disk will be updated, the other paths will still refer to
the old size until AENs for the remaining controllers are received.

If I/O comes in before that it might be routed to one of the old
paths, triggering an I/O failure with 'access beyond end of device'.
With this patch the old paths are skipped from multipath path
selection until the controller serving these paths has been rescanned.

Signed-off-by: Hannes Reinecke <hare@suse.de>
[dwagner: - introduce NVME_NS_READY flag instead of NVME_NS_INVALIDATE
          - use 'revalidate' instead of 'invalidate' which
	    follows the zoned device code path.
	  - clear NVME_NS_READY before clearing current_path]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 09:59:02 +02:00
Christoph Hellwig d32d3d0b47 nvme-multipath: set QUEUE_FLAG_NOWAIT
The nvme multipathing code just dispatches bios to one of the blk-mq
based paths and never blocks on its own, so set QUEUE_FLAG_NOWAIT
to support REQ_NOWAIT bios.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-09-06 09:59:02 +02:00
Linus Torvalds 9a1d6c9e3f for-5.15/drivers-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs6LsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqnLD/9c8v7WTLjrDR6FLD8fHUmkwk9ss6OeyYJC
 Z62QOyk6BqNOu6FAwBax9wFaboXdUqOdpJU0PVQ7WJ5wBiCQ9DAZY6T+iwW0jE79
 +iOSqdXHVLAIyIM9GplzLH5AH3tx4445bX7fRWwWX1OgmSidkAhb25FusCvpcpHx
 1k+9dSLClLeHPR6jVT3k6tHv2RzPSw+/vYOggeWYA0YYPfoCx/Ft0uwO+PjKpvLQ
 Je5jASlLGYCXazswJBZgfjbroA97EuaLOmHHIHrwhkkFsbV6ewv6mlmanbMEs4fX
 Wh+axTt8so27g6gbw31EOcGsxTi0B37Jx9MOrSla6NdJoZkFE2sn6K+D5k4oeSrg
 QgYXL00U62eSgWmgSB0f0X081cQfI+FUMe5u5S368WdrgCPfaXl11zHw8nXw8gEW
 UvqR4zr3hQd4piXsIWl2bwZrmpPBCeB8iStLq3C92RLPFT6hJO3GM/ZmwTn+0HT0
 lMXzoEdkPywkKWi8aBbSgzXiGknNl8HAYnwMhcQjiHbYQOycGkI9pigJDNY9Ox1l
 fYHFSompmJ/XK8cIiU7QIglXEXJky5jQ89Ni0ryCstOaP20tPxWtkpOCgidXfNGz
 4lmQV8D5aBTUFs6ifPjXfiXUmDiU3SaxiFhAqaEkGII9BbkrNhlibB4LBAU+toi1
 Q0yGhGR/mg==
 =4uWF
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Sitting on top of the core block changes, here are the driver changes
  for the 5.15 merge window:

   - NVMe updates via Christoph:
       - suspend improvements for devices with an HMB (Keith Busch)
       - handle double completions more gacefull (Sagi Grimberg)
       - cleanup the selects for the nvme core code a bit (Sagi Grimberg)
       - don't update queue count when failing to set io queues (Ruozhu Li)
       - various nvmet connect fixes (Amit Engel)
       - cleanup lightnvm leftovers (Keith Busch, me)
       - small cleanups (Colin Ian King, Hou Pu)
       - add tracing for the Set Features command (Hou Pu)
       - CMB sysfs cleanups (Keith Busch)
       - add a mutex_destroy call (Keith Busch)

   - remove lightnvm subsystem. It's served its purpose and ultimately
     led to zoned nvme support, we no longer need it (Christoph)

   - revert floppy O_NDELAY fix (Denis)

   - nbd fixes (Hou, Pavel, Baokun)

   - nbd locking fixes (Tetsuo)

   - nbd device removal fixes (Christoph)

   - raid10 rcu warning fix (Xiao)

   - raid1 write behind fix (Guoqing)

   - rnbd fixes (Gioh, Md Haris)

   - misc fixes (Colin)"

* tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block: (42 commits)
  Revert "floppy: reintroduce O_NDELAY fix"
  raid1: ensure write behind bio has less than BIO_MAX_VECS sectors
  md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard
  nbd: remove nbd->destroy_complete
  nbd: only return usable devices from nbd_find_unused
  nbd: set nbd->index before releasing nbd_index_mutex
  nbd: prevent IDR lookups from finding partially initialized devices
  nbd: reset NBD to NULL when restarting in nbd_genl_connect
  nbd: add missing locking to the nbd_dev_add error path
  nvme: remove the unused NVME_NS_* enum
  nvme: remove nvm_ndev from ns
  nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers
  block: nbd: add sanity check for first_minor
  nvmet: check that host sqsize does not exceed ctrl MQES
  nvmet: avoid duplicate qid in connect cmd
  nvmet: pass back cntlid on successful completion
  nvme-rdma: don't update queue count when failing to set io queues
  nvme-tcp: don't update queue count when failing to set io queues
  nvme-tcp: pair send_mutex init with destroy
  nvme: allow user toggling hmb usage
  ...
2021-08-30 19:01:46 -07:00
Christoph Hellwig 5f432cceb3 nvme: use blk_mq_alloc_disk
Switch to use the blk_mq_alloc_disk helper for allocating the
request_queue and gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210816131910.615153-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 12:54:30 -06:00
Christoph Hellwig 9891668e43 nvme: remove the unused NVME_NS_* enum
These values are unused now that the lightnvm support is gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-08-17 06:20:17 +02:00
Keith Busch 77979058df nvme: remove nvm_ndev from ns
Now that the lightnvm driver is removed, we don't need a pointer to it's
now non-existent struct.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 21:03:18 +02:00
Sagi Grimberg 0866200ed7 nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers
Transport drivers need both core and fabrics modules, instead of
selecting both, have the selection transitive such that NVME_FABRICS
selects NVME_CORE and transport drivers select NVME_FABRICS.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 21:03:18 +02:00
Christoph Hellwig 3973e15fa5 nvme: use bvec_virt
Use bvec_virt instead of open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20210804095634.460779-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16 10:50:33 -06:00
Ruozhu Li 85032874f8 nvme-rdma: don't update queue count when failing to set io queues
We update ctrl->queue_count and schedule another reconnect when io queue
count is zero.But we will never try to create any io queue in next reco-
nnection, because ctrl->queue_count already set to zero.We will end up
having an admin-only session in Live state, which is exactly what we try
to avoid in the original patch.
Update ctrl->queue_count after queue_count zero checking to fix it.

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Ruozhu Li 664227fde6 nvme-tcp: don't update queue count when failing to set io queues
We update ctrl->queue_count and schedule another reconnect when io queue
count is zero.But we will never try to create any io queue in next reco-
nnection, because ctrl->queue_count already set to zero.We will end up
having an admin-only session in Live state, which is exactly what we try
to avoid in the original patch.
Update ctrl->queue_count after queue_count zero checking to fix it.

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Keith Busch d48f92cd27 nvme-tcp: pair send_mutex init with destroy
Each mutex_init() should have a corresponding mutex_destroy().

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Keith Busch a5df5e79c4 nvme: allow user toggling hmb usage
The NVMe host memory buffer may consume a non-negligable amount of
memory. Controllers are required to function without the host memory
buffer enabled, but with possibly degraded performance. Export a sysfs
property to toggle this feature on a per-device granularity so users may
choose to reclaim memory at the expense of storage performance.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Keith Busch e5ad96f388 nvme-pci: disable hmb on idle suspend
An idle suspend may or may not disable host memory access from devices
placed in low power mode. Either way, it should always be safe to
disable the host memory buffer prior to entering the low power mode, and
this should also always be faster than a full device shutdown.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Hou Pu a7b5e8d864 nvme: add set feature tracing support
A nvme connect command produces following trace.

Before:
/sys/kernel/debug/tracing# cat trace | grep feature
    kworker/5:1H-98      [005] ....  3221.294844: nvme_setup_cmd: nvme0: qid=0, cmdid=25, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
    kworker/4:1H-124     [004] ....  3222.009186: nvme_setup_cmd: nvme0: qid=0, cmdid=17, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)

After:
/sys/kernel/debug/tracing# cat trace | grep feature
    kworker/0:1H-253     [000] ....   196.060509: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0x7, sv=0x0, cdw11=0x70007)
    kworker/0:1H-253     [000] ....   196.763947: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0xb, sv=0x0, cdw11=0x900)

Using ',' to separate different field like others in
nvmet_trace_admin_get_features.

Signed-off-by: Hou Pu <houpu.main@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:23 +02:00
Hou Pu e23439e977 nvme-fabrics: remove superfluous nvmf_host_put in nvmf_parse_options
Opts->host is NULL there. It is checked just before. So remove
nvmf_host_put. It is introduced by commit 59a2f3f00f ("nvme: fix
potential memory leak in option parsing").

Signed-off-by: Hou Pu <houpu.main@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:23 +02:00
Keith Busch 1751e97aa9 nvme-pci: cmb sysfs: one file, one value
An attribute should only be exporting one value as recommended in
Documentation/filesystems/sysfs.rst. Implement CMB attributes this way.
The old attribute will remain for backward compatibility.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:23 +02:00
Keith Busch 0521905e85 nvme-pci: use attribute group for cmb sysfs
Appending sysfs files to the controller kobject is a bit clunky and
becomes a maintenance problem as more attributes are added. The
attribute group infrastructure handles this better, so use that.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Sagi Grimberg e7006de6c2 nvme: code command_id with a genctr for use-after-free validation
We cannot detect a (perhaps buggy) controller that is sending us
a completion for a request that was already completed (for example
sending a completion twice), this phenomenon was seen in the wild
a few times.

So to protect against this, we use the upper 4 msbits of the nvme sqe
command_id to use as a 4-bit generation counter and verify it matches
the existing request generation that is incrementing on every execution.

The 16-bit command_id structure now is constructed by:
| xxxx | xxxxxxxxxxxx |
  gen    request tag

This means that we are giving up some possible queue depth as 12 bits
allow for a maximum queue depth of 4095 instead of 65536, however we
never create such long queues anyways so no real harm done.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Sagi Grimberg 3b01a9d0ca nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data
We already validate it when receiving the c2hdata pdu header
and this is not changing so this is a redundant check.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Sagi Grimberg 27453b45e6 nvme-pci: limit maximum queue depth to 4095
We are going to use the upper 4-bits of the command_id for a generation
counter, so enforce the new queue depth upper limit. As we enforce
both min and max queue depth, use param_set_uint_minmax istead of
open coding it.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Christoph Hellwig 9ea9b9c483 remove the lightnvm subsystem
Lightnvm supports the OCSSD 1.x and 2.0 specs which were early attempts
to produce Open Channel SSDs and never made it into the NVMe spec
proper.  They have since been superceeded by NVMe enhancements such
as ZNS support.  Remove the support per the deprecation schedule.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210812132308.38486-1-hch@lst.de
Reviewed-by: Matias Bjørling <mb@lightnvm.io>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-14 15:54:09 -06:00
Christoph Hellwig 50b4aecfbb block: remove GENHD_FL_UP
Just check inode_unhashed on the whole device bdev inode instead,
and provide a helper to check for that information.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12 10:29:36 -06:00
Christoph Hellwig 916a470da0 nvme: replace the GENHD_FL_UP check in nvme_mpath_shutdown_disk
Use the nvme-internal NVME_NSHEAD_DISK_LIVE flag instead of abusing
the block layer state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12 10:29:36 -06:00
Christoph Hellwig 5eba200526 nvme: remove the GENHD_FL_UP check in nvme_ns_remove
Early probe failure never reaches nvme_ns_remove, so GENHD_FL_UP must
be set at this point.  Remove the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12 10:29:36 -06:00
Christoph Hellwig 471aa704db block: pass a gendisk to blk_queue_update_readahead
.. and rename the function to disk_update_readahead.  This is in
preparation for moving the BDI from the request_queue to the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 11:52:28 -06:00
Linus Torvalds 4d4a60cede block-5.14-2021-07-24
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmD8NkkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprfxEACKUWYgb6OSc121lltJTFTZ8LDLxKLn4dNt
 5SREbhPdUXKZ64fTo6dKx3ZQ0SZkY+90aTOQQClSNps3/hsIuGsXcLnaPBgEET21
 GgCaJfT1B4i/NCjguwboprGPQjETLUTXHri8e/5C8OBgFSL4Q62Cvsl2l5U1V9rP
 v4O21TazHNVWnV60uJ4FZvQ56zU/mlvU77eWnfa1MJQaupl1atUwhDNTidDNVMYN
 v69nJqxgYIo5pvNVkMwHxWcp6Ckkwj5GaxptcJ5EPBU69sImBCZ8fpYVrrS3NzTL
 82Xf/HFVmJdegRqk0fctWGOfkorBl1ah/+PhL69In/6y6lHOs9y9+lTHoWq/+z8z
 cPj/Ru6440Gf5D81rzbJRmCKNE4k2ToYnvRTHj2kUSRoAxDih0S3axZMF7m9eEnQ
 DNOG29bsskJhYPx/0HxffCTAsKf2iPlMCRQFTqR296F5Hbg1eMo9rDSXY5JCtlNU
 jOKKPmBM5fZWn19QIvjlxICsS3TRRun4jpcpbQZKoY1ItuN9x5ANmlr54K7T4gD3
 RYXZ51dPDaEFQGO0bwKlRj6w45F7wZb7HBNufXl2WsuSbM88sGEXUkYlBQ55xjJK
 Y8+amkzA2XUorQIrfsegzkCw67CuCB7Pg2IQWoiyFLagBTfoX8QJPslRRwUFiBav
 QZi3lITl5Q==
 =uzwV
 -----END PGP SIGNATURE-----

Merge tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request (Christoph):
    - tracing fix (Keith Busch)
    - fix multipath head refcounting (Hannes Reinecke)
    - Write Zeroes vs PI fix (me)
    - drop a bogus WARN_ON (Zhihao Cheng)

 - Increase max blk-cgroup policy size, now that mq-deadline
   uses it too (Oleksandr)

* tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block:
  nvme: set the PRACT bit when using Write Zeroes with T10 PI
  nvme: fix nvme_setup_command metadata trace event
  nvme: fix refcounting imbalance when all paths are down
  nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
  block: increase BLKCG_MAX_POLS
2021-07-24 12:57:06 -07:00
Christoph Hellwig aaeb7bb061 nvme: set the PRACT bit when using Write Zeroes with T10 PI
When using Write Zeroes on a namespace that has protection
information enabled they behavior without the PRACT bit
counter-intuitive and will generally lead to validation failures
when reading the written blocks.  Fix this by always setting the
PRACT bit that generates matching PI data on the fly.

Fixes: 6e02318eae ("nvme: add support for the Write Zeroes command")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-21 17:24:10 +02:00
Keith Busch 234211b8dd nvme: fix nvme_setup_command metadata trace event
The metadata address is set after the trace event, so the trace is not
capturing anything useful. Rather than logging the memory address, it's
useful to know if the command carries a metadata payload, so change the
trace event to log that true/false state instead.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-07-21 09:55:44 +02:00
Hannes Reinecke 5396fdac56 nvme: fix refcounting imbalance when all paths are down
When the last path to a ns_head drops the current code
removes the ns_head from the subsystem list, but will only
delete the disk itself if the last reference to the ns_head
drops. This is causing an refcounting imbalance eg when
applications have a reference to the disk, as then they'll
never get notified that the disk is in fact dead.
This patch moves the call 'del_gendisk' into nvme_mpath_check_last_path(),
ensuring that the disk can be properly removed and applications get the
appropriate notifications.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-07-21 09:55:40 +02:00
Zhihao Cheng 7764656b10 nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
Followling process:
nvme_probe
  nvme_reset_ctrl
    nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)
    queue_work(nvme_reset_wq, &ctrl->reset_work)

-------------->	nvme_remove
		  nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING)
worker_thread
  process_one_work
    nvme_reset_work
    WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)

, which will trigger WARN_ON in nvme_reset_work():
[  127.534298] WARNING: CPU: 0 PID: 139 at drivers/nvme/host/pci.c:2594
[  127.536161] CPU: 0 PID: 139 Comm: kworker/u8:7 Not tainted 5.13.0
[  127.552518] Call Trace:
[  127.552840]  ? kvm_sched_clock_read+0x25/0x40
[  127.553936]  ? native_send_call_func_single_ipi+0x1c/0x30
[  127.555117]  ? send_call_function_single_ipi+0x9b/0x130
[  127.556263]  ? __smp_call_single_queue+0x48/0x60
[  127.557278]  ? ttwu_queue_wakelist+0xfa/0x1c0
[  127.558231]  ? try_to_wake_up+0x265/0x9d0
[  127.559120]  ? ext4_end_io_rsv_work+0x160/0x290
[  127.560118]  process_one_work+0x28c/0x640
[  127.561002]  worker_thread+0x39a/0x700
[  127.561833]  ? rescuer_thread+0x580/0x580
[  127.562714]  kthread+0x18c/0x1e0
[  127.563444]  ? set_kthread_struct+0x70/0x70
[  127.564347]  ret_from_fork+0x1f/0x30

The preceding problem can be easily reproduced by executing following
script (based on blktests suite):
test() {
  pdev="$(_get_pci_dev_from_blkdev)"
  sysfs="/sys/bus/pci/devices/${pdev}"
  for ((i = 0; i < 10; i++)); do
    echo 1 > "$sysfs/remove"
    echo 1 > /sys/bus/pci/rescan
  done
}

Since the device ctrl could be updated as an non-RESETTING state by
repeating probe/remove in userspace (which is a normal situation), we
can replace stack dumping WARN_ON with a warnning message.

Fixes: 82b057caef ("nvme-pci: fix multiple ctrl removal schedulin")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
2021-07-21 09:55:40 +02:00
Linus Torvalds 0d18c12b28 block-5.14-2021-07-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDxlZAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjrgD/49ZEQ6XU6EsEAdcm6A4vZDLw4i5g6ZdyOE
 +SNyn/vgAP0q4g/C8R0DiC4/kubZrMqWnXvmAWdbODnTUWi2xvejwsSMtaK2Plyw
 WxOY/AI1SZUOUDKuibZBCDQMJ95mI+v0aEZyvgYjULzw+2D9X3wVYhsjbvQFsW+A
 CdmjkQGtLWiGkEWDWRunn3XQvKvRQBrb89yv03VJF9RmkgbMkWKPAwDN9vcwtYXY
 OLgLApbeLeEtqNU509NqdBrvPcD6E3gnxbWYSl5XinzI8MmglfHzQBuwB38lDmGS
 +oSntoqDQw1FRc5D/X196ToIGKKGC+0rICVDSctpdGUbcYCNwc8Jj6ZZXD3x9Y/W
 WA7gprHPyBQvpcKiL82TNffXk29ZHzbIShJVmMubjZfEGlUu9m7ZHcssYJUja/zu
 XHVlE4GBCFq/cTE40c7sUImwHIofdFHZr8EdyGTgBKNSsqmPtoow/PCL15IbjZeK
 2AZVG+HP5PtDw7PbZlHECaUJV/AE8vb4MK2PA3cftbW+fWqfVsxvmzdWA1GmKStG
 uw/FWbW0L9y09vHJRpDqp+oeLF6qxIPqMRrgJG5zZJvai9hqMFg0HTWjmUMyGLKv
 QziEiDM44+wkLr0479GElHHLAMscPsbzBA4kEVJPWlIvLSNKZnQq/xbh/Apq6yXR
 ORo+S1biZg==
 =Zeni
 -----END PGP SIGNATURE-----

Merge tag 'block-5.14-2021-07-16' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe fixes via Christoph:
      - fix various races in nvme-pci when shutting down just after
        probing (Casey Chen)
      - fix a net_device leak in nvme-tcp (Prabhakar Kushwaha)

 - Fix regression in xen-blkfront by cleaning up the removal state
   machine (Christoph)

 - Fix tag_set and queue cleanup ordering regression in nbd (Wang)

 - Fix tag_set and queue cleanup ordering regression in pd (Guoqing)

* tag 'block-5.14-2021-07-16' of git://git.kernel.dk/linux-block:
  xen-blkfront: sanitize the removal state machine
  nbd: fix order of cleaning up the queue and freeing the tagset
  pd: fix order of cleaning up the queue and freeing the tagset
  nvme-pci: do not call nvme_dev_remove_admin from nvme_remove
  nvme-pci: fix multiple races in nvme_setup_io_queues
  nvme-tcp: use __dev_get_by_name instead dev_get_by_name for OPT_HOST_IFACE
2021-07-16 12:31:44 -07:00
Casey Chen 251ef6f71b nvme-pci: do not call nvme_dev_remove_admin from nvme_remove
nvme_dev_remove_admin could free dev->admin_q and the admin_tagset
while they are being accessed by nvme_dev_disable(), which can be called
by nvme_reset_work via nvme_remove_dead_ctrl.

Commit cb4bfda62a ("nvme-pci: fix hot removal during error handling")
intended to avoid requests being stuck on a removed controller by killing
the admin queue. But the later fix c8e9e9b764 ("nvme-pci: unquiesce
admin queue on shutdown"), together with nvme_dev_disable(dev, true)
right before nvme_dev_remove_admin() could help dispatch requests and
fail them early, so we don't need nvme_dev_remove_admin() any more.

Fixes: cb4bfda62a ("nvme-pci: fix hot removal during error handling")
Signed-off-by: Casey Chen <cachen@purestorage.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-07-13 12:03:20 +02:00
Casey Chen e4b9852a0f nvme-pci: fix multiple races in nvme_setup_io_queues
Below two paths could overlap each other if we power off a drive quickly
after powering it on. There are multiple races in nvme_setup_io_queues()
because of shutdown_lock missing and improper use of NVMEQ_ENABLED bit.

nvme_reset_work()                                nvme_remove()
  nvme_setup_io_queues()                           nvme_dev_disable()
  ...                                              ...
A1  clear NVMEQ_ENABLED bit for admin queue          lock
    retry:                                       B1  nvme_suspend_io_queues()
A2    pci_free_irq() admin queue                 B2  nvme_suspend_queue() admin queue
A3    pci_free_irq_vectors()                         nvme_pci_disable()
A4    nvme_setup_irqs();                         B3    pci_free_irq_vectors()
      ...                                            unlock
A5    queue_request_irq() for admin queue
      set NVMEQ_ENABLED bit
      ...
      nvme_create_io_queues()
A6      result = queue_request_irq();
        set NVMEQ_ENABLED bit
      ...
      fail to allocate enough IO queues:
A7      nvme_suspend_io_queues()
        goto retry

If B3 runs in between A1 and A2, it will crash if irqaction haven't
been freed by A2. B2 is supposed to free admin queue IRQ but it simply
can't fulfill the job as A1 has cleared NVMEQ_ENABLED bit.

Fix: combine A1 A2 so IRQ get freed as soon as the NVMEQ_ENABLED bit
gets cleared.

After solved #1, A2 could race with B3 if A2 is freeing IRQ while B3
is checking irqaction. A3 also could race with B2 if B2 is freeing
IRQ while A3 is checking irqaction.

Fix: A2 and A3 take lock for mutual exclusion.

A3 could race with B3 since they could run free_msi_irqs() in parallel.

Fix: A3 takes lock for mutual exclusion.

A4 could fail to allocate all needed IRQ vectors if A3 and A4 are
interrupted by B3.

Fix: A4 takes lock for mutual exclusion.

If A5/A6 happened after B2/B1, B3 will crash since irqaction is not NULL.
They are just allocated by A5/A6.

Fix: Lock queue_request_irq() and setting of NVMEQ_ENABLED bit.

A7 could get chance to pci_free_irq() for certain IO queue while B3 is
checking irqaction.

Fix: A7 takes lock.

nvme_dev->online_queues need to be protected by shutdown_lock. Since it
is not atomic, both paths could modify it using its own copy.

Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Casey Chen <cachen@purestorage.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-07-13 11:40:42 +02:00
Prabhakar Kushwaha 8b43ced64d nvme-tcp: use __dev_get_by_name instead dev_get_by_name for OPT_HOST_IFACE
dev_get_by_name() finds network device by name but it also increases the
reference count.

If a nvme-tcp queue is present and the network device driver is removed
before nvme_tcp, we will face the following continuous log:

  "kernel:unregister_netdevice: waiting for <eth> to become free. Usage count = 2"

And rmmod further halts. Similar case arises during reboot/shutdown
with nvme-tcp queue present and both never completes.

To fix this, use __dev_get_by_name() which finds network device by
name without increasing any reference counter.

Fixes: 3ede8f72a9 ("nvme-tcp: allow selecting the network interface for connections")
Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
[hch: remove the ->ndev member entirely]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-07-13 11:34:24 +02:00
Linus Torvalds a022f7d575 block-5.14-2021-07-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDnGVYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv6UEAC78zkseI8TmKaowNfkz/+MkP9eSFb1pVn3
 rxpbPOsZompHoZpeWt4oHL+3Rmm3a9iRo/APA2ELas4zvp+Q+6uG7eha2Dc4hUA9
 YgeO4z9YfG8wQNZc3x7bncb6ZwqEE5nnbFe/m25SyrAZVLlZ7FKHxfoZDqjhlGFC
 eLNiYO6vdvwgCoBMcotyCDttrPfEu6947/5vB1zevv57twdQQaEWGUhvyx1XrlDX
 0YD5fmdOjNU2isgxt4xo2Ur2zL6w254/hvj58sV3Z7JfkJpI9DCK+ztKEfzuyEhA
 WYz06rDAT1+1KuVLfowaZ+pYiPPOIsL0+QXI83r3nLaE7WGGlfS8Hmz//1FbziYs
 ZSZI826kEN+/lKeWTcKOOMhmkYyXEFFuQZS34eg9KI4xwML8v+ILlHmcp+tjebw9
 vzNF6f7N2ki+jnyxxyNxeMHxeAMWsqnIRROOhZg6bbs6UVNpDy4qRzpQaDOaJsVe
 uSAQ6PTd/etR9KE+ClhLe6X7Rmp/lfZCPe64wqM/3k1qV2KWhE1fwCQO4c5o1MBN
 rpk3Ef5PZYP3aakCvZnfcjMWlpZNbq/xMc6vPc+yq32akq1t1KbODVBiR5odcH0C
 Gt5N11im50SO06haBt7EOe4JMQLbK5sxG15t4C6mNQZgPegGfaLlVkKpzIkOzUha
 OkRofKMcDA==
 =gHse
 -----END PGP SIGNATURE-----

Merge tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "A combination of changes that ended up depending on both the driver
  and core branch (and/or the IDE removal), and a few late arriving
  fixes. In detail:

   - Fix io ticks wrap-around issue (Chunguang)

   - nvme-tcp sock locking fix (Maurizio)

   - s390-dasd fixes (Kees, Christoph)

   - blk_execute_rq polling support (Keith)

   - blk-cgroup RCU iteration fix (Yu)

   - nbd backend ID addition (Prasanna)

   - Partition deletion fix (Yufen)

   - Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph)

   - Removal of now dead block request types due to IDE removal
     (Christoph)

   - Loop probing and control device cleanups (Christoph)

   - Device uevent fix (Christoph)

   - Misc cleanups/fixes (Tetsuo, Christoph)"

* tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits)
  blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs
  block: fix the problem of io_ticks becoming smaller
  nvme-tcp: can't set sk_user_data without write_lock
  loop: remove unused variable in loop_set_status()
  block: remove the bdgrab in blk_drop_partitions
  block: grab a device refcount in disk_uevent
  s390/dasd: Avoid field over-reading memcpy()
  dasd: unexport dasd_set_target_state
  block: check disk exist before trying to add partition
  ubd: remove dead code in ubd_setup_common
  nvme: use return value from blk_execute_rq()
  block: return errors from blk_execute_rq()
  nvme: use blk_execute_rq() for passthrough commands
  block: support polling through blk_execute_rq
  block: remove REQ_OP_SCSI_{IN,OUT}
  block: mark blk_mq_init_queue_data static
  loop: rewrite loop_exit using idr_for_each_entry
  loop: split loop_lookup
  loop: don't allow deleting an unspecified loop device
  loop: move loop_ctl_mutex locking into loop_add
  ...
2021-07-09 12:05:33 -07:00
Linus Torvalds bd31b9efbf SCSI misc on 20210702
This series consists of the usual driver updates (ufs, ibmvfc,
 megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with
 elx and mpi3mr being new drivers.  The major core change is a rework
 to drop the status byte handling macros and the old bit shifted
 definitions and the rest of the updates are minor fixes.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYN7I6iYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishXpRAQCkngYZ
 35yQrqOxgOk2pfrysE95tHrV1MfJm2U49NFTwAEAuZutEvBUTfBF+sbcJ06r6q7i
 H0hkJN/Io7enFs5v3WA=
 =zwIa
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This series consists of the usual driver updates (ufs, ibmvfc,
  megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with
  elx and mpi3mr being new drivers.

  The major core change is a rework to drop the status byte handling
  macros and the old bit shifted definitions and the rest of the updates
  are minor fixes"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (287 commits)
  scsi: aha1740: Avoid over-read of sense buffer
  scsi: arcmsr: Avoid over-read of sense buffer
  scsi: ips: Avoid over-read of sense buffer
  scsi: ufs: ufs-mediatek: Add missing of_node_put() in ufs_mtk_probe()
  scsi: elx: libefc: Fix IRQ restore in efc_domain_dispatch_frame()
  scsi: elx: libefc: Fix less than zero comparison of a unsigned int
  scsi: elx: efct: Fix pointer error checking in debugfs init
  scsi: elx: efct: Fix is_originator return code type
  scsi: elx: efct: Fix link error for _bad_cmpxchg
  scsi: elx: efct: Eliminate unnecessary boolean check in efct_hw_command_cancel()
  scsi: elx: efct: Do not use id uninitialized in efct_lio_setup_session()
  scsi: elx: efct: Fix error handling in efct_hw_init()
  scsi: elx: efct: Remove redundant initialization of variable lun
  scsi: elx: efct: Fix spelling mistake "Unexected" -> "Unexpected"
  scsi: lpfc: Fix build error in lpfc_scsi.c
  scsi: target: iscsi: Remove redundant continue statement
  scsi: qla4xxx: Remove redundant continue statement
  scsi: ppa: Switch to use module_parport_driver()
  scsi: imm: Switch to use module_parport_driver()
  scsi: mpt3sas: Fix error return value in _scsih_expander_add()
  ...
2021-07-02 15:14:36 -07:00
Keith Busch ae5e6886b4 nvme: use return value from blk_execute_rq()
We don't have an nvme status to report if the driver's .queue_rq()
returns an error without dispatching the requested nvme command. Check
the return value from blk_execute_rq() for all passthrough commands so
the caller may know their command was not successful.

If the command is from the target passthrough interface and fails to
dispatch, synthesize the response back to the host as a internal target
error.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210610214437.641245-5-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 15:35:45 -06:00
Keith Busch be42a33b92 nvme: use blk_execute_rq() for passthrough commands
The generic blk_execute_rq() knows how to handle polled completions. Use
that instead of implementing an nvme specific handler.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210610214437.641245-3-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 15:35:38 -06:00
Linus Torvalds 440462198d for-5.14/drivers-2021-06-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbd5UQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvsNEADCJKP81boFzRcdJo7EqaNDAzZyKOIg9Oq7
 4GZE0Wm0SgA6+04bKrNVd9KLcKvQ+NC1pK7UJemSSH2y9ir+zHfyYgAV0/+wFmYm
 NgHlDjBvf80XSI5wezcb6MxZT+R7IaIpDsW1ZvV9hFtPSncn5o2OIWiSdJtHT/Rv
 enlgZPc7OwNWoVMX8eR58IoO0k3S6GLpctUZHt/AUukaKgoOks0X523qhEPf3Upr
 RkbIZuqLWVgpdT6457iSE/OijUczD4thTI8bdprxzhgimOm2vV52sO6F5HtHc7GX
 qW+PWYUaiUk7UpObuOuyv0yyUG45ii73iY1W0w66RiyCjVTgtpdwwMQ38VlBcoOg
 zcE1jneAEJt6TiS6zfRaER/10JoCIG4gp1+apPuaXud/o3BqWI0cagVHAgaLziBI
 F7bDJkbJZIR6GrWMgemBI+mc5/LACBePxzPGLScKFptejtQ/ysfZQ6aCLROJWB2U
 4EnysAaUBf6tywj30JqfQvqFNGkHIgY95FKiXJW6GzqqwgBouNf48vS15BgkwI+2
 EijcqUhlOVNfc3RIc0ZL5c9KcPIN9t5sqBrWZe3wgCErhxAx6w6Za9nDdP+US9bl
 /apCpvDFlu59g8n1wtkNE/uC+XqdKDwsplYhnfpX0FGni5wIknhQq3bSe4dPFgSn
 pG5VMrw3pA==
 =D6dS
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Pretty calm round, mostly just NVMe and a bit of MD:

   - NVMe updates (via Christoph)
        - improve the APST configuration algorithm (Alexey Bogoslavsky)
        - look for StorageD3Enable on companion ACPI device
          (Mario Limonciello)
        - allow selecting the network interface for TCP connections
          (Martin Belanger)
        - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King,
          Christoph)
        - move the ACPI StorageD3 code to drivers/acpi/ and add quirks
          for certain AMD CPUs (Mario Limonciello)
        - zoned device support for nvmet (Chaitanya Kulkarni)
        - fix the rules for changing the serial number in nvmet
          (Noam Gottlieb)
        - various small fixes and cleanups (Dan Carpenter, JK Kim,
          Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert
          Uytterhoeven, Daniel Wagner)

   - MD updates (Via Song)
        - iostats rewrite (Guoqing Jiang)
        - raid5 lock contention optimization (Gal Ofri)

   - Fall through warning fix (Gustavo)

   - Misc fixes (Gustavo, Jiapeng)"

* tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits)
  nvmet: use NVMET_MAX_NAMESPACES to set nn value
  loop: Fix missing discard support when using LOOP_CONFIGURE
  nvme.h: add missing nvme_lba_range_type endianness annotations
  nvme: remove zeroout memset call for struct
  nvme-pci: remove zeroout memset call for struct
  nvmet: remove zeroout memset call for struct
  nvmet: add ZBD over ZNS backend support
  nvmet: add Command Set Identifier support
  nvmet: add nvmet_req_bio put helper for backends
  nvmet: add req cns error complete helper
  block: export blk_next_bio()
  nvmet: remove local variable
  nvmet: use nvme status value directly
  nvmet: use u32 type for the local variable nsid
  nvmet: use u32 for nvmet_subsys max_nsid
  nvmet: use req->cmd directly in file-ns fast path
  nvmet: use req->cmd directly in bdev-ns fast path
  nvmet: make ver stable once connection established
  nvmet: allow mn change if subsys not discovered
  nvmet: make sn stable once connection was established
  ...
2021-06-30 12:21:16 -07:00
Linus Torvalds df668a5fe4 for-5.14/block-2021-06-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbXAwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr0HEADDJaSgjpnWQwH1RVLNagJa9KnktxZYsEs+
 as3QmDdpKRG3rEC9bdE7FLe/xq3WBaO5j1hTQ9P6IguqLyS1Df72DtTlKyaCrZoe
 zv9eIlY4lZUfksE2nzWmlN9uG0FBVXeEQpHCLSNbUZeK1zvV6+NNhQqw2kc0sEqu
 hReUFeMUbsMcu/w5T3XMVJNsTMCql9wta2H0q5hONQyJQSrIwa1D+sUdE5I8fO4j
 bnoYX9yxHX26EztX1UJiGRgoq5Trz7LY7hAfljKSkewpFwiHE2vBdq2L0C2RKsIV
 tTs2DjMCMQyPNeA7WAG8HlR4aPG+7+/fuBP1KJHkykjWXglWN7OqISuBv6rrBgQs
 gNRnZ4qmb1CzD6aLEBk59nHt6po6eMxXIW856YktKy8rKcrgK29qP44Z+oomkPKo
 ZjQ0wqN5CvpObM/dIKxl9bAJ4zQDHBt49d5nTTQLfWl/mgevu6ZNWD/hONyCQmFy
 zKKqQ/wkxWHutOsjC5/MKNb3ZRNH9tt9X+HfULO2DU6IqqifYw/ex4z4MVsBopJC
 7pPfd81kgC73TgXe1AaCwHqNWsrqYCuTK0ew1CtGudlS3lucMwtap4GBiCgg5gbu
 M8pEgwO4OcCLHyRUc8zdfqI7HumbprbFmojPkwGSEe0ofVD74lMhzbUj5jvTYY2B
 t8D2XcgyOA==
 =lhon
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

 - disk events cleanup (Christoph)

 - gendisk and request queue allocation simplifications (Christoph)

 - bdev_disk_changed cleanups (Christoph)

 - IO priority improvements (Bart)

 - Chained bio completion trace fix (Edward)

 - blk-wbt fixes (Jan)

 - blk-wbt enable/disable fix (Zhang)

 - Scheduler dispatch improvements (Jan, Ming)

 - Shared tagset scheduler improvements (John)

 - BFQ updates (Paolo, Luca, Pietro)

 - BFQ lock inversion fix (Jan)

 - Documentation improvements (Kir)

 - CLONE_IO block cgroup fix (Tejun)

 - Remove of ancient and deprecated block dump feature (zhangyi)

 - Discard merge fix (Ming)

 - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
   Yang)

* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
  block: fix discard request merge
  block/mq-deadline: Remove a WARN_ON_ONCE() call
  blk-mq: update hctx->dispatch_busy in case of real scheduler
  blk: Fix lock inversion between ioc lock and bfqd lock
  bfq: Remove merged request already in bfq_requests_merged()
  block: pass a gendisk to bdev_disk_changed
  block: move bdev_disk_changed
  block: add the events* attributes to disk_attrs
  block: move the disk events code to a separate file
  block: fix trace completion for chained bio
  block/partitions/msdos: Fix typo inidicator -> indicator
  block, bfq: reset waker pointer with shared queues
  block, bfq: check waker only for queues with no in-flight I/O
  block, bfq: avoid delayed merge of async queues
  block, bfq: boost throughput by extending queue-merging times
  block, bfq: consider also creation time in delayed stable merge
  block, bfq: fix delayed stable merge check
  block, bfq: let also stably merged queues enjoy weight raising
  blk-wbt: make sure throttle is enabled properly
  blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
  ...
2021-06-30 12:12:56 -07:00
Chaitanya Kulkarni cc72c44267 nvme: remove zeroout memset call for struct
Declare and initialize structure variables to zero values so that we can
remove zeroout memset calls in the host/core.c.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:21 +02:00
Chaitanya Kulkarni f66e2804d6 nvme-pci: remove zeroout memset call for struct
Declare and initialize structure variables to zero values so that we can
remove zeroout memset calls in the host/pci.c.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:21 +02:00
Chaitanya Kulkarni eff4423ec0 nvme-fabrics: remove memset in connect io q
Declare and initialize structure variable to the zero values so that we
can get rid of the zeroout memset call.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni bfa9d1222d nvme-fabrics: remove memset in connect admin q
Declare and initialize structure variable to the zero values so that we
can get rid of the zeroout memset call.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni c22c272013 nvme-fabrics: remove memset in nvmf_reg_write32()
Declare and initialize structure variable to the zero values so that we
can get rid of the zeroout memset call.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni 2796a8e409 nvme-fabrics: remove memset in nvmf_reg_read64()
Declare and initialize structure variable to the zero values so that we
can get rid of the zeroout memset call.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni 3b54064fbc nvme-tcp: use ctrl sgl check helper
Use the helper to check NVMe controller's SGL support.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni 253a0b76a1 nvme-pci: use ctrl sgl check helper
Use the helper to check NVMe controller's SGL support.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:19 +02:00
Chaitanya Kulkarni b61678bcd4 nvme-fc: use ctrl sgl check helper
Use the helper to check NVMe controller's SGL support.

Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:18 +02:00
Chaitanya Kulkarni 73eefc270a nvme: add a helper to check ctrl sgl support
For various transports such as fc/tcp/pci it is common to check if
NVMe SGLs are supported or not by the controller.

In this preparation patch we add a helper to avoid the open coding of
such checks in the various transport.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:18 +02:00
Chaitanya Kulkarni cb1b10e7ac nvme-pci: remove trailing lines for helpers
Remove the extra white line at the end of the functions.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:18 +02:00
JK Kim a0aac973a2 nvme-pci: fix var. type for increasing cq_head
nvmeq->cq_head is compared with nvmeq->q_depth and changed the value
and cq_phase for handling the next cq db.

but, nvmeq->q_depth's type is u32 and max. value is 0x10000 when
CQP.MSQE is 0xffff and io_queue_depth is 0x10000.

current temp. variable for comparing with nvmeq->q_depth is overflowed
when previous nvmeq->cq_head is 0xffff.

in this case, nvmeq->cq_phase is not updated.
so, fix data type for temp. variable to u32.

Signed-off-by: JK Kim <jongkang.kim2@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:18 +02:00