Remove the freeze/unfreeze around changes to the number of hardware
queues. Study and retest has indicated there are no ios that can be
active at this point so there is nothing to freeze.
nvme-fc is draining the queues in the shutdown and error recovery path
in __nvme_fc_abort_outstanding_ios.
This patch primarily reverts 88e837ed0f "nvme-fc: wait for queues to
freeze before calling update_hr_hw_queues". It's not an exact revert as
it leaves the adjusting of hw queues only if the count changes.
Signed-off-by: James Smart <jsmart2021@gmail.com>
[dwagner: added explanation why no IO is pending]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
To avoid race between time out and tear down, in tear down process,
first we quiesce the queue, and then delete the timer and cancel
the time out work for the queue.
This patch merges the admin and io sync ops into the queue teardown logic
as shown in the RDMA patch 3017013dcc "nvme-rdma: avoid race between time
out and tear down". There is no teardown_lock in nvme-fc.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In case the number of hardware queues changes, we need to update the
tagset and the mapping of ctx to hctx first.
If we try to create and connect the I/O queues first, this operation
will fail (target will reject the connect call due to the wrong number
of queues) and hence we bail out of the recreate function. Then we
will to try the very same operation again, thus we don't make any
progress.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDnGVYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv6UEAC78zkseI8TmKaowNfkz/+MkP9eSFb1pVn3
rxpbPOsZompHoZpeWt4oHL+3Rmm3a9iRo/APA2ELas4zvp+Q+6uG7eha2Dc4hUA9
YgeO4z9YfG8wQNZc3x7bncb6ZwqEE5nnbFe/m25SyrAZVLlZ7FKHxfoZDqjhlGFC
eLNiYO6vdvwgCoBMcotyCDttrPfEu6947/5vB1zevv57twdQQaEWGUhvyx1XrlDX
0YD5fmdOjNU2isgxt4xo2Ur2zL6w254/hvj58sV3Z7JfkJpI9DCK+ztKEfzuyEhA
WYz06rDAT1+1KuVLfowaZ+pYiPPOIsL0+QXI83r3nLaE7WGGlfS8Hmz//1FbziYs
ZSZI826kEN+/lKeWTcKOOMhmkYyXEFFuQZS34eg9KI4xwML8v+ILlHmcp+tjebw9
vzNF6f7N2ki+jnyxxyNxeMHxeAMWsqnIRROOhZg6bbs6UVNpDy4qRzpQaDOaJsVe
uSAQ6PTd/etR9KE+ClhLe6X7Rmp/lfZCPe64wqM/3k1qV2KWhE1fwCQO4c5o1MBN
rpk3Ef5PZYP3aakCvZnfcjMWlpZNbq/xMc6vPc+yq32akq1t1KbODVBiR5odcH0C
Gt5N11im50SO06haBt7EOe4JMQLbK5sxG15t4C6mNQZgPegGfaLlVkKpzIkOzUha
OkRofKMcDA==
=gHse
-----END PGP SIGNATURE-----
Merge tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"A combination of changes that ended up depending on both the driver
and core branch (and/or the IDE removal), and a few late arriving
fixes. In detail:
- Fix io ticks wrap-around issue (Chunguang)
- nvme-tcp sock locking fix (Maurizio)
- s390-dasd fixes (Kees, Christoph)
- blk_execute_rq polling support (Keith)
- blk-cgroup RCU iteration fix (Yu)
- nbd backend ID addition (Prasanna)
- Partition deletion fix (Yufen)
- Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph)
- Removal of now dead block request types due to IDE removal
(Christoph)
- Loop probing and control device cleanups (Christoph)
- Device uevent fix (Christoph)
- Misc cleanups/fixes (Tetsuo, Christoph)"
* tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits)
blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs
block: fix the problem of io_ticks becoming smaller
nvme-tcp: can't set sk_user_data without write_lock
loop: remove unused variable in loop_set_status()
block: remove the bdgrab in blk_drop_partitions
block: grab a device refcount in disk_uevent
s390/dasd: Avoid field over-reading memcpy()
dasd: unexport dasd_set_target_state
block: check disk exist before trying to add partition
ubd: remove dead code in ubd_setup_common
nvme: use return value from blk_execute_rq()
block: return errors from blk_execute_rq()
nvme: use blk_execute_rq() for passthrough commands
block: support polling through blk_execute_rq
block: remove REQ_OP_SCSI_{IN,OUT}
block: mark blk_mq_init_queue_data static
loop: rewrite loop_exit using idr_for_each_entry
loop: split loop_lookup
loop: don't allow deleting an unspecified loop device
loop: move loop_ctl_mutex locking into loop_add
...
This series consists of the usual driver updates (ufs, ibmvfc,
megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with
elx and mpi3mr being new drivers. The major core change is a rework
to drop the status byte handling macros and the old bit shifted
definitions and the rest of the updates are minor fixes.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYN7I6iYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishXpRAQCkngYZ
35yQrqOxgOk2pfrysE95tHrV1MfJm2U49NFTwAEAuZutEvBUTfBF+sbcJ06r6q7i
H0hkJN/Io7enFs5v3WA=
=zwIa
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (ufs, ibmvfc,
megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with
elx and mpi3mr being new drivers.
The major core change is a rework to drop the status byte handling
macros and the old bit shifted definitions and the rest of the updates
are minor fixes"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (287 commits)
scsi: aha1740: Avoid over-read of sense buffer
scsi: arcmsr: Avoid over-read of sense buffer
scsi: ips: Avoid over-read of sense buffer
scsi: ufs: ufs-mediatek: Add missing of_node_put() in ufs_mtk_probe()
scsi: elx: libefc: Fix IRQ restore in efc_domain_dispatch_frame()
scsi: elx: libefc: Fix less than zero comparison of a unsigned int
scsi: elx: efct: Fix pointer error checking in debugfs init
scsi: elx: efct: Fix is_originator return code type
scsi: elx: efct: Fix link error for _bad_cmpxchg
scsi: elx: efct: Eliminate unnecessary boolean check in efct_hw_command_cancel()
scsi: elx: efct: Do not use id uninitialized in efct_lio_setup_session()
scsi: elx: efct: Fix error handling in efct_hw_init()
scsi: elx: efct: Remove redundant initialization of variable lun
scsi: elx: efct: Fix spelling mistake "Unexected" -> "Unexpected"
scsi: lpfc: Fix build error in lpfc_scsi.c
scsi: target: iscsi: Remove redundant continue statement
scsi: qla4xxx: Remove redundant continue statement
scsi: ppa: Switch to use module_parport_driver()
scsi: imm: Switch to use module_parport_driver()
scsi: mpt3sas: Fix error return value in _scsih_expander_add()
...
The generic blk_execute_rq() knows how to handle polled completions. Use
that instead of implementing an nvme specific handler.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210610214437.641245-3-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbd5UQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvsNEADCJKP81boFzRcdJo7EqaNDAzZyKOIg9Oq7
4GZE0Wm0SgA6+04bKrNVd9KLcKvQ+NC1pK7UJemSSH2y9ir+zHfyYgAV0/+wFmYm
NgHlDjBvf80XSI5wezcb6MxZT+R7IaIpDsW1ZvV9hFtPSncn5o2OIWiSdJtHT/Rv
enlgZPc7OwNWoVMX8eR58IoO0k3S6GLpctUZHt/AUukaKgoOks0X523qhEPf3Upr
RkbIZuqLWVgpdT6457iSE/OijUczD4thTI8bdprxzhgimOm2vV52sO6F5HtHc7GX
qW+PWYUaiUk7UpObuOuyv0yyUG45ii73iY1W0w66RiyCjVTgtpdwwMQ38VlBcoOg
zcE1jneAEJt6TiS6zfRaER/10JoCIG4gp1+apPuaXud/o3BqWI0cagVHAgaLziBI
F7bDJkbJZIR6GrWMgemBI+mc5/LACBePxzPGLScKFptejtQ/ysfZQ6aCLROJWB2U
4EnysAaUBf6tywj30JqfQvqFNGkHIgY95FKiXJW6GzqqwgBouNf48vS15BgkwI+2
EijcqUhlOVNfc3RIc0ZL5c9KcPIN9t5sqBrWZe3wgCErhxAx6w6Za9nDdP+US9bl
/apCpvDFlu59g8n1wtkNE/uC+XqdKDwsplYhnfpX0FGni5wIknhQq3bSe4dPFgSn
pG5VMrw3pA==
=D6dS
-----END PGP SIGNATURE-----
Merge tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"Pretty calm round, mostly just NVMe and a bit of MD:
- NVMe updates (via Christoph)
- improve the APST configuration algorithm (Alexey Bogoslavsky)
- look for StorageD3Enable on companion ACPI device
(Mario Limonciello)
- allow selecting the network interface for TCP connections
(Martin Belanger)
- misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King,
Christoph)
- move the ACPI StorageD3 code to drivers/acpi/ and add quirks
for certain AMD CPUs (Mario Limonciello)
- zoned device support for nvmet (Chaitanya Kulkarni)
- fix the rules for changing the serial number in nvmet
(Noam Gottlieb)
- various small fixes and cleanups (Dan Carpenter, JK Kim,
Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert
Uytterhoeven, Daniel Wagner)
- MD updates (Via Song)
- iostats rewrite (Guoqing Jiang)
- raid5 lock contention optimization (Gal Ofri)
- Fall through warning fix (Gustavo)
- Misc fixes (Gustavo, Jiapeng)"
* tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits)
nvmet: use NVMET_MAX_NAMESPACES to set nn value
loop: Fix missing discard support when using LOOP_CONFIGURE
nvme.h: add missing nvme_lba_range_type endianness annotations
nvme: remove zeroout memset call for struct
nvme-pci: remove zeroout memset call for struct
nvmet: remove zeroout memset call for struct
nvmet: add ZBD over ZNS backend support
nvmet: add Command Set Identifier support
nvmet: add nvmet_req_bio put helper for backends
nvmet: add req cns error complete helper
block: export blk_next_bio()
nvmet: remove local variable
nvmet: use nvme status value directly
nvmet: use u32 type for the local variable nsid
nvmet: use u32 for nvmet_subsys max_nsid
nvmet: use req->cmd directly in file-ns fast path
nvmet: use req->cmd directly in bdev-ns fast path
nvmet: make ver stable once connection established
nvmet: allow mn change if subsys not discovered
nvmet: make sn stable once connection was established
...
Use the helper to check NVMe controller's SGL support.
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a new sysfs attribute, appid_store, which can be used to set the
application identifier in the blkcg associated with a cgroup id.
Below is the interface provided to set the app_id:
echo "<cgroupid>:<appid>" >> /sys/class/fc/fc_udev_device/appid_store
echo "457E:100000109b521d27" >> /sys/class/fc/fc_udev_device/appid_store
Link: https://lore.kernel.org/r/20210608043556.274139-4-muneendra.kumar@broadcom.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Returning an nvme status from nvme_fc_create_association() indicates
that the association is established, and we should honour the DNR bit.
If it's set a reconnect attempt will just return the same error, so
we can short-circuit the reconnect attempts and fail the connection
directly.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The __nvmf_check_ready() routine used to bounce all filesystem io if the
controller state isn't LIVE. However, a later patch changed the logic so
that it rejection ends up being based on the Q live check. The FC
transport has a slightly different sequence from rdma and tcp for
shutting down queues/marking them non-live. FC marks its queue non-live
after aborting all ios and waiting for their termination, leaving a
rather large window for filesystem io to continue to hit the transport.
Unfortunately this resulted in filesystem I/O or applications seeing I/O
errors.
Change the FC transport to mark the queues non-live at the first sign of
teardown for the association (when I/O is initially terminated).
Fixes: 73a5379937 ("nvme-fabrics: allow to queue requests for live queues")
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
queue_rq() in pci only checks if the dispatched queue (nvmeq) is ready,
e.g. not being suspended. Since nvme_alloc_admin_tags() in reset flow
restarts the admin queue, users are able to submit admin commands to a
controller before reset_work() completes. Commands submitted under this
condition may interfere with commands that performs identify, IO queue
setup in reset_work(), and may result in a hang described in the
following patch.
As seen in the fabrics, user commands are prevented from being executed
under inproper controller states. We may reuse this logic to maintain a
clear admin queue during reset_work().
Signed-off-by: Tao Chiu <taochiu@synology.com>
Signed-off-by: Cody Wong <codywong@synology.com>
Reviewed-by: Leon Chien <leonchien@synology.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
SGLs support is mandatory for NVMe/FC, make sure that the target is
aligned to the specification.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
All nvme transport drivers preallocate an nvme command for each request.
Assume to use that command for nvme_setup_cmd() instead of requiring
drivers pass a pointer to it. All nvme drivers must initialize the
generic nvme_request 'cmd' to point to the transport's preallocated
nvme_command.
The generic nvme_request cmd pointer had previously been used only as a
temporary copy for passthrough commands. Since it now points to the
command that gets dispatched, passthrough commands must directly set it
up prior to executing the request.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The nvme_fc_rcv_ls_req() function has first argument as pointer to
remoteport named portprt, but in the documentation comment that is name
is used as remoteport. Fix that to get rid if the compilation warning.
drivers/nvme//host/fc.c:1724: warning: Function parameter or member 'portptr' not described in 'nvme_fc_rcv_ls_req'
drivers/nvme//host/fc.c:1724: warning: Excess function parameter 'remoteport' description in 'nvme_fc_rcv_ls_req'
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This is a prep patch so that we can move the identify data structure
related code initialization from nvme_init_identify() into a helper.
Rename the function nvmet_init_identify() to nvmet_init_ctrl_finish().
Next patch will move the nvme_id_ctrl related initialization from newly
renamed function nvme_init_ctrl_finish() into the nvme_init_identify()
helper.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fabrics drivers currently reserve two tags on the admin queue. But
given that the connect command is only run on a freshly created queue
or after all commands have been force aborted we only need to reserve
a single tag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Recent patch to prevent calling __nvme_fc_abort_outstanding_ios in
interrupt context results in a possible race condition. A controller
reset results in errored io completions, which schedules error
work. The change of error work to a work element allows it to fire
after the ctrl state transition to NVME_CTRL_CONNECTING, causing
any outstanding io (used to initialize the controller) to fail and
cause problems for connect_work.
Add a state check to only schedule error work if not in the RESETTING
state.
Fixes: 19fce0470f ("nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context")
Signed-off-by: Nigel Kirkland <nkirkland2304@gmail.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When a command has been aborted we should return NVME_SC_HOST_ABORTED_CMD
to be consistent with the other transports.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
nvme_fc_terminate_exchange() is being called when exchanges are
being deleted, and as such we should be setting the NVME_REQ_CANCELLED
flag to have identical behaviour on all transports.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The only usage of these is to put their addresses in arrays of pointers
to const attribute_groups. Make them const to allow the compiler to put
them in read-only memory.
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Recent patches changed calling sequences. nvme_fc_abort_outstanding_ios
used to be called from a timeout or work context. Now it is being called
in an io completion context, which can be an interrupt handler.
Unfortunately, the abort outstanding ios routine attempts to stop nvme
queues and nested routines that may try to sleep, which is in conflict
with the interrupt handler.
Correct replacing the direct call with a work element scheduling, and the
abort outstanding ios routine will be called in the work element.
Fixes: 95ced8a2c7 ("nvme-fc: eliminate terminate_io use by nvme_fc_error_recovery")
Signed-off-by: James Smart <james.smart@broadcom.com>
Reported-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This is purely a clenaup patch, add prefix NVME to the ADMIN_TIMEOUT to
make consistent with NVME_IO_TIMEOUT.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
__nvme_fc_terminate_io() is now called by only 1 place, in reset_work.
Consoldate and move the functionality of terminate_io into reset_work.
In reset_work, rather than calling the create_association directly,
schedule the connect work element to do its thing. After scheduling,
flush the connect work element to continue with semantic of not
returning until connect has been attempted at least once.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
nvme_fc_error_recovery() special cases handling when in CONNECTING state
and calls __nvme_fc_terminate_io(). __nvme_fc_terminate_io() itself
special cases CONNECTING state and calls the routine to abort outstanding
ios.
Simplify the sequence by putting the call to abort outstanding I/Os
directly in nvme_fc_error_recovery.
Move the location of __nvme_fc_abort_outstanding_ios(), and
nvme_fc_terminate_exchange() which is called by it, to avoid adding
function prototypes for nvme_fc_error_recovery().
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
err_work was created to handle errors (mainly I/O timeouts) while in
CONNECTING state. The flag for err_work_active is also unneeded.
Remove err_work_active and err_work. The actions to abort I/Os are moved
inline to nvme_error_recovery().
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Whenever there are errors during CONNECTING, the driver recovers by
aborting all outstanding ios and counts on the io completion to fail them
and thus the connection/association they are on. However, the connection
failure depends on a failure state from the core routines. Not all
commands that are issued by the core routine are guaranteed to cause a
failure of the core routine. They may be treated as a failure status and
the status is then ignored.
As such, whenever the transport enters error_recovery while CONNECTING,
it will set a new flag indicating an association failed. The
create_association routine which creates and initializes the controller,
will monitor the state of the flag as well as the core routine error
status and ensure the association fails if there was an error.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We've had several complaints about a 10s reconnect delay (the default)
when there was an error while there is connectivity to a subsystem.
The max_reconnects and reconnect_delay are set in common code prior to
calling the transport to create the controller.
This change checks if the default reconnect delay is being used, and if
so, it adjusts it to a shorter period (2s) for the nvme-fc transport.
It does so by calculating the controller loss tmo window, changing the
value of the reconnect delay, and then recalculating the maximum number
of reconnect attempts allowed.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
On reconnect, the code currently does not freeze the controller before
possibly updating the number hw queues for the controller.
Add the freeze before updating the number of hw queues. Note: the queues
are already started and remain started through the reconnect.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The loop that backs out of hw io queue creation continues through index
0, which corresponds to the admin queue as well.
Fix the loop so it only proceeds through indexes 1..n which correspond to
I/O queues.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Currently, an I/O timeout unconditionally invokes
nvme_fc_error_recovery() which checks for LIVE or CONNECTING state. If
live, the routine resets the controller which initiates a reconnect -
which is valid. If CONNECTING, err_work is scheduled. Err_work then
calls the terminate_io routine, which also checks for CONNECTING and
noops any further action on outstanding I/O. The result is nothing
happened to the timed out io. As such, if the command was dropped on
the wire, it will never timeout / complete, and the connect process
will hang.
Change the behavior of the io timeout routine to unconditionally abort
the I/O. I/O completion handling will note that an io failed due to an
abort and will terminate the connection / association as needed. If the
abort was unable to happen, continue with a call to
nvme_fc_error_recovery(). To ensure something different happens in
nvme_fc_error_recovery() rework it so at it will abort all I/Os on the
association to force a failure.
As I/O aborts now may occur outside of delete_association, counting for
completion must be wary and only count those aborted during
delete_association when TERMIO is set on the controller.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The lldd may have made calls to delete a remote port or local port and
the delete is in progress when the cli then attempts to create a new
controller. Currently, this proceeds without error although it can't be
very successful.
Fix this by validating that both the host port and remote port are
present when a new controller is to be created.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cancel async event work in case async event has been queued up, and
nvme_fc_submit_async_event() runs after event has been freed.
Signed-off-by: David Milburn <dmilburn@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
nvme_end_request is a bit misnamed, as it wraps around the
blk_mq_complete_* API. It's semantics also are non-trivial, so give it
a more descriptive name and add a comment explaining the semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On an error exit path, a negative error code should be returned
instead of a positive return value.
Fixes: e399441de9 ("nvme-fabrics: Add host support for FC transport")
Cc: James Smart <jsmart2021@gmail.com>
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the FC transport is set max_hw_sectors based on the lldds
max sgl segment count. However, the block queue max segments is
set based on the controller's max_segments count, which the transport
does not set. As such, the lldd is receiving sgl lists that are
exceeding its max segment count.
Set the controller max segment count and derive max_hw_sectors from
the max segment count.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
A deadlock happens in the following scenario with multipath:
1) scan_work(nvme0) detects a new nsid while nvme0
is an optimized path to it, path nvme1 happens to be
inaccessible.
2) Before scan_work is complete nvme0 disconnect is initiated
nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING
3) scan_work(1) attempts to submit IO,
but nvme_path_is_optimized() observes nvme0 is not LIVE.
Since nvme1 is a possible path IO is requeued and scan_work hangs.
--
Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel: __schedule+0x2b9/0x6c0
kernel: schedule+0x42/0xb0
kernel: io_schedule+0x16/0x40
kernel: do_read_cache_page+0x438/0x830
kernel: read_cache_page+0x12/0x20
kernel: read_dev_sector+0x27/0xc0
kernel: read_lba+0xc1/0x220
kernel: efi_partition+0x1e6/0x708
kernel: check_partition+0x154/0x244
kernel: rescan_partitions+0xae/0x280
kernel: __blkdev_get+0x40f/0x560
kernel: blkdev_get+0x3d/0x140
kernel: __device_add_disk+0x388/0x480
kernel: device_add_disk+0x13/0x20
kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel: nvme_validate_ns+0x396/0x940 [nvme_core]
kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
kernel: process_one_work+0x1db/0x380
kernel: worker_thread+0x249/0x400
kernel: kthread+0x104/0x140
--
4) Delete also hangs in flush_work(ctrl->scan_work)
from nvme_remove_namespaces().
Similiarly a deadlock with ana_work may happen: if ana_work has started
and calls nvme_mpath_set_live and device_add_disk, it will
trigger I/O. When we trigger disconnect I/O will block because
our accessible (optimized) path is disconnecting, but the alternate
path is inaccessible, so I/O blocks. Then disconnect tries to flush
the ana_work and hangs.
[ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
[ 605.552087] Call Trace:
[ 605.552683] __schedule+0x2b9/0x6c0
[ 605.553507] schedule+0x42/0xb0
[ 605.554201] io_schedule+0x16/0x40
[ 605.555012] do_read_cache_page+0x438/0x830
[ 605.556925] read_cache_page+0x12/0x20
[ 605.557757] read_dev_sector+0x27/0xc0
[ 605.558587] amiga_partition+0x4d/0x4c5
[ 605.561278] check_partition+0x154/0x244
[ 605.562138] rescan_partitions+0xae/0x280
[ 605.563076] __blkdev_get+0x40f/0x560
[ 605.563830] blkdev_get+0x3d/0x140
[ 605.564500] __device_add_disk+0x388/0x480
[ 605.565316] device_add_disk+0x13/0x20
[ 605.566070] nvme_mpath_set_live+0x5e/0x130 [nvme_core]
[ 605.567114] nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
[ 605.568197] nvme_update_ana_state+0xca/0xe0 [nvme_core]
[ 605.569360] nvme_parse_ana_log+0xa1/0x180 [nvme_core]
[ 605.571385] nvme_read_ana_log+0x76/0x100 [nvme_core]
[ 605.572376] nvme_ana_work+0x15/0x20 [nvme_core]
[ 605.573330] process_one_work+0x1db/0x380
[ 605.574144] worker_thread+0x4d/0x400
[ 605.574896] kthread+0x104/0x140
[ 605.577205] ret_from_fork+0x35/0x40
[ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
[ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830
[ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.582320] nvme D 0 14044 14043 0x00000000
[ 605.583424] Call Trace:
[ 605.583935] __schedule+0x2b9/0x6c0
[ 605.584625] schedule+0x42/0xb0
[ 605.585290] schedule_timeout+0x203/0x2f0
[ 605.588493] wait_for_completion+0xb1/0x120
[ 605.590066] __flush_work+0x123/0x1d0
[ 605.591758] __cancel_work_timer+0x10e/0x190
[ 605.593542] cancel_work_sync+0x10/0x20
[ 605.594347] nvme_mpath_stop+0x2f/0x40 [nvme_core]
[ 605.595328] nvme_stop_ctrl+0x12/0x50 [nvme_core]
[ 605.596262] nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
[ 605.597333] nvme_sysfs_delete+0x5c/0x70 [nvme_core]
[ 605.598320] dev_attr_store+0x17/0x30
Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will
indicate the phase of controller deletion where I/O cannot be allowed
to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to
be issued to the bottom device, and only after we flush the ana_work
and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces)
we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work
from re-firing by aborting early if we are not LIVE, so we should be safe
here.
In addition, change the transport drivers to follow the updated state
machine.
Fixes: 0d0b660f21 ("nvme: add ANA support")
Reported-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Use the new blk_mq_complete_request_remote helper to avoid an indirect
function call in the completion fast path.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Asynchronous event notifications do not have an associated request.
When fcp_io() fails we unconditionally call nvme_cleanup_cmd() which
leads to a crash.
Fixes: 16686f3a6c ("nvme: move common call to nvme_cleanup_cmd to core layer")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Himanshu Madhani <hmadhani2024@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The nvme-fc devloss_tmo is computed as the min of either the
ctrl_loss_tmo (max_retries * reconnect_delay) or the remote port's
devloss_tmo. But what gets printed as the nvme-fc devloss_tmo in
nvme_fc_reconnect_or_delete() is always the remote port's devloss_tmo
value. So correct this by printing the min value instead.
Signed-off-by: Martin George <marting@netapp.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When CONFIG_ARCH_NO_SG_CHAIN is set, op->sgl[0] cannot be dereferenced,
as gcc-10 now points out:
drivers/nvme/host/fc.c: In function 'nvme_fc_init_request':
drivers/nvme/host/fc.c:1774:29: warning: array subscript 0 is outside the bounds of an interior zero-length array 'struct scatterlist[0]' [-Wzero-length-bounds]
1774 | op->op.fcp_req.first_sgl = &op->sgl[0];
| ^~~~~~~~~~~
drivers/nvme/host/fc.c:98:21: note: while referencing 'sgl'
98 | struct scatterlist sgl[NVME_INLINE_SG_CNT];
| ^~~
I don't know if this is a legitimate warning or a false-positive.
If this is just a false alarm, the warning is easily suppressed
by interpreting the array as a pointer.
Fixes: b1ae1a2389 ("nvme-fc: Avoid preallocating big SGL for data")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The nvme-fc host transport did not support the reception of a
FC-NVME LS. Reception is necessary to implement full compliance
with FC-NVME-2.
Populate the LS receive handler, and specifically the handling
of a Disconnect Association LS. The response to the LS, if it
matched a controller, must be sent after the aborts for any
I/O on any connection have been sent.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Given that both host and target now generate and receive LS's create
a single table definition for LS names. Each tranport half will have
a local version of the table.
As Create Association LS is issued by both sides, and received by
both sides, create common routines to format the LS and to validate
the LS.
Convert the host side transport to use the new common Create
Association LS formatting routine.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert the assoc_active boolean flag to a bitop on the flags field.
The bit ops will provide atomicity.
To make this change, the flags field was converted to a long type,
which also affects the FCCTRL_TERMIO flag. Both FCCTRL_TERMIO and
now ASSOC_ACTIVE flags are set/cleared by bit operations.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that when allocations are done, and the lldd options indicate
no private data is needed, that private pointers will be set to NULL
(catches driver error that forgot to set private data size).
Slightly reorg the allocations so that private data follows allocations
for LS request/response buffers. Ensures better alignments for the buffers
as well as the private pointer.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Routines in the target will want to be used in the host as well.
Error definitions should now shared as both sides will process
requests and responses to requests.
Moved common declarations to new fc.h header kept in the host
subdirectory.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current LLDD api has:
nvme-fc: contains api for transport to do LS requests (and aborts of
them). However, there is no interface for reception of LS's and sending
responses for them.
nvmet-fc: contains api for transport to do reception of LS's and sending
of responses for them. However, there is no interface for doing LS
requests.
Revise the api's so that both nvme-fc and nvmet-fc can send LS's, as well
as receiving LS's and sending their responses.
Change name of the rcv_ls_req struct to better reflect generic use as
a context to used to send an ls rsp. Specifically:
nvmefc_tgt_ls_req -> nvmefc_ls_rsp
nvmefc_tgt_ls_req.nvmet_fc_private -> nvmefc_ls_rsp.nvme_fc_private
Change nvmet_fc_rcv_ls_req() calling sequence to provide handle that
can be used by transport in later LS request sequences for an association.
nvme-fc nvmet_fc nvme_fcloop:
Revise to adapt to changed names in api header.
Change calling sequence to nvmet_fc_rcv_ls_req() for hosthandle.
Add stubs for new interfaces:
host/fc.c: nvme_fc_rcv_ls_req()
target/fc.c: nvmet_fc_invalidate_host()
lpfc:
Revise to adapt code to changed names in api header.
Change calling sequence to nvmet_fc_rcv_ls_req() for hosthandle.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The original patch was to resolve the lldd being able to be unloaded
while being used to talk to the boot device of the system. However, the
end result of the original patch is that any driver unload while a nvme
controller is live via the lldd is now being prohibited. Given the module
reference, the module teardown routine can't be called, thus there's no
way, other than manual actions to terminate the controllers.
Fixes: 863fbae929 ("nvme_fc: add module to ops template to allow module references")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>