Граф коммитов

1014105 Коммитов

Автор SHA1 Сообщение Дата
Geert Uytterhoeven d399742cd0 nvme: fix grammar in the CONFIG_NVME_MULTIPATH kconfig help text
Fix a singular/plural mismatch in the CONFIG_NVME_MULTIPATH help text.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-16 05:36:15 +02:00
Daniel Wagner 2411424143 nvme: remove superfluous bio_set_dev in nvme_requeue_work
Commit ce86dad222 ("nvme-multipath: reset bdev to ns head when
failover") moved the reset code where the bio is added to the
requeue_list for the failover path. But it left the original
bio_set_dev in nvme_requeue_work.

There is a second path to nvme_requee_work. It is via
nvme_ns_head_submit_bio. Though we don't have to set bio->bi_bdev for
this path either, as it points to the correct bdev already.

Let's remove the bio_set_dev. It's updating the bio->bi_bdev with the
same pointer and thus it's unnecessary.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-16 05:36:15 +02:00
Daniel Wagner 120bb3624d nvme: verify MNAN value if ANA is enabled
The controller is required to have a non-zero MNAN value if it supports
ANA:

   If the controller supports Asymmetric Namespace Access Reporting, then
   this field shall be set to a non-zero value that is less than or equal
   to the NN value.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-16 05:36:15 +02:00
Mario Limonciello 6485fc18fa ACPI: Add quirks for AMD Renoir/Lucienne CPUs to force the D3 hint
AMD systems from Renoir and Lucienne require that the NVME controller
is put into D3 over a Modern Standby / suspend-to-idle
cycle.  This is "typically" accomplished using the `StorageD3Enable`
property in the _DSD, but this property was introduced after many
of these systems launched and most OEM systems don't have it in
their BIOS.

On AMD Renoir without these drives going into D3 over suspend-to-idle
the resume will fail with the NVME controller being reset and a trace
like this in the kernel logs:
```
[   83.556118] nvme nvme0: I/O 161 QID 2 timeout, aborting
[   83.556178] nvme nvme0: I/O 162 QID 2 timeout, aborting
[   83.556187] nvme nvme0: I/O 163 QID 2 timeout, aborting
[   83.556196] nvme nvme0: I/O 164 QID 2 timeout, aborting
[   95.332114] nvme nvme0: I/O 25 QID 0 timeout, reset controller
[   95.332843] nvme nvme0: Abort status: 0x371
[   95.332852] nvme nvme0: Abort status: 0x371
[   95.332856] nvme nvme0: Abort status: 0x371
[   95.332859] nvme nvme0: Abort status: 0x371
[   95.332909] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -16
[   95.332936] nvme 0000:03:00.0: PM: failed to resume async: error -16
```

The Microsoft documentation for StorageD3Enable mentioned that Windows has
a hardcoded allowlist for D3 support, which was used for these platforms.
Introduce quirks to hardcode them for Linux as well.

As this property is now "standardized", OEM systems using AMD Cezanne and
newer APU's have adopted this property, and quirks like this should not be
necessary.

CC: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com>
CC: Alexander Deucher <Alexander.Deucher@amd.com>
CC: Prike Liang <prike.liang@amd.com>
Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Julian Sikorski <belegdol@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-16 05:36:13 +02:00
Mario Limonciello 2744d7a073 ACPI: Check StorageD3Enable _DSD property in ACPI code
Although first implemented for NVME, this check may be usable by
other drivers as well. Microsoft's specification explicitly mentions
that is may be usable by SATA and AHCI devices.  Google also indicates
that they have used this with SDHCI in a downstream kernel tree that
a user can plug a storage device into.

Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro
Suggested-by: Keith Busch <kbusch@kernel.org>
CC: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com>
CC: Alexander Deucher <Alexander.Deucher@amd.com>
CC: Rafael J. Wysocki <rjw@rjwysocki.net>
CC: Prike Liang <prike.liang@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-16 05:14:59 +02:00
Jens Axboe e0d245e223 Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.14/drivers
Pull MD changes from Song:

"1) iostats rewrite by Guoqing Jiang;
 2) raid5 lock contention optimization by Gal Ofri."

* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/raid5: avoid device_lock in read_one_chunk()
  md: add comments in md_integrity_register
  md: check level before create and exit io_acct_set
  md: Constify attribute_group structs
  md: mark some personalities as deprecated
  md/raid10: enable io accounting
  md/raid1: enable io accounting
  md/raid1: rename print_msg with r1bio_existed
  md/raid5: avoid redundant bio clone in raid5_read_one_chunk
  md/raid5: move checking badblock before clone bio in raid5_read_one_chunk
  md: add io accounting for raid0 and raid5
  md: revert io stats accounting
2021-06-15 15:42:56 -06:00
Jens Axboe 491e5b170e Floppy patches for 5.14
Two oneliners to fix clang warnings:
 - -Wimplicit-fallthrough warning fix from Gustavo A. R. Silva.
 - Redundant assignment warning fix from Jiapeng Chong.
 
 No semantic and behavioural changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEdlQDNgKUDfGSD+QDtSKVsDNQMB8FAmDIS9ASHGVmcmVtb3ZA
 bGludXguY29tAAoJELUilbAzUDAftJYP/RTZ/gMH/41vBFj94UACCPwtexsEgIxe
 jU0GglPtKb/wTSMgQRs3EIsNTV/fWnZIoVtc4Mi+uT1jH7LFg+yYztSKfXL+OM4Y
 rZ/1Y68k0qH7ImX86Xi+He2Mpiqy9nr9wr4oJfWaawIunLhHrwyn7HwVsKhNqzRl
 grulrVXlTyypZPXOz+woa+waKLXFaYxgHjE4pV5qQRoRa72MIZ7gbbF9ch9f4o/P
 zCPmWztkHCJQt0xNYhY3XSBGLamxJuG3MtckTcQ2oQwzoJvsDvbXI+k+MbSjOOuE
 mXSjBQcmD+K/Z7AH6t5ax1nKt1gSorJOVVPCaorycxK62GqAjhomlGnB5f7HHLuj
 c0tbeWSkT4SwZjfD/sTKdJE9m3hR4IBxIOGIuP9b5dP/NIn/bKa+moqa6ydldQ8b
 XvHRzry5lryzXepg7hZbhwkfabGuBnmKaH6VgVxPUaPGF8cUO/zyq4F2TSXZMOee
 hMG+w1VcKPuwDj1IY1WfwXPv6VBhx69b5/0lqtcpztMCqSuZd8+rsWpCGuVFGc2i
 NOvRQHjPwuEMi9m9H4Jmw/ZS3IZ48J3XBd+Zsu+PJgcs+JUCQKme+9iybeW+lfSF
 5GV/CurlP05qD1LosMg1L2XnJMBd9LGs75tLgtadxHVyXubKN6VrGQ+AgCByTEI4
 rm605K2iARQJ
 =1rJH
 -----END PGP SIGNATURE-----

Merge tag 'floppy-for-5.14' of https://github.com/evdenis/linux-floppy into for-5.14/drivers

Pull floppy fixes from Denis:

"Floppy patches for 5.14

 Two oneliners to fix clang warnings:
 - -Wimplicit-fallthrough warning fix from Gustavo A. R. Silva.
 - Redundant assignment warning fix from Jiapeng Chong.

 No semantic and behavioural changes."

* tag 'floppy-for-5.14' of https://github.com/evdenis/linux-floppy:
  floppy: Fix fall-through warning for Clang
  floppy: cleanup: remove redundant assignment to nr_sectors
2021-06-15 15:41:17 -06:00
Gustavo A. R. Silva 2c9bdf6e47 floppy: Fix fall-through warning for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Link: https://lore.kernel.org/linux-hardening/47bcd36a-6524-348b-e802-0691d1b3c429@kernel.dk/
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Denis Efremov <efremov@linux.com>
2021-06-15 09:18:55 +03:00
Jiapeng Chong 30ab5db7ee floppy: cleanup: remove redundant assignment to nr_sectors
Variable nr_sectors is set to zero but this value is never
read as it is overwritten later on, hence it is a redundant
assignment and can be removed.

Clean up the following clang-analyzer warning:

drivers/block/floppy.c:2333:2: warning: Value stored to 'nr_sectors' is
never read [clang-analyzer-deadcode.DeadStores].

Link: https://lore.kernel.org/r/1619774805-121562-1-git-send-email-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Denis Efremov <efremov@linux.com>
2021-06-15 09:18:51 +03:00
Gal Ofri 97ae27252f md/raid5: avoid device_lock in read_one_chunk()
There is a lock contention on device_lock in read_one_chunk().
device_lock is taken to sync conf->active_aligned_reads and
conf->quiesce.
read_one_chunk() takes the lock, then waits for quiesce=0 (resumed)
before incrementing active_aligned_reads.
raid5_quiesce() takes the lock, sets quiesce=2 (in-progress), then waits
for active_aligned_reads to be zero before setting quiesce=1
(suspended).

Introduce a fast (lockless) path in read_one_chunk(): activate aligned
read without taking device_lock.  In case quiesce starts while
activating the aligned-read in fast path, deactivate it and revert to
old behavior (take device_lock and wait for quiesce to finish).

Add smp store/load in raid5_quiesce()/read_one_chunk() respectively to
gaurantee that read_one_chunk() does not miss an ongoing quiesce.

My setups:
1. 8 local nvme drives (each up to 250k iops).
2. 8 ram disks (brd).

Each setup with raid6 (6+2), 1024 io threads on a 96 cpu-cores (48 per
socket) system. Record both iops and cpu spent on this contention with
rand-read-4k. Record bw with sequential-read-128k.  Note: in most cases
cpu is still busy but due to "new" bottlenecks.

nvme:
              | iops           | cpu  | bw
-----------------------------------------------
without patch | 1.6M           | ~50% | 5.5GB/s
with patch    | 2M (throttled) | 0%   | 16GB/s (throttled)

ram (brd):
              | iops           | cpu  | bw
-----------------------------------------------
without patch | 2M             | ~80% | 24GB/s
with patch    | 4M             | 0%   | 55GB/s

CC: Song Liu <song@kernel.org>
CC: Neil Brown <neilb@suse.de>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Gal Ofri <gal.ofri@storing.io>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:07 -07:00
Guoqing Jiang de3ea66e9d md: add comments in md_integrity_register
Given it is not obvious for the error handling, let's try to add some
comments here to make it clear.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:07 -07:00
Guoqing Jiang daee202471 md: check level before create and exit io_acct_set
The bio_set (io_acct_set) is used by personalities to clone bio and
trace the timestamp of bio. Some personalities such as raid1/10 don't
need the bio_set, so add check to not create it unconditionally.

Also update the comment for md_account_bio to make it more clear.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:07 -07:00
Rikard Falkeborn c32dc04059 md: Constify attribute_group structs
The attribute_group structs are never modified, they're only passed to
sysfs_create_group() and sysfs_remove_group(). Make them const to allow
the compiler to put them in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:07 -07:00
Guoqing Jiang 608f52e30a md: mark some personalities as deprecated
Mark the three personalities (linear, fault and multipath) as deprecated
because:

1. people can use dm multipath or nvme multipath.
2. linear is already deprecated in MODULE_ALIAS.
3. no one actively using fault.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:07 -07:00
Guoqing Jiang 528bc2cf2f md/raid10: enable io accounting
For raid10, we record the start time between split bio and clone bio,
and finish the accounting in the final endio.

Also introduce start_time in r10bio accordingly.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:07 -07:00
Guoqing Jiang a0159832e5 md/raid1: enable io accounting
For raid1, we record the start time between split bio and clone bio,
and finish the accounting in the final endio.

Also introduce start_time in r1bio accordingly.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:07 -07:00
Guoqing Jiang 9b8ae7b938 md/raid1: rename print_msg with r1bio_existed
The caller of raid1_read_request could pass NULL or a valid pointer for
"struct r1bio *r1_bio", so it actually means whether r1_bio is existed
or not.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:06 -07:00
Guoqing Jiang 1147f58e10 md/raid5: avoid redundant bio clone in raid5_read_one_chunk
After enable io accounting, chunk read bio could be cloned twice which
is not good. To avoid such inefficiency, let's clone align_bio from
io_acct_set too, then we need only call md_account_bio in make_request
unconditionally.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:06 -07:00
Guoqing Jiang c82aa1b767 md/raid5: move checking badblock before clone bio in raid5_read_one_chunk
We don't need to clone bio if the relevant region has badblock.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:06 -07:00
Guoqing Jiang 10764815ff md: add io accounting for raid0 and raid5
We introduce a new bioset (io_acct_set) for raid0 and raid5 since they
don't own clone infrastructure to accounting io. And the bioset is added
to mddev instead of to raid0 and raid5 layer, because with this way, we
can put common functions to md.h and reuse them in raid0 and raid5.

Also struct md_io_acct is added accordingly which includes io start_time,
the origin bio and cloned bio. Then we can call bio_{start,end}_io_acct
to get related io status.

Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:06 -07:00
Guoqing Jiang ad3fc79880 md: revert io stats accounting
The commit 41d2d848e5 ("md: improve io stats accounting") could cause
double fault problem per the report [1], and also it is not correct to
change ->bi_end_io if md don't own it, so let's revert it.

And io stats accounting will be replemented in later commits.

[1]. https://lore.kernel.org/linux-raid/3bf04253-3fad-434a-63a7-20214e38cf26@gmail.com/T/#t

Fixes: 41d2d848e5 ("md: improve io stats accounting")
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
2021-06-14 22:32:06 -07:00
Christoph Hellwig d07f3b081e mark pstore-blk as broken
pstore-blk just pokes directly into the pagecache for the block
device without going through the file operations for that by faking
up it's own file operations that do not match the block device ones.

As this breaks the control of the block layer of it's page cache,
and even now just works by accident only the best thing is to just
disable this driver.

Fixes: 17639f67c1 ("pstore/blk: Introduce backend for block devices")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210608161327.1537919-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:26:03 -06:00
Zhen Lei ec1e7e8853 z2ram: remove unnecessary oom message
Fixes scripts/checkpatch.pl warning:
WARNING: Possible unnecessary 'out of memory' message

Remove it can help us save a bit of memory.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-09 09:41:55 -06:00
Zhen Lei c744b06254 sx8: remove unnecessary oom message
Fixes scripts/checkpatch.pl warning:
WARNING: Possible unnecessary 'out of memory' message

Remove it can help us save a bit of memory.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-09 09:41:53 -06:00
Zhen Lei 6597efa6c5 sunvdc: remove unnecessary oom message
Fixes scripts/checkpatch.pl warning:
WARNING: Possible unnecessary 'out of memory' message

Remove it can help us save a bit of memory.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-09 09:41:52 -06:00
Zhen Lei ce9a8ca68a mtip32xx: remove unnecessary oom message
Fixes scripts/checkpatch.pl warning:
WARNING: Possible unnecessary 'out of memory' message

Remove it can help us save a bit of memory.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-09 09:41:51 -06:00
Zhen Lei 8404e19194 drbd: remove unnecessary oom message
Fixes scripts/checkpatch.pl warning:
WARNING: Possible unnecessary 'out of memory' message

Remove it can help us save a bit of memory.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-09 09:41:49 -06:00
Zhen Lei 76cdb09b38 aoe: remove unnecessary oom message
Fixes scripts/checkpatch.pl warning:
WARNING: Possible unnecessary 'out of memory' message

Remove it can help us save a bit of memory.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-09 09:41:48 -06:00
Jens Axboe 600abd3401 nvme updates for Linux 5.14
- improve the APST configuration algorithm (Alexey Bogoslavsky)
  - look for StorageD3Enable on companion ACPI device (Mario Limonciello)
  - allow selecting the network interface for TCP connections
    (Martin Belanger)
  - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King, me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmC/feMLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNddw/8CxvuW2rsPYp3ixhmCKP11e8o0mVeNjQBWp4ifeoN
 WutS/5e4mJzVdPIAzPyTh7CxCVXQLezJ2ucx9PKFxMvOPOCGAO/sq/QVeItoroLX
 ZH+hZWF9AVPBzKgyUoCQ25B+vCc0EeoMDvjraOHwlcgszJeW1WWbavuoLLfAbjjM
 xdG5TyNWijt0656Ru18NIYGqb3c2+ZZFC8ZMfoebH+uBDX2UI8QQxXfr7FErzt7s
 bjPWRYH629WOpqmi4ckFSFsffCBkBxrQM6Ef0rsOiSDo7nkvDTpUxNMvKMuMUD8T
 UmOTHwWtGD2/bXgA0u4R5xg9ky1g7md/lenoB1ReeG/rZVO76H1e6QRa8/W8i1U2
 ld6Ge9K8iqrtYP93+tvQOg7IbtMWHWDhUVma9J8RQprptA4e0790VQagyp26UKuh
 CHmCvhov+6uwSTr/w9HWE+LRnuRY9rgaLlwqn2itos07UhwRHS5mSeO9Eb/WtTfW
 dze8nPXUBsl/8jorsXQpZlKn6fAPJSSUlVnzXJecWDoWYwBuMeZMCNVEQGqZQYzg
 Y29STWKOQav2QHhJ8K7fTcWwx7fSE0JGOWY0fuF2okvhT069B14RSURRiXfYuJK2
 IMRddEdwrrFztvoYQDy8o8xksNLnpolBpT+SIE+Y80XZ8e5sohOXVbsaqP7YQhYb
 4W0=
 =OwBh
 -----END PGP SIGNATURE-----

Merge tag 'nvme-5.14-2021-06-08' of git://git.infradead.org/nvme into for-5.14/drivers

Pull NVMe updates from Christoph:

"nvme updates for Linux 5.14

 - improve the APST configuration algorithm (Alexey Bogoslavsky)
 - look for StorageD3Enable on companion ACPI device (Mario Limonciello)
 - allow selecting the network interface for TCP connections
   (Martin Belanger)
 - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King, me)"

* tag 'nvme-5.14-2021-06-08' of git://git.infradead.org/nvme:
  nvmet: remove a superfluous variable
  nvmet: move ka_work initialization to nvmet_alloc_ctrl
  nvme: remove nvme_{get,put}_ns_from_disk
  nvme: split nvme_report_zones
  nvme: move the CSI sanity check into nvme_ns_report_zones
  nvme: add a sparse annotation to nvme_ns_head_ctrl_ioctl
  nvme: open code nvme_put_ns_from_disk in nvme_ns_head_ctrl_ioctl
  nvme: open code nvme_{get,put}_ns_from_disk in nvme_ns_head_ioctl
  nvme: open code nvme_put_ns_from_disk in nvme_ns_head_chr_ioctl
  nvme-fabrics: remove extra braces
  nvme-fabrics: remove an extra comment
  nvme-fabrics: remove extra new lines in the switch
  nvme-fabrics: fix the kerneldco comment for nvmf_log_connect_error()
  nvme-tcp: allow selecting the network interface for connections
  nvme-pci: look for StorageD3Enable on companion ACPI device instead
  nvme: extend and modify the APST configuration algorithm
  nvme: remove redundant initialization of variable ret
2021-06-08 15:02:21 -06:00
Chaitanya Kulkarni 346ac785ba nvmet: remove a superfluous variable
Remove the superfluous variable "bdev" that is only used once in the
nvmet_bdev_alloc_bip() and use req->ns->bdev that is used everywhere in
the code to access the nvmet request's bdev.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:26 +03:00
Amit Engel f6e8bd59c4 nvmet: move ka_work initialization to nvmet_alloc_ctrl
Initialize keep-alive work only once, as part of alloc_ctrl
and not each time that nvmet_start_keep_alive_timer is being called

Signed-off-by: Amit Engel <amit.engel@dell.com>
Reviewed-by: Hou Pu <houpu.main@gmail.com>
2021-06-03 10:29:26 +03:00
Christoph Hellwig f1cf35e17e nvme: remove nvme_{get,put}_ns_from_disk
Now that only one caller is left remove the helpers by restructuring
nvme_pr_command so that it has two helpers for sending a command of to a
given nsid using either the ns_head for multipath, or the namespace
stored in the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-06-03 10:29:26 +03:00
Christoph Hellwig 8b4fb0f968 nvme: split nvme_report_zones
Split multipath support out of nvme_report_zones into a separate helper
and simplify the non-multipath version as a result.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-06-03 10:29:26 +03:00
Christoph Hellwig d8ca66e821 nvme: move the CSI sanity check into nvme_ns_report_zones
Move the CSI check into nvme_ns_report_zones to clean up the code
a little bit and prepare for further refactoring.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-06-03 10:29:25 +03:00
Christoph Hellwig 85b790a7ae nvme: add a sparse annotation to nvme_ns_head_ctrl_ioctl
Add the __releases annotation to tell sparse that nvme_ns_head_ctrl_ioctl
is expected to unlock head->srcu.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-06-03 10:29:25 +03:00
Christoph Hellwig 3e7d1a5516 nvme: open code nvme_put_ns_from_disk in nvme_ns_head_ctrl_ioctl
nvme_ns_head_ctrl_ioctl is always used on multipath nodes, so just call
srcu_read_unlock directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-06-03 10:29:25 +03:00
Christoph Hellwig 86b4284d98 nvme: open code nvme_{get,put}_ns_from_disk in nvme_ns_head_ioctl
nvme_ns_head_ioctl is always used on multipath nodes, no need to
deal with the de-multiplexers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-06-03 10:29:25 +03:00
Christoph Hellwig f423c85cd3 nvme: open code nvme_put_ns_from_disk in nvme_ns_head_chr_ioctl
nvme_ns_head_chr_ioctl is always used on multipath nodes, so just call
srcu_read_unlock and consolidate the two unlock paths.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-06-03 10:29:25 +03:00
Chaitanya Kulkarni 97ba6931ba nvme-fabrics: remove extra braces
No need to use the braces around ~ operator.

No functionality change in this patch.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:25 +03:00
Chaitanya Kulkarni 6f860c9225 nvme-fabrics: remove an extra comment
Remove the comment at the end of the switch that is not needed as
function is small enough.

No functionality change in this patch.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:25 +03:00
Chaitanya Kulkarni 63d20f54a3 nvme-fabrics: remove extra new lines in the switch
Remove the extra lines in the switch block that is not common practice
in the kernel code.

No functionality change in this patch.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:25 +03:00
Chaitanya Kulkarni 25e1de8c40 nvme-fabrics: fix the kerneldco comment for nvmf_log_connect_error()
Fix the comment style that matches existing code.

No functionality change in this patch.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:24 +03:00
Martin Belanger 3ede8f72a9 nvme-tcp: allow selecting the network interface for connections
In our application, we need a way to force TCP connections to go out a
specific IP interface instead of letting Linux select the interface
based on the routing tables.

Add the 'host-iface' option to allow specifying the interface to use.
When the option host-iface is specified, the driver uses the specified
interface to set the option SO_BINDTODEVICE on the TCP socket before
connecting.

This new option is needed in addtion to the existing host-traddr for
the following reasons:

Specifying an IP interface by its associated IP address is less
intuitive than specifying the actual interface name and, in some cases,
simply doesn't work. That's because the association between interfaces
and IP addresses is not predictable. IP addresses can be changed or can
change by themselves over time (e.g. DHCP). Interface names are
predictable [1] and will persist over time. Consider the following
configuration.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 100.0.0.100/24 scope global lo
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
    inet 100.0.0.100/24 scope global enp0s3
       valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
    inet 100.0.0.100/24 scope global enp0s8
       valid_lft forever preferred_lft forever

The above is a VM that I configured with the same IP address
(100.0.0.100) on all interfaces. Doing a reverse lookup to identify the
unique interface associated with 100.0.0.100 does not work here. And
this is why the option host_iface is required. I understand that the
above config does not represent a standard host system, but I'm using
this to prove a point: "We can never know how users will configure
their systems". By te way, The above configuration is perfectly fine
by Linux.

The current TCP implementation for host_traddr performs a
bind()-before-connect(). This is a common construct to set the source
IP address on a TCP socket before connecting. This has no effect on how
Linux selects the interface for the connection. That's because Linux
uses the Weak End System model as described in RFC1122 [2]. On the other
hand, setting the Source IP Address has benefits and should be supported
by linux-nvme. In fact, setting the Source IP Address is a mandatory
FedGov requirement (e.g. connection to a RADIUS/TACACS+ server).
Consider the following configuration.

$ ip addr list dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
       valid_lft 426sec preferred_lft 426sec
    inet 192.168.56.102/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever
    inet 192.168.56.103/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever
    inet 192.168.56.104/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever

Here we can see that several addresses are associated with interface
enp0s8. By default, Linux always selects the default IP address,
192.168.56.101, as the source address when connecting over interface
enp0s8. Some users, however, want the ability to specify a different
source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
host_traddr can be used as-is to perform this function.

In conclusion, I believe that we need 2 options for TCP connections.
One that can be used to specify an interface (host-iface). And one that
can be used to set the source address (host-traddr). Users should be
allowed to use one or the other, or both, or none. Of course, the
documentation for host_traddr will need some clarification. It should
state that when used for TCP connection, this option only sets the
source address. And the documentation for host_iface should say that
this option is only available for TCP connections.

References:
[1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
[2] https://tools.ietf.org/html/rfc1122

Tested both IPv4 and IPv6 connections.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:24 +03:00
Mario Limonciello e21e0243e7 nvme-pci: look for StorageD3Enable on companion ACPI device instead
The documentation around the StorageD3Enable property hints that it
should be made on the PCI device.  This is where newer AMD systems set
the property and it's required for S0i3 support.

So rather than look for nodes of the root port only present on Intel
systems, switch to the companion ACPI device for all systems.
David Box from Intel indicated this should work on Intel as well.

Link: https://lore.kernel.org/linux-nvme/YK6gmAWqaRmvpJXb@google.com/T/#m900552229fa455867ee29c33b854845fce80ba70
Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro
Fixes: df4f9bc4fb ("nvme-pci: add support for ACPI StorageD3Enable property")
Suggested-by: Liang Prike <Prike.Liang@amd.com>
Acked-by: Raul E Rangel <rrangel@chromium.org>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: David E. Box <david.e.box@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:24 +03:00
Alexey Bogoslavsky ebd8a93aa4 nvme: extend and modify the APST configuration algorithm
The algorithm that was used until now for building the APST configuration
table has been found to produce entries with excessively long ITPT
(idle time prior to transition) for devices declaring relatively long
entry and exit latencies for non-operational power states. This leads
to unnecessary waste of power and, as a result, failure to pass
mandatory power consumption tests on Chromebook platforms.

The new algorithm is based on two predefined ITPT values and two
predefined latency tolerances. Based on these values, as well as on
exit and entry latencies reported by the device, the algorithm looks
for up to 2 suitable non-operational power states to use as primary
and secondary APST transition targets. The predefined values are
supplied to the nvme driver as module parameters:

 - apst_primary_timeout_ms (default: 100)
 - apst_secondary_timeout_ms (default: 2000)
 - apst_primary_latency_tol_us (default: 15000)
 - apst_secondary_latency_tol_us (default: 100000)

The algorithm echoes the approach used by Intel's and Microsoft's drivers
on Windows. The specific default parameter values are also based on those
drivers. Yet, this patch doesn't introduce the ability to dynamically
regenerate the APST table in the event of switching the power source from
AC to battery and back. Adding this functionality may be considered in the
future. In the meantime, the timeouts and tolerances reflect a compromise
between values used by Microsoft for AC and battery scenarios.

In most NVMe devices the new algorithm causes them to implement a more
aggressive power saving policy. While beneficial in most cases, this
sometimes comes at the price of a higher IO processing latency in certain
scenarios as well as at the price of a potential impact on the drive's
endurance (due to more frequent context saving when entering deep non-
operational states). So in order to provide a fallback for systems where
these regressions cannot be tolerated, the patch allows to revert to
the legacy behavior by setting either apst_primary_timeout_ms or
apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
fine tuning the default values of the module parameters) the legacy behavior
can be removed.

TESTING.

The new algorithm has been extensively tested. Initially, simulations were
used to compare APST tables generated by old and new algorithms for a wide
range of devices. After that, power consumption, performance and latencies
were measured under different workloads on devices from multiple vendors
(WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests
and the findings.

General observations.
The effect the patch has on the APST table varies depending on the entry and
exit latencies advertised by the devices. For some devices, the effect is
negligible (e.g. Kioxia KBG40ZNS), for some significant, making the
transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making
the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g.
SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed:
the initial transition happens after a longer idle time, but takes the device
to a lower power state.

Workflows.
In order to evaluate the patch's effect on the power consumption and latency,
7 workflows were used for each device. The workflows were designed to test
the scenarios where significant differences between the old and new behaviors
are most likely. Each workflow was tested twice: with the new and with the
old APST table generation implementation. Power consumption, performance and
latency were measured in the process. The following workflows were used:
1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
   idle time
3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
   idle time
4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
   idle time
5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
   idle time
6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
   idle time
7) Repeated pattern of a single random read of a 4K packet followed by 150ms
   idle time

Power consumption
Actual power consumption measurements produced predictable results in
accordance with the APST mechanism's theory of operation.
Devices with long entry and exit latencies such as WD SN530 showed huge
improvement on scenarios 4,5 and 6 of up to 62%. Devices such as Kioxia
KBG40ZNS where the resulting APST table looks virtually identical with
both legacy and new algorithms, showed little or no change in the average power
consumption on all workflows. Devices with extra short latencies such as
Samsung PM991 showed moderate increase in power consumption of up to 18% in
worst case scenarios.
In addition, on Intel and Samsung devices a more complex impact was observed
on scenarios 3, 4 and 7. Our understanding is that due to longer stay in deep
non-operational states between the writes the devices start performing background
operations leading to an increase of power consumption. With the old APST tables
part of these operations are delayed until the scenario is over and a longer idle
period begins, but eventually this extra power is consumed anyway.

Performance.
In terms of performance measured on sustained write or read scenarios, the
effect of the patch is minimal as in this case the device doesn't enter low power
states.

Latency
As expected, in devices where the patch causes a more aggressive power saving
policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
certain scenarios. Workflow number 7, specifically designed to simulate the
worst case scenario as far as latency is concerned, indeed shows a sharp
increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6 -> 10ms on
WD SN530). The latency increase on other workloads and other devices is much
milder or non-existent.

Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:24 +03:00
Colin Ian King 13ce7e625a nvme: remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never read,
it is being updated later on. The assignment is redundant and can be
removed.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:24 +03:00
Gustavo A. R. Silva 8184035805 rsxx: Use struct_size() in vmalloc()
Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows
that, in the worst scenario, could lead to heap overflows.

This code was detected with the help of Coccinelle and, audited and
fixed manually.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20210513203730.GA212128@embeddedor
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-24 06:47:30 -06:00
John Garry d97e594c51 blk-mq: Use request queue-wide tags for tagset-wide sbitmap
The tags used for an IO scheduler are currently per hctx.

As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.

This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.

Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b

In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.

Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).

For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with the possibly with swapping out a new sbitmap for old if
we need to grow.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-24 06:47:22 -06:00
John Garry 56b68085e5 blk-mq: Some tag allocation code refactoring
The tag allocation code to alloc the sbitmap pairs is common for regular
bitmaps tags and shared sbitmap, so refactor into a common function.

Also remove superfluous "flags" argument from blk_mq_init_shared_sbitmap().

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-24 06:47:22 -06:00
Ming Lei 364b61818f blk-mq: clearing flush request reference in tags->rqs[]
Before we free request queue, clearing flush request reference in
tags->rqs[], so that potential UAF can be avoided.

Based on one patch written by David Jeffery.

Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-24 06:47:22 -06:00