The STEC S1220 PCIe SSD cards are EOL since 2014 and not supported by
the vendor anymore. As the skd driver for this SSD is starting to cause
problems with improvements to the block layer, stop supporting it in
newer kernel versions.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This issue was originally fixed in 09954bad4 ("floppy: refactor open()
flags handling").
The fix as a side-effect, however, introduce issue for open(O_ACCMODE)
that is being used for ioctl-only open. I wrote a fix for that, but
instead of it being merged, full revert of 09954bad4 was performed,
re-introducing the O_NDELAY / O_NONBLOCK issue, and it strikes again.
This is a forward-port of the original fix to current codebase; the
original submission had the changelog below:
====
Commit 09954bad4 ("floppy: refactor open() flags handling"), as a
side-effect, causes open(/dev/fdX, O_ACCMODE) to fail. It turns out that
this is being used setfdprm userspace for ioctl-only open().
Reintroduce back the original behavior wrt !(FMODE_READ|FMODE_WRITE)
modes, while still keeping the original O_NDELAY bug fixed.
Link: https://lore.kernel.org/r/nycvar.YFH.7.76.2101221209060.5622@cbobk.fhfr.pm
Cc: stable@vger.kernel.org
Reported-by: Wim Osterholt <wim@djo.tudelft.nl>
Tested-by: Wim Osterholt <wim@djo.tudelft.nl>
Reported-and-tested-by: Kurt Garloff <kurt@garloff.de>
Fixes: 09954bad4 ("floppy: refactor open() flags handling")
Fixes: f2791e7ead ("Revert "floppy: refactor open() flags handling"")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Denis Efremov <efremov@linux.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAUXQsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppO4EAClcqoneAuhT4UvRVNxblXPhPaoC69aNgXd
s+34uQSCqeWrWIAokfKp8bh3kyRqe00591auA7DwtwNqGpWuIECX8o9QvROEkuxv
0o4JFGMTHOJKP1W79Oy3RpF5oee6rMMOQN7EFL272p2xd8NRCP33c4fKvJRz+DDE
0kCcZhVjca0nZ+9OJC+WAlV+dit3azCAKSp7cItJsdOgZL74ZcGECm0pA8RpStyi
tQrUr2yiHLkm1lcOYfid0fG2/5a4vAGZQav+EshOWYw9UGeMquq/aqPuZZtEUjKe
oEECACfJ9cWErsi1CirIk5j5RKHOHmFSG3kRAmyvFB4f3YDGYxerI7eodWjNA0d5
38wW96sWuV4l0ShPmD3jGWIDTTcDZh4nEImCObf5YJFbr2fQXofWVWseIyo0zG8Y
zDa1N/M7XgkrScX8OF33NC1uv/oExhHA7jXuQN6mRBESYjcCrH2Lf6mXAA2C8u4T
z1RaG7ckRXGSbV3ol1ROrHj0RTXQ3zeIHj3yMRU8TKH0z6s+ob46D2PZCLi6cLvI
IuELhzKsS1EzMSVsYk9/AegynWFjVCRJoVUVxTsrxfGEF7attwmur3lOAjbZwSWb
jXlRbrkgBL1Pwbjg8AODEoq0jJgVM/S/3fG2rpcYLwwYC+FQ73/K+URmEuMsqkFC
GrYllTSMFg==
=hb7W
-----END PGP SIGNATURE-----
Merge tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"All over the place fixes for this release:
- blk-cgroup iteration teardown resched fix (Baolin)
- NVMe pull request from Christoph:
- add another Write Zeroes quirk (Chaitanya Kulkarni)
- handle a no path available corner case (Daniel Wagner)
- use the proper RCU aware list_add helper (Chao Leng)
- bcache regression fix (Coly)
- bdev->bd_size_lock IRQ fix. This will be fixed in drivers for 5.12,
but for now, we'll make it IRQ safe (Damien)
- null_blk zoned init fix (Damien)
- add_partition() error handling fix (Dinghao)
- s390 dasd kobject fix (Jan)
- nbd fix for freezing queue while adding connections (Josef)
- tag queueing regression fix (Ming)
- revert of a patch that inadvertently meant that we regressed write
performance on raid (Maxim)"
* tag 'block-5.11-2021-01-29' of git://git.kernel.dk/linux-block:
null_blk: cleanup zoned mode initialization
nvme-core: use list_add_tail_rcu instead of list_add_tail for nvme_init_ns_head
nvme-multipath: Early exit if no path is available
nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a SPCC device
bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES
block: fix bd_size_lock use
blk-cgroup: Use cond_resched() when destroy blkgs
Revert "block: simplify set_init_blocksize" to regain lost performance
nbd: freeze the queue while we're adding connections
s390/dasd: Fix inconsistent kobject removal
block: Fix an error handling in add_partition
blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue
To avoid potential compilation problems, replaced the badly written
MB_TO_SECTS() macro (missing parenthesis around the argument use) with
the inline function mb_to_sects(). And while at it, simplify the
calculation of the total number of zones of the device using the
round_up() macro.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYBLX8QAKCRCAXGG7T9hj
vr2TAP4ylwxPVbf1l1V8zYCtCoNg087+Ubolr5kXXJkesG/nkgD6A2ix2oN1sC0Z
kbFBeZHqgP4AbVl7IhBALVFa1GPxWQg=
=NPGM
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.11-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- A fix for a regression introduced in 5.11 resulting in Xen dom0
having problems to correctly initialize Xenstore.
- A fix for avoiding WARN splats when booting as Xen dom0 with
CONFIG_AMD_MEM_ENCRYPT enabled due to a missing trap handler for the
#VC exception (even if the handler should never be called).
- A fix for the Xen bklfront driver adapting to the correct but
unexpected behavior of new qemu.
* tag 'for-linus-5.11-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled
xen: Fix XenStore initialisation for XS_LOCAL
xen-blkfront: allow discard-* nodes to be optional
Open code drbd_req_make_private_bio in the two callers to prepare
for further changes. Also don't bother to initialize bi_next as the
bio code already does that that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Given that drbd_md_io_bio_set is initialized during module initialization
and the module fails to load if the initialization fails there is no need
to fall back to plain bio_alloc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix below warnings reported by coccicheck:
./drivers/block/rsxx/dma.c:948:3-8: WARNING: NULL check
before some freeing functions is not needed.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <abaci-bugfix@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
fixed the below warning:
/drivers/block/zram/zram_drv.c:534:2-8: WARNING: NULL check
before some freeing functions is not needed.
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can remove start_jif since it is not used by drbd_request_prepare,
then remove it from __drbd_make_request further.
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: drbd-dev@lists.linbit.com
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace pci_read_config_word() with pcie_capability_read_word().
pcie_capability_read_word() takes care of a few special cases when reading
the PCIe capability. See 8c0d3a02c1 ("PCI: Add accessors for PCI Express
Capability").
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use PCI #defines for PCIe Device Control register values instead of
hard-coding bit positions. No functional change intended.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, loop device has only one global lock: loop_ctl_mutex.
This becomes hot in scenarios where many loop devices are used.
Scale it by introducing per-device lock: lo_mutex that protects
modifications of all fields in struct loop_device.
Keep loop_ctl_mutex to protect global data: loop_index_idr, loop_lookup,
loop_add.
The new lock ordering requirement is that loop_ctl_mutex must be taken
before lo_mutex.
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is inline with the specification described in blkif.h:
* discard-granularity: should be set to the physical block size if
node is not present.
* discard-alignment, discard-secure: should be set to 0 if node not
present.
This was detected as QEMU would only create the discard-granularity
node but not discard-alignment, and thus the setup done in
blkfront_setup_discard would fail.
Fix blkfront_setup_discard to not fail on missing nodes, and also fix
blkif_set_queue_limits to set the discard granularity to the physical
block size if none is specified in xenbus.
Fixes: ed30bf317c ('xen-blkfront: Handle discard requests.')
Reported-by: Arthur Borsboom <arthurborsboom@gmail.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-By: Arthur Borsboom <arthurborsboom@gmail.com>
Link: https://lore.kernel.org/r/20210119105727.95173-1-roger.pau@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
When setting up a device, we can krealloc the config->socks array to add
new sockets to the configuration. However if we happen to get a IO
request in at this point even though we aren't setup we could hit a UAF,
as we deref config->socks without any locking, assuming that the
configuration was setup already and that ->socks is safe to access it as
we have a reference on the configuration.
But there's nothing really preventing IO from occurring at this point of
the device setup, we don't want to incur the overhead of a lock to
access ->socks when it will never change while the device is running.
To fix this UAF scenario simply freeze the queue if we are adding
sockets. This will protect us from this particular case without adding
any additional overhead for the normal running case.
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device. From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer already checks for this conditions in bio_check_eod
before calling the driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that the hardware read-only state can't be changed by the BLKROSET
ioctl, the code in this method is not required anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We had kernel panic, it is caused by unload module and last
close confirmation.
call trace:
[1196029.743127] free_sess+0x15/0x50 [rtrs_client]
[1196029.743128] rtrs_clt_close+0x4c/0x70 [rtrs_client]
[1196029.743129] ? rnbd_clt_unmap_device+0x1b0/0x1b0 [rnbd_client]
[1196029.743130] close_rtrs+0x25/0x50 [rnbd_client]
[1196029.743131] rnbd_client_exit+0x93/0xb99 [rnbd_client]
[1196029.743132] __x64_sys_delete_module+0x190/0x260
And in the crashdump confirmation kworker is also running.
PID: 6943 TASK: ffff9e2ac8098000 CPU: 4 COMMAND: "kworker/4:2"
#0 [ffffb206cf337c30] __schedule at ffffffff9f93f891
#1 [ffffb206cf337cc8] schedule at ffffffff9f93fe98
#2 [ffffb206cf337cd0] schedule_timeout at ffffffff9f943938
#3 [ffffb206cf337d50] wait_for_completion at ffffffff9f9410a7
#4 [ffffb206cf337da0] __flush_work at ffffffff9f08ce0e
#5 [ffffb206cf337e20] rtrs_clt_close_conns at ffffffffc0d5f668 [rtrs_client]
#6 [ffffb206cf337e48] rtrs_clt_close at ffffffffc0d5f801 [rtrs_client]
#7 [ffffb206cf337e68] close_rtrs at ffffffffc0d26255 [rnbd_client]
#8 [ffffb206cf337e78] free_sess at ffffffffc0d262ad [rnbd_client]
#9 [ffffb206cf337e88] rnbd_clt_put_dev at ffffffffc0d266a7 [rnbd_client]
The problem is both code path try to close same session, which lead to
panic.
To fix it, just skip the sess if the refcount already drop to 0.
Fixes: f7a7a5c228 ("block/rnbd: client: main functionality")
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Adding name to the Contributors List
Signed-off-by: Swapnil Ingle <ingleswapnil@gmail.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Acked-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
lkp reboot following build error:
drivers/block/rnbd/rnbd-clt.c: In function 'rnbd_softirq_done_fn':
>> drivers/block/rnbd/rnbd-clt.c:387:2: error: implicit declaration of function 'sg_free_table_chained' [-Werror=implicit-function-declaration]
387 | sg_free_table_chained(&iu->sgt, RNBD_INLINE_SG_CNT);
| ^~~~~~~~~~~~~~~~~~~~~
The reason is CONFIG_SG_POOL is not enabled in the config, to
avoid such failure, select SG_POOL in Kconfig for RNBD_CLIENT.
Fixes: 5a1328d0c3 ("block/rnbd-clt: Dynamically allocate sglist for rnbd_iu")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Without crc32, the driver fails to link:
arm-linux-gnueabi-ld: drivers/block/rsxx/config.o: in function `rsxx_load_config':
config.c:(.text+0x124): undefined reference to `crc32_le'
Fixes: 8722ff8cdb ("block: IBM RamSan 70/80 device driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/kHXgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpkhFEADYuiBZbYEonaV4/nqOhMZ6lXj99rEqVZui
AOMm7W8nopb97pWy0sJxZHPPMnjglubkTZbX/2TH08ndppGBQAa9HsgDIITO5Ap3
vjnew6oPsrMtxMUMqR8w8nMb4z5tfqpvUYRPd+qpvYSYLEeSh0UNAEdQ/MOGbA1t
nl8UPdNW/s1MbeyLJhU+NgGM0aZahED8KuJeVLOY2im6dySO+CoeoB0/mWrdc0PZ
SDcGBEjhmFspDVAkW4Wo8bMw7Cr72es0esvJSJyx0mlo0jSCR7ZYParDtPwuAl9H
thKo+brLibz9M+wRtW+7w37oUADTTRL+KV2xJ/J+DeAVjI7/hxgZKewh6hEORmrF
DwjbS8cnEVp70rfF7+Z9FV/+2GJXm1uI/swBi69Y42CQ33FNLZmikBW6Q3pimrDj
ZeQSCLRpGlo01xtJpsm8L71KGSfYd+njWaIFnESPhxJ+sTxd8kvFRsdjHlc4KNBG
UnMzvn6pE3WNhzJJoIqdM2uXa4XDyKkMfO7VbpUgbyij103jpbs7ruWXq3DDU5/t
Gdwa821zlq5azDYAw7PNb7FhKrsHhiKhOuxyKqJUbOUkOHBNwxJmxXcaMnR3fOSi
B+AlqYhC6A9DhkL0HG0QLcdFwYawpOsDxbFUbemt/UXn74lAjRnp8+rNMSn5tjbn
OCnk4Tpaww==
=Q42d
-----END PGP SIGNATURE-----
Merge tag 'block-5.11-2020-12-23' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few stragglers in here, but mostly just straight fixes. In
particular:
- Set of rnbd fixes for issues around changes for the merge window
(Gioh, Jack, Md Haris Iqbal)
- iocost tracepoint addition (Baolin)
- Copyright/maintainers update (Christoph)
- Remove old blk-mq fast path CPU warning (Daniel)
- loop max_part fix (Josh)
- Remote IPI threaded IRQ fix (Sebastian)
- dasd stable fixes (Stefan)
- bcache merge window fixup and style fixup (Yi, Zheng)"
* tag 'block-5.11-2020-12-23' of git://git.kernel.dk/linux-block:
md/bcache: convert comma to semicolon
bcache:remove a superfluous check in register_bcache
block: update some copyrights
block: remove a pointless self-reference in block_dev.c
MAINTAINERS: add fs/block_dev.c to the block section
blk-mq: Don't complete on a remote CPU in force threaded mode
s390/dasd: fix list corruption of lcu list
s390/dasd: fix list corruption of pavgroup group list
s390/dasd: prevent inconsistent LCU device data
s390/dasd: fix hanging device offline processing
blk-iocost: Add iocg idle state tracepoint
nbd: Respect max_part for all partition scans
block/rnbd-clt: Does not request pdu to rtrs-clt
block/rnbd-clt: Dynamically allocate sglist for rnbd_iu
block/rnbd: Set write-back cache and fua same to the target device
block/rnbd: Fix typos
block/rnbd-srv: Protect dev session sysfs removal
block/rnbd-clt: Fix possible memleak
block/rnbd-clt: Get rid of warning regarding size argument in strlcpy
blk-mq: Remove 'running from the wrong CPU' warning
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCX92dmQAKCRCAXGG7T9hj
vqfiAQCWM8eVftAiTxuQGzdYHYBl5p57UtOc3n0Ap8cHbl9SOwEAqlLmQEopnbHT
P4WzH/PegsCU5BopSb+NgXkBesMtPgk=
=Yjsy
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.11-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
"Some minor cleanup patches and a small series disentangling some Xen
related Kconfig options"
* tag 'for-linus-5.11-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: Kconfig: remove X86_64 depends from XEN_512GB
xen/manage: Fix fall-through warnings for Clang
xen-blkfront: Fix fall-through warnings for Clang
xen: remove trailing semicolon in macro definition
xen: Kconfig: nest Xen guest options
xen: Remove Xen PVH/PVHVM dependency on PCI
x86/xen: Convert to DEFINE_SHOW_ATTRIBUTE
- Switch to the generic C VDSO, as well as some cleanups of our VDSO
setup/handling code.
- Support for KUAP (Kernel User Access Prevention) on systems using the hashed
page table MMU, using memory protection keys.
- Better handling of PowerVM SMT8 systems where all threads of a core do not
share an L2, allowing the scheduler to make better scheduling decisions.
- Further improvements to our machine check handling.
- Show registers when unwinding interrupt frames during stack traces.
- Improvements to our pseries (PowerVM) partition migration code.
- Several series from Christophe refactoring and cleaning up various parts of
the 32-bit code.
- Other smaller features, fixes & cleanups.
Thanks to:
Alan Modra, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Ard
Biesheuvel, Athira Rajeev, Balamuruhan S, Bill Wendling, Cédric Le Goater,
Christophe Leroy, Christophe Lombard, Colin Ian King, Daniel Axtens, David
Hildenbrand, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geert
Uytterhoeven, Giuseppe Sacco, Greg Kurz, Harish, Jan Kratochvil, Jordan
Niethe, Kaixu Xia, Laurent Dufour, Leonardo Bras, Madhavan Srinivasan, Mahesh
Salgaonkar, Mathieu Desnoyers, Nathan Lynch, Nicholas Piggin, Oleg Nesterov,
Oliver O'Halloran, Oscar Salvador, Po-Hsu Lin, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sandipan Das, Sebastian Andrzej Siewior ,
Segher Boessenkool, Srikar Dronamraju, Tyrel Datwyler, Uwe Kleine-König,
Vincent Stehlé, Youling Tang, Zhang Xiaoxu.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl/bURITHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgEzBEAC1Vwibcog2P9rkJPb0q3UGWSYSx25V
h/LwwxtM9Tm14j/LZsSgkOgIsfMaWEBIw/8D4efQ7AX9aFo+R0c2DdQMx1MG5MXz
gZk58+l3LwId6h9+OrwurpEW+ZmURLAtGMSyFdkeiZ3/XTnkbf1XnewC0QWQe56a
EGLmjx1MFl45jspoy7UIUXsXoNJIfflEKhrgUzSUh8X2eLmvB9ws6A4BXxbVzyZl
lZv3+uWimU2pFgdkB9jOCxoG4zFEr2o5ovLHG7zCCVo5JoXmTPQ5cMVBraH206ms
+5vCmu4qI8uP5UlZW/mZfhrtDiMdHdQqqFOaQwVlOmoUbU6L6E6rxm3iVnov2Bbi
iUgxoeJDxAb2cM2EWFK6oWVgr7+NkwvXM1IG8xtprhHrCdnC9r+psQr/dswb3LSg
MJ7u/RCq3uixy2kWP8E0NEHw7ngQZ/ZKPqzfnmIWOC7tYUxgaL02I8Ff9/ZXAI2J
CnmqFYOjrimHkcwXGOtKkXNvfU0DiL97qpK2AQNWElE8+bUUmpw+ltUrsdSycYmv
Afc4WIcVrTA+a9laSLgjdZbolbNSa3p+cMIYdPrVx9g+xqygbxIKv+EDGNv1WHfD
GU1gmohMY+ZkMOvFRMi8LAsEm0DH/etWE0py/8uyxDYKnGyD1Ur6452DStkmgGb2
azmcaOyLdb+HXA==
=Ga3K
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Switch to the generic C VDSO, as well as some cleanups of our VDSO
setup/handling code.
- Support for KUAP (Kernel User Access Prevention) on systems using the
hashed page table MMU, using memory protection keys.
- Better handling of PowerVM SMT8 systems where all threads of a core
do not share an L2, allowing the scheduler to make better scheduling
decisions.
- Further improvements to our machine check handling.
- Show registers when unwinding interrupt frames during stack traces.
- Improvements to our pseries (PowerVM) partition migration code.
- Several series from Christophe refactoring and cleaning up various
parts of the 32-bit code.
- Other smaller features, fixes & cleanups.
Thanks to: Alan Modra, Alexey Kardashevskiy, Andrew Donnellan, Aneesh
Kumar K.V, Ard Biesheuvel, Athira Rajeev, Balamuruhan S, Bill Wendling,
Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King,
Daniel Axtens, David Hildenbrand, Frederic Barrat, Ganesh Goudar,
Gautham R. Shenoy, Geert Uytterhoeven, Giuseppe Sacco, Greg Kurz,
Harish, Jan Kratochvil, Jordan Niethe, Kaixu Xia, Laurent Dufour,
Leonardo Bras, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu
Desnoyers, Nathan Lynch, Nicholas Piggin, Oleg Nesterov, Oliver
O'Halloran, Oscar Salvador, Po-Hsu Lin, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sandipan Das, Sebastian Andrzej
Siewior , Segher Boessenkool, Srikar Dronamraju, Tyrel Datwyler, Uwe
Kleine-König, Vincent Stehlé, Youling Tang, and Zhang Xiaoxu.
* tag 'powerpc-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (304 commits)
powerpc/32s: Fix cleanup_cpu_mmu_context() compile bug
powerpc: Add config fragment for disabling -Werror
powerpc/configs: Add ppc64le_allnoconfig target
powerpc/powernv: Rate limit opal-elog read failure message
powerpc/pseries/memhotplug: Quieten some DLPAR operations
powerpc/ps3: use dma_mapping_error()
powerpc: force inlining of csum_partial() to avoid multiple csum_partial() with GCC10
powerpc/perf: Fix Threshold Event Counter Multiplier width for P10
powerpc/mm: Fix hugetlb_free_pmd_range() and hugetlb_free_pud_range()
KVM: PPC: Book3S HV: Fix mask size for emulated msgsndp
KVM: PPC: fix comparison to bool warning
KVM: PPC: Book3S: Assign boolean values to a bool variable
powerpc: Inline setup_kup()
powerpc/64s: Mark the kuap/kuep functions non __init
KVM: PPC: Book3S HV: XIVE: Add a comment regarding VP numbering
powerpc/xive: Improve error reporting of OPAL calls
powerpc/xive: Simplify xive_do_source_eoi()
powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_EOI_FW
powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_MASK_FW
powerpc/xive: Remove P9 DD1 flag XIVE_IRQ_FLAG_SHIFT_BUG
...
adds the option of full in-transit encryption using AES-GCM algorithm
(myself). On top of that we have a series to avoid intermittent
errors during recovery with recover_session=clean and some MDS request
encoding work from Jeff, a cap handling fix and assorted observability
improvements from Luis and Xiubo and a good number of cleanups. Luis
also ran into a corner case with quotas which sadly means that we are
back to denying cross-quota-realm renames.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl/beWITHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi4i0CACnvd87l2n7dndig7p5d5lVsmo8tAFs
wHYHaIVisWKMcqKoT+YJajSgzaonxjzvYiyCzwLxV7s7vI7cswAwjEfYT7tTDRp2
pnO1+4N/1ftznnTk/1QdqwOQLUg5UtdgWvFCaXQF+Vr/YroZomKJPaK8fXK882pC
9FBjoLNy1HWySsoXPCxJktmDzpEEyYRNJg0vquxm7mxwTgQErupWlwEFjNg5LBkm
gC0UoKhCE3DeUrXnoq21Ga62RIajxHofTooNx7dg+JiSVgluW+nORaWDYJXNzwLC
j5puSe4pWIah+gmcwIFuyNz4ddkvVL4URvsYPGkVFYXlEefQjErc10Jh
=6b9f
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The big ticket item here is support for msgr2 on-wire protocol, which
adds the option of full in-transit encryption using AES-GCM algorithm
(myself).
On top of that we have a series to avoid intermittent errors during
recovery with recover_session=clean and some MDS request encoding work
from Jeff, a cap handling fix and assorted observability improvements
from Luis and Xiubo and a good number of cleanups.
Luis also ran into a corner case with quotas which sadly means that we
are back to denying cross-quota-realm renames"
* tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client: (59 commits)
libceph: drop ceph_auth_{create,update}_authorizer()
libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
libceph: introduce connection modes and ms_mode option
libceph, rbd: ignore addr->type while comparing in some cases
libceph, ceph: get and handle cluster maps with addrvecs
libceph: factor out finish_auth()
libceph: drop ac->ops->name field
libceph: amend cephx init_protocol() and build_request()
libceph, ceph: incorporate nautilus cephx changes
libceph: safer en/decoding of cephx requests and replies
libceph: more insight into ticket expiry and invalidation
libceph: move msgr1 protocol specific fields to its own struct
libceph: move msgr1 protocol implementation to its own file
libceph: separate msgr1 protocol implementation
libceph: export remaining protocol independent infrastructure
libceph: export zero_page
libceph: rename and export con->flags bits
libceph: rename and export con->state states
libceph: make con->state an int
...
The creation path of the NBD device respects max_part and only scans for
partitions if max_part is not 0. However, some other code paths ignore
max_part, and unconditionally scan for partitions. Add a check for
max_part on each partition scan.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Previously the rnbd client requested the rtrs to allocate rnbd_iu
just after the rtrs_iu. So the rnbd client passes the size of
rnbd_iu for rtrs_clt_open() and rtrs creates an array of
rnbd_iu and rtrs_iu.
For IO handling, rnbd_iu exists after the request because we pass
the size of rnbd_iu when setting the tag-set. Therefore we do not
use the rnbd_iu allocated by rtrs for IO handling.
We only use the rnbd_iu allocated by rtrs when doing session
initialization. Almost all rnbd_iu allocated by rtrs are wasted.
By this patch the rnbd client does not request rnbd_iu allocation
to rtrs but allocate it for itself when doing session initialization.
Also remove unused rtrs_permit_to_pdu from rtrs.
Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The BMAX_SEGMENT static array for scatterlist is embedded in
rnbd_iu structure to avoid memory allocation in hot IO path.
In many cases, we do need only several sg entries because many IOs
have only several segments.
This patch change rnbd_iu to check the number of segments in the request
and allocate sglist dynamically.
For io path, use sg_alloc_table_chained to allocate sg list faster.
First it makes two sg entries after pdu of request.
The sg_alloc_table_chained uses the pre-allocated sg entries
if the number of segments of the request is less than two.
So it reduces the number of memory allocation.
Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The rnbd-client always sets the write-back cache and fua attributes
of the rnbd device queue regardless of the target device on the server.
That generates IO hang issue when the target device does not
support both of write-back cacne and fua.
This patch adds more fields for the cache policy and fua into the
device opening message. The rnbd-server sends the information
if the target device supports the write-back cache and fua
and rnbd-client recevives it and set the device queue accordingly.
Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
[jwang: some minor change, rename a few varables, remove unrelated comments.]
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the removal of the session sysfs can also be called from the
function destroy_sess, there is a need to protect the call from the
function rnbd_srv_sess_dev_force_close
Fixes: 786998050c ("block/rnbd-srv: close a mapped device from server side.")
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In error case, we do not free the memory for blk_symlink_name.
Do it by free the memory in error case, and set to NULL
afterwards.
Also fix the condition in rnbd_clt_remove_dev_symlink.
Fixes: 64e8a6ece1 ("block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name")
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The kernel test robot triggerred the following warning,
>> drivers/block/rnbd/rnbd-clt.c:1397:42: warning: size argument in
'strlcpy' call appears to be size of the source; expected the size of the
destination [-Wstrlcpy-strlcat-size]
strlcpy(dev->pathname, pathname, strlen(pathname) + 1);
~~~~~~~^~~~~~~~~~~~~
To get rid of the above warning, use a kstrdup as Bart suggested.
Fixes: 64e8a6ece1 ("block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/XgdYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjTBD/4me2TNvGOogbcL0b1leAotndJ7spI/IcFM
NUMNy3pOGuRBcRjwle85xq44puAjlNkZE2LLatem5sT7ZvS+8lPNnOIoTYgfaCjt
PhKx2sKlLumVm3BwymYAPcPtke4fikGG15Mwu5nX1oOehmyGrjObGAr3Lo6gexCT
tQoCOczVqaTsV+iTXrLlmgEgs07J9Tm93uh2cNR8Jgroxb8ivuWeUq4YgbV4kWk+
Y8XvOyVE/yba0vQf5/hHtWuVoC6RdELnqZ6NCkcP/EicdBecwk1GMJAej1S3zPS1
0BT7GSFTpm3YUHcygD6LRmRg4I/BmWDTDtMi84+jLat6VvSG1HwIm//qHiCJh3ku
SlvFZENIWAv5LP92x2vlR5Lt7uE3GK2V/5Pxt2fekyzCth6mzu+hLH4CBPQ3xgyd
E1JqIQ/ilbXstp+EYoivV5x8yltZQnKEZRopws0EOqj1LsmDPj9XT1wzE9RnB0o+
PWu/DNhQFhhcmP7Z8uLgPiKIVpyGs+vjxiJLlTtGDFTCy6M5JbcgzGkEkSmnybxH
7lSanjpLt1dWj85FBMc6fNtJkv2rBPfb4+j0d1kZ45Dzcr4umirGIh7wtCHcgc83
brmXSt29hlKHseSHMMuNWK8haXcgAE7gq9tD8GZ/kzM7+vkmLLxHJa22Qhq5rp4w
URPeaBaQJw==
=ayp2
-----END PGP SIGNATURE-----
Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"Nothing major in here:
- NVMe pull request from Christoph:
- nvmet passthrough improvements (Chaitanya Kulkarni)
- fcloop error injection support (James Smart)
- read-only support for zoned namespaces without Zone Append
(Javier González)
- improve some error message (Minwoo Im)
- reject I/O to offline fabrics namespaces (Victor Gladkov)
- PCI queue allocation cleanups (Niklas Schnelle)
- remove an unused allocation in nvmet (Amit Engel)
- a Kconfig spelling fix (Colin Ian King)
- nvme_req_qid simplication (Baolin Wang)
- MD pull request from Song:
- Fix race condition in md_ioctl() (Dae R. Jeong)
- Initialize read_slot properly for raid10 (Kevin Vigor)
- Code cleanup (Pankaj Gupta)
- md-cluster resync/reshape fix (Zhao Heming)
- Move null_blk into its own directory (Damien Le Moal)
- null_blk zone and discard improvements (Damien Le Moal)
- bcache race fix (Dongsheng Yang)
- Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
Lutz Pogrell, Md Haris Iqbal)
- lightnvm NULL pointer deref fix (tangzhenhao)
- sr in_interrupt() removal (Sebastian Andrzej Siewior)
- FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
as it made it easier for them to funnel the feature through the
block driver tree.
- Follow up fixes (Colin Ian King)"
* tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
block: drop dead assignments in loop_init()
sr: Remove in_interrupt() usage in sr_init_command().
sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
cdrom: Reset sector_size back it is not 2048.
drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
null_blk: Move driver into its own directory
null_blk: Allow controlling max_hw_sectors limit
null_blk: discard zones on reset
null_blk: cleanup discard handling
null_blk: Improve implicit zone close
null_blk: improve zone locking
block: Align max_hw_sectors to logical blocksize
null_blk: Fail zone append to conventional zones
null_blk: Fix zone size initialization
bcache: fix race between setting bdev state to none and new write request direct to backing
block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
block/rnbd: call kobject_put in the failure path
Documentation/ABI/rnbd-srv: add document for force_close
block/rnbd-srv: close a mapped device from server side.
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/Xec8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoLbEACzXypgZWwMdfgRckA/Vt333rXHtbhUV+hK
2XP+P81iRvr9Esi31UPbRp82vrgcDO0cpI1QmQojS5U5TIQP88BfXptfRZZu48eb
wT5RDDNQ34HItqAh/yEuYsv9yUKcxeIrB99tBVvM+4UmQg9zTdIW3mg6PvCBdbhV
N38jI0tCF/PJatjfRuphT/nXonQLPWBlVDmZk06KZQFOwQe9ep1vUi1+nbiRPuo3
geFBpTh1Kp6Vl1B3n4RpECs6Y7I0RRuJdaH2sDizICla1/BW91F9fQwHimNnUxUq
e1Q1kMuh6ftcQGkYlHSYcPhuv6CvorldTZCO5arPxWpcwvxriTSMRPWAgUr5pEiF
fhiGhqeDu9e6vl9vS31wUD1B30hy+jFz9wyjRrDwJ3cPHH1JVBjTzvdX+cIh/1ku
IbIwUMteUtvUrzqAv/DzbGhedp7xWtOFaVo8j0QFYh9zkjd6b8yDOF/yztwX2gjY
Xt1cd+KpDSiN449ZRaoMI0sCJAxqzhMa6nsWlb0L7KuNyWKAbvKQBm9Rb47FLV9A
Vx70KC+zkFoyw23capvIahmQazerriUJ5PGe0lVm6ROgmIFdCpXTPDjnrvq/6RZ/
GEpD7gTW9atGJ7EuEE8686sAfKD5kneChWLX5EHXf0d0AG5Mr2lKsluiGp5LpPJg
Q1Xqs6xwww==
=zo4w
-----END PGP SIGNATURE-----
Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"Another series of killing more code than what is being added, again
thanks to Christoph's relentless cleanups and tech debt tackling.
This contains:
- blk-iocost improvements (Baolin Wang)
- part0 iostat fix (Jeffle Xu)
- Disable iopoll for split bios (Jeffle Xu)
- block tracepoint cleanups (Christoph Hellwig)
- Merging of struct block_device and hd_struct (Christoph Hellwig)
- Rework/cleanup of how block device sizes are updated (Christoph
Hellwig)
- Simplification of gendisk lookup and removal of block device
aliasing (Christoph Hellwig)
- Block device ioctl cleanups (Christoph Hellwig)
- Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)
- Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)
- sbitmap improvements (Pavel Begunkov)
- Hybrid polling fix (Pavel Begunkov)
- bvec iteration improvements (Pavel Begunkov)
- Zone revalidation fixes (Damien Le Moal)
- blk-throttle limit fix (Yu Kuai)
- Various little fixes"
* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
blk-mq: fix msec comment from micro to milli seconds
blk-mq: update arg in comment of blk_mq_map_queue
blk-mq: add helper allocating tagset->tags
Revert "block: Fix a lockdep complaint triggered by request queue flushing"
nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
block: disable iopoll for split bio
block: Improve blk_revalidate_disk_zones() checks
sbitmap: simplify wrap check
sbitmap: replace CAS with atomic and
sbitmap: remove swap_lock
sbitmap: optimise sbitmap_deferred_clear()
blk-mq: skip hybrid polling if iopoll doesn't spin
blk-iocost: Factor out the base vrate change into a separate function
blk-iocost: Factor out the active iocgs' state check into a separate function
blk-iocost: Move the usage ratio calculation to the correct place
blk-iocost: Remove unnecessary advance declaration
blk-iocost: Fix some typos in comments
blktrace: fix up a kerneldoc comment
block: remove the request_queue to argument request based tracepoints
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCX9iqjQAKCRCAXGG7T9hj
vvNXAQCNirQi0LTN7LGZvL2eDTg1SDcq7XII8YomcJl8UFBwvQD9HtRo5CwGzVKc
PbDqGdvZdl9+VyVyeEvZTv/baaollAo=
=MPh4
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Fixes for security issues just having been disclosed:
- a five patch series for fixing of XSA-349 (DoS via resource
depletion in Xen dom0)
- a patch fixing XSA-350 (access of stale pointer in a Xen dom0)"
* tag 'for-linus-5.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen-blkback: set ring->xenblkd to NULL after kthread_stop()
xenbus/xenbus_backend: Disallow pending watch messages
xen/xenbus: Count pending messages for each watch
xen/xenbus/xen_bus_type: Support will_handle watch callback
xen/xenbus: Add 'will_handle' callback support in xenbus_watch_path()
xen/xenbus: Allow watches discard events before queueing
From the beginning, the zram block device always enabled CRYPTO_LZO,
since lzo-rle is hardcoded as the fallback compression algorithm. As a
consequence, on systems where another compression algorithm is chosen
(e.g. CRYPTO_ZSTD), the lzo kernel module becomes unused, while still
having to be built/loaded.
This patch removes the hardcoded lzo-rle dependency and allows the user
to select the default compression algorithm for zram at build time. The
previous behaviour is kept, as the default algorithm is still lzo-rle.
Link: https://lkml.kernel.org/r/20201207121245.50529-1-rsalvaterra@gmail.com
Signed-off-by: Rui Salvaterra <rsalvaterra@gmail.com>
Suggested-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, zram supports the stat via /sys/block/zram/mm_stat to represent
how many of incompressible pages are stored at the moment but it couldn't
show how many times incompressible pages were wrote down since zram set
up. It's also good indication to see how zram is effective in the system.
Link: https://lkml.kernel.org/r/20201130201907.1284910-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is demand to writeback specific process pages to backing store
instead of all idles pages in the system due to storage wear out concerns
and to launching latency of apps which are most of the time idle but are
critical for resume latency.
This patch extends the writeback knob to support a specific page
writeback.
Link: https://lkml.kernel.org/r/20201020190506.3758660-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For libceph, this ensures that libceph instance sharing (share option)
continues to work. For rbd, this avoids blocklisting alive lock owners
(locker addr is always LEGACY, while watcher addr is ANY in nautilus).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When xen_blkif_disconnect() is called, the kernel thread behind the
block interface is stopped by calling kthread_stop(ring->xenblkd).
The ring->xenblkd thread pointer being non-NULL determines if the
thread has been already stopped.
Normally, the thread's function xen_blkif_schedule() sets the
ring->xenblkd to NULL, when the thread's main loop ends.
However, when the thread has not been started yet (i.e.
wake_up_process() has not been called on it), the xen_blkif_schedule()
function would not be called yet.
In such case the kthread_stop() call returns -EINTR and the
ring->xenblkd remains dangling.
When this happens, any consecutive call to xen_blkif_disconnect (for
example in frontend_changed() callback) leads to a kernel crash in
kthread_stop() (e.g. NULL pointer dereference in exit_creds()).
This is XSA-350.
Cc: <stable@vger.kernel.org> # 4.12
Fixes: a24fa22ce2 ("xen/blkback: don't use xen_blkif_get() in xen-blkback kthread")
Reported-by: Olivier Benjamin <oliben@amazon.com>
Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Some code does not directly make 'xenbus_watch' object and call
'register_xenbus_watch()' but use 'xenbus_watch_path()' instead. This
commit adds support of 'will_handle' callback in the
'xenbus_watch_path()' and it's wrapper, 'xenbus_watch_pathfmt()'.
This is part of XSA-349
Cc: stable@vger.kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reported-by: Michael Kurth <mku@amazon.de>
Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Commit 8410d38c25 ("loop: use __register_blkdev to allocate devices on
demand") simplified loop_init(); so computing the range of the block region
is not required anymore and can be dropped.
Drop dead assignments in loop_init().
As compilers will detect these unneeded assignments and optimize this,
the resulting object code is identical before and after this change.
No functional change. No change in object code.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of having similar helpers in multiple backend drivers use
common helpers for caching pages allocated via gnttab_alloc_pages().
Make use of those helpers in blkback and scsiback.
Cc: <stable@vger.kernel.org> # 5.9
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovksy@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Move null_blk driver code into the new sub-directory
drivers/block/null_blk.
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the module option and configfs attribute max_sectors to allow
configuring the maximum size of a command issued to a null_blk device.
This allows exercising the block layer BIO splitting with different
limits than the default BLK_SAFE_MAX_SECTORS. This is also useful for
testing the zone append write path of file systems as the max_hw_sectors
limit value is also used for the max_zone_append_sectors limit.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When memory backing is enabled, use null_handle_discard() to free the
backing memory used by a zone when the zone is being reset.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
null_handle_discard() is called from both null_handle_rq() and
null_handle_bio(). As these functions are only passed a nullb_cmd
structure, this forces pointer dereferences to identiify the discard
operation code and to access the sector range to be discarded.
Simplify all this by changing the interface of the functions
null_handle_discard() and null_handle_memory_backed() to pass along
the operation code, operation start sector and number of sectors. With
this change null_handle_discard() can be called directly from
null_handle_memory_backed().
Also add a message warning that the discard configuration attribute has
no effect when memory backing is disabled.
No functional change is introduced by this patch.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When open zone resource management is enabled, that is, when a null_blk
zoned device is created with zone_max_open different than 0, implicitly
or explicitly opening a zone may require implicitly closing a zone
that is already implicitly open. This operation is done using the
function null_close_first_imp_zone(), which search for an implicitly
open zone to close starting from the first sequential zone. This
implementation is simple but may result in the same being constantly
implicitly closed and then implicitly reopened on write, namely, the
lowest numbered zone that is being written.
Avoid this by starting the search for an implicitly open zone starting
from the zone following the last zone that was implicitly closed. The
function null_close_first_imp_zone() is renamed
null_close_imp_open_zone().
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With memory backing disabled, using a single spinlock for protecting
zone information and zone resource management prevents the parallel
execution on multiple queue of IO requests to different zones.
Furthermore, regardless of the use of memory backing, if a null_blk
device is created without limits on the number of open and active zones,
accounting for zone resource management is not necessary.
>From these observations, zone locking is changed as follows to improve
performance:
1) the zone_lock spinlock is renamed zone_res_lock and used only if
zone resource management is necessary, that is, if either
zone_max_open or zone_max_active are not 0. This is indicated using
the new boolean need_zone_res_mgmt in the nullb_device structure.
null_zone_write() is modified to reduce the amount of code executed
with the zone_res_lock spinlock held.
2) With memory backing disabled, per zone locking is changed to a
spinlock per zone.
3) Introduce the structure nullb_zone to replace the use of
struct blk_zone for zone information. This new structure includes a
union of a spinlock and a mutex for zone locking. The spinlock is
used when memory backing is disabled and the mutex is used with
memory backing.
With these changes, fio performance with zonemode=zbd for 4K random
read and random write on a dual socket (24 cores per socket) machine
using the none schedulder is as follows:
before patch:
write (psync x 96 jobs) = 465 KIOPS
read (libaio@qd=8 x 96 jobs) = 1361 KIOPS
after patch:
write (psync x 96 jobs) = 456 KIOPS
read (libaio@qd=8 x 96 jobs) = 4096 KIOPS
Write performance remains mostly unchanged but read performance is three
times higher. Performance when using the mq-deadline scheduler is not
changed by this patch as mq-deadline becomes the bottleneck for a
multi-queue device.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conventional zones do not have a write pointer and so cannot accept zone
append writes. Make sure to fail any zone append write command issued to
a conventional zone.
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: e0489ed5da ("null_blk: Support REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For a null_blk device with zoned mode enabled is currently initialized
with a number of zones equal to the device capacity divided by the zone
size, without considering if the device capacity is a multiple of the
zone size. If the zone size is not a divisor of the capacity, the zones
end up not covering the entire capacity, potentially resulting is out
of bounds accesses to the zone array.
Fix this by adding one last smaller zone with a size equal to the
remainder of the disk capacity divided by the zone size if the capacity
is not a multiple of the zone size. For such smaller last zone, the zone
capacity is also checked so that it does not exceed the smaller zone
size.
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: ca4b2a0119 ("null_blk: add zone support")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently in the case where dev->blk_symlink_name fails to be allocates
the error return path attempts to set an end-of-string character to
the unallocated dev->blk_symlink_name causing a null pointer dereference
error. Fix this by returning with an explicity ENOMEM error (which also
is missing in the original code as was not initialized).
Fixes: 1eb54f8f5d ("block/rnbd: client: sysfs interface functions")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Addresses-Coverity: ("Dereference after null check")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For every rnbd_clt_dev, we alloc the pathname and blk_symlink_name
statically to NAME_MAX which is 255 bytes. In most of the cases we only
need less than 10 bytes, so 500 bytes per block device are wasted.
This commit dynamically allocates memory buffer for pathname and
blk_symlink_name.
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Per the comment of kobject_init_and_add, we need to cleanup the memory
by call kobject_put.
Also we need to call kobject_del for the other failure cases if the
kobject_init_and_add doesn't fail.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The forceful close of an exported device is required
for the use case, when the client side hangs, is crashed,
or is not accessible.
There have been cases observed, where only some of
the devices are to be cleaned up, but the session shall
remain.
When the device is to be exported to a different
client host, server side cleanup is required.
Signed-off-by: Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Previously, we can't map same device name from different sessions
due to the limitation of sysfs naming mechanism.
root@clt2:~# ls -l /sys/class/rnbd-client/ctl/devices/
total 0
lrwxrwxrwx 1 root 0 Sep 2 16:31 !dev!nullb1 -> ../../../block/rnbd0
We only use the device name in above, which caused device with
the same name can't be mapped from another server. To address
the issue, the sessname is appended to the node to differentiate
where the device comes from.
Also, we need to check if the pathname is existed in a specific
session instead of search it in global sess_list.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
During map_device if the given session exists, then the path parameter is
not used. In such a case, the path parameter is redundant.
This commit makes the path parameter optional for map_device. When the
path parameter is not given, if the session exists then that is used to
establish the rtrs connection.
If the session does not exist, and the path parameter is also missing,
then map_device fails.
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The driver core ignores the return value of struct device_driver::remove
because there is only little that can be done. For the shutdown callback
it's ps3_system_bus_shutdown() which ignores the return value.
To simplify the quest to make struct device_driver::remove return void,
let struct ps3_system_bus_driver::remove return void, too. All users
already unconditionally return 0, this commit makes it obvious that
returning an error code is a bad idea and ensures future users behave
accordingly.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201126165950.2554997-2-u.kleine-koenig@pengutronix.de
We can just dereference the point in struct gendisk instead. Also
remove the now unused export.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use struct block_device to lookup partitions on a disk. This removes
all usage of struct hd_struct from the I/O path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allocate hd_struct together with struct block_device to pre-load
the lifetime rule changes in preparation of merging the two structures.
Note that part0 was previously embedded into struct gendisk, but is
a separate allocation now, and already points to the block_device instead
of the hd_struct. The lifetime of struct gendisk is still controlled by
the struct device embedded in the part0 hd_struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that the hd_struct always has a block device attached to it, there is
no need for having two size field that just get out of sync.
Additionally the field in hd_struct did not use proper serialization,
possibly allowing for torn writes. By only using the block_device field
this problem also gets fixed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stop passing the whole device as a separate argument given that it
can be trivially deducted and cleanup the !holder debug check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that each hd_struct has a reference to the corresponding
block_device, there is no need for the bd_contains pointer. Add
a bdev_whole() helper to look up the whole device block_device
struture instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch the block device lookup interfaces to directly work with a dev_t
so that struct block_device references are only acquired by the
blkdev_get variants (and the blk-cgroup special case). This means that
we now don't need an extra reference in the inode and can generally
simplify handling of struct block_device to keep the lookups contained
in the core block layer code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
set_blocksize is used by file systems to use their preferred buffer cache
block size. Block drivers should not set it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
set_blocksize is used by file systems to use their preferred buffer cache
block size. Block drivers should not set it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
del_gendisk already calls fsync_bdev for every partition, no need
to do this twice.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
revalidate_disk_size just updates the block device size from the disk
size. Thus calling it from virtblk_update_cache_mode doesn't actually
do anything.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use set_capacity_and_notify to set the size of both the disk and block
device. This also gets the uevent notifications for the resize for free.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use set_capacity_and_notify to set the size of both the disk and block
device. This also gets the uevent notifications for the resize for free.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use set_capacity_and_notify to set the size of both the disk and block
device. This also gets the uevent notifications for the resize for free.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use set_capacity_and_notify to set the size of both the disk and block
device. This also gets the uevent notifications for the resize for free.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use set_capacity_and_notify to set the size of both the disk and block
device. This also gets the uevent notifications for the resize for free.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Updating the block device size from irq context can lead to torn
writes of the 64-bit value, and prevents us from using normal
process context locking primitives to serialize access to the 64-bit
nr_sectors value. Defer the set_capacity to the already existing
workqueue handler, where it can be merged with the update of the
block device size by using set_capacity_and_notify. As an extra
bonus this also adds proper uevent notifications for the resize.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use set_capacity_and_notify to update the disk and block device sizes and
send a RESIZE uevent to userspace. Note that blktests relies on uevents
being sent also for updates that did not change the device size, so the
explicit kobject_uevent remains for that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the validation of the block from the callers into nbd_set_size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge nbd_size_set and nbd_size_update into a single function that also
updates the nbd_config fields. This new function takes the device size
in bytes as the first argument, and the blocksize as the second argument,
simplifying the calculations required in most callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nbd_size_update is about to acquire a few more callers, so lift the check
into the function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Block driver have no business setting the file system concept of a
block size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The update_bdev argument is always set to true, so remove it. Also
rename the function to the slighly less verbose set_capacity_and_notify,
as propagating the disk size to the block device isn't really
revalidation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no good reason to call revalidate_disk_size separately.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use separate gendisks (which share a tag_set) for the different operating
modes instead of redirecting the gendisk lookup using a probe callback.
This avoids potential problems with aliased block_device instances and
will eventually allow for removing the blk_register_region framework.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
reindent the driver using Lident as the code style was far away from
normal Linux code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The Atari floppy driver usually autodetects the media when used with the
ormal /dev/fd? devices, which also are the only nodes created by udev.
But it also supports various aliases that force a given media format.
That is currently supported using the blk_register_region framework
which finds the floppy gendisk even for a 'mismatched' dev_t. The
problem with this (besides the code complexity) is that it creates
multiple struct block_device instances for the whole device of a
single gendisk, which can lead to interesting issues in code not
aware of that fact.
To fix this just create a separate gendisk for each of the aliases
if they are accessed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use separate gendisks (which share a tag_set) for the native Amgiga vs
the MS-DOS mode instead of redirecting the gendisk lookup using a probe
callback. This avoids potential problems with aliased block_device
instances and will eventually allow for removing the blk_register_region
framework.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The floppy driver usually autodetects the media when used with the
normal /dev/fd? devices, which also are the only nodes created by udev.
But it also supports various aliases that force a given media format.
That is currently supported using the blk_register_region framework
which finds the floppy gendisk even for a 'mismatched' dev_t. The
problem with this (besides the code complexity) is that it creates
multiple struct block_device instances for the whole device of a
single gendisk, which can lead to interesting issues in code not
aware of that fact.
To fix this just create a separate gendisk for each of the aliases
if they are accessed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the simpler mechanism attached to major_name to allocate a brd device
when a currently unregistered minor is accessed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the simpler mechanism attached to major_name to allocate a brd device
when a currently unregistered minor is accessed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The swim driver (unlike various other floppy drivers) doesn't have
magic device nodes for certain modes, and already registers a gendisk
for each of the floppies supported by a device. Thus the region
registered is a no-op and can be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use set_disk_ro instead of set_device_ro to match all other block
drivers and to ensure all partitions mirror the read-only flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Implement the ->set_read_only method instead of parsing the actual
ioctl command.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 716ad0986c ("loop: Switch to set_capacity_revalidate_and_notify")
causes an occasional drop of loop device uevent, which are no longer
triggered in loop_set_size() but in a different part of code.
Bug is reproducible with LTP test uevent01 [1]:
i=0; while true; do
i=$((i+1)); echo "== $i =="
lsmod |grep -q loop && rmmod -f loop
./uevent01 || break
done
Put back triggering through code called in loop_set_size().
Fix required to add yet another parameter to
set_capacity_revalidate_and_notify().
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/uevents/uevent01.c
[hch: rebased on a different change to the prototype of
set_capacity_revalidate_and_notify]
Cc: stable@vger.kernel.org # v5.9
Fixes: 716ad0986c ("loop: Switch to set_capacity_revalidate_and_notify")
Reported-by: <ltp@lists.linux.it>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bdget_disk needs to be paired with bdput to not leak a reference
on the block device inode.
Fixes: 08ba91ee6e ("nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit aa1c09cb65 ("null_blk: Fix locking in zoned mode") changed
zone locking to using the potentially sleeping wait_on_bit_io()
function. This is acceptable when memory backing is enabled as the
device queue is in that case marked as blocking, but this triggers a
scheduling while in atomic context with memory backing disabled.
Fix this by relying solely on the device zone spinlock for zone
information protection without temporarily releasing this lock around
null_process_cmd() execution in null_zone_write(). This is OK to do
since when memory backing is disabled, command processing does not
block and the memory backing lock nullb->lock is unused. This solution
avoids the overhead of having to mark a zoned null_blk device queue as
blocking when memory backing is unused.
This patch also adds comments to the zone locking code to explain the
unusual locking scheme.
Fixes: aa1c09cb65 ("null_blk: Fix locking in zoned mode")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use platform_get_resource() to fetch the memory resource and
platform_get_irq_optional() to get optional IRQ instead of
open-coded variants.
IRQ is not supposed to be changed at runtime, so there is
no functional change in ace_fsm_yieldirq().
On the other hand we now take first resources instead of last ones
to proceed. I can't imagine how broken should be firmware to have
a garbage in the first resource slots. But if it the case, it needs
to be documented.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the zoned mode is enabled in null_blk, Serializing read, write
and zone management operations for each zone is necessary to protect
device level information for managing zone resources (zone open and
closed counters) as well as each zone condition and write pointer
position. Commit 35bc10b2ea ("null_blk: synchronization fix for
zoned device") introduced a spinlock to implement this serialization.
However, when memory backing is also enabled, GFP_NOIO memory
allocations are executed under the spinlock, resulting in might_sleep()
warnings. Furthermore, the zone_lock spinlock is locked/unlocked using
spin_lock_irq/spin_unlock_irq, similarly to the memory backing code with
the nullb->lock spinlock. This nested use of irq locks wrecks the irq
enabled/disabled state.
Fix all this by introducing a bitmap for per-zone lock, with locking
implemented using wait_on_bit_lock_io() and clear_and_wake_up_bit().
This locking mechanism allows keeping a zone locked while executing
null_process_cmd(), serializing all operations to the zone while
allowing to sleep during memory backing allocation with GFP_NOIO.
Device level zone resource management information is protected using
a spinlock which is not held while executing null_process_cmd();
Fixes: 35bc10b2ea ("null_blk: synchronization fix for zoned device")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the cae of the REQ_OP_ZONE_RESET_ALL operation, the command sector is
ignored and the operation is applied to all sequential zones. For these
commands, tracing the effect of the command using the command sector to
determine the target zone is thus incorrect.
Fix null_zone_mgmt() zone condition tracing in the case of
REQ_OP_ZONE_RESET_ALL to apply tracing to all sequential zones that are
not already empty.
Fixes: 766c3297d7 ("null_blk: add trace in null_blk_zoned.c")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mounted NBD device can be resized, one use case is rbd-nbd.
Fix the issue by setting up default block size, then not touch it
in nbd_size_update() any more. This kind of usage is aligned with loop
which has same use case too.
Cc: stable@vger.kernel.org
Fixes: c8a83a6b54 ("nbd: Use set_blocksize() to set device blocksize")
Reported-by: lining <lining2020x@163.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jan Kara <jack@suse.cz>
Tested-by: lining <lining2020x@163.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Parallel write,read,zone-mgmt operations accessing/altering zone state
and write-pointer may get into race. Avoid the situation by using a new
spinlock for zoned device.
Concurrent zone-appends (on a zone) returning same write-pointer issue
is also avoided using this lock.
Cc: stable@vger.kernel.org
Fixes: e0489ed5da ("null_blk: Support REQ_OP_ZONE_APPEND")
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCX5VVeQAKCRCAXGG7T9hj
voI0AQD3ol/EN9uHW1qKduBI/nl5tgv325Zri8CMu60kS45pgAD/ccUXRcHojs3l
YIfgcgT4qKQFWzv57Fc9FUBQJMahJgM=
=6ZgH
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.10b-rc1c-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
- a series for the Xen pv block drivers adding module parameters for
better control of resource usge
- a cleanup series for the Xen event driver
* tag 'for-linus-5.10b-rc1c-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
Documentation: add xen.fifo_events kernel parameter description
xen/events: unmask a fifo event channel only if it was masked
xen/events: only register debug interrupt for 2-level events
xen/events: make struct irq_info private to events_base.c
xen: remove no longer used functions
xen-blkfront: Apply changed parameter name to the document
xen-blkfront: add a parameter for disabling of persistent grants
xen-blkback: add a parameter for disabling of persistent grants
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+UQjkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvN9D/9iHU7Vgi8J3SiLrHYUiDtMSI5VnEmSBo6K
Ej/wbbrk4tm2UYi550krOk0dMaHxWD3XWSTlpnrE0sUfjs69G676yzxrlnPt50f5
XbcMc0YOHZfffeu9xXykUO8Q2918PTPC08eLaTK1I8lhKAuuTFCT/syGYu+prfd7
AogyuczaDok8nqJEK9QNr0iaEUbe17GQwmvpWyjHl/qfKhWvV2r6jCZZf6pzQj2c
zv3kbiT3u6xw9OEuhY0sgpTEfhAHEXbNIln6Ob4qVgxmOjwgiZdU/QXyw1i2s6pc
ks7e28P43r3VfNYGBfr/hQCeAJT9gOeUG5yBiQr7ooX6uNPL6GOCG7DO/g5y2thQ
NkV4hub/FjYWbSmRzDlJGj1fWn4L+3r/O8g5nMr+F1L3JYeaW0hOyStqBQ4O74Cj
04tvWQ8ndXdPQrm/iDhM6KxfCvR5TC6k4fy9XPpRW8JOxauhIwTZQJyEQUnXTH3v
pwv3IxRmuWGa3mrJZ5kGhsNAEGHdZCL5soLI+BXAD2MUW2IB5v2HpD/z1bvWL/51
uYiVIt/2LxgLkF7BXP40PnY0qqTsOwGxdd6wQhi5Jn9Et+JkmAAR6cVwXx4AhuQg
FT5mq7ZTQBZrErQu4Mr1k3UyqBFm4MB+mbJhWrVWnUnnyA6pcr1NUsUTz5JcyrWz
jWI7T1Si7w==
=dFJi
-----END PGP SIGNATURE-----
Merge tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request from Christoph
- rdma error handling fixes (Chao Leng)
- fc error handling and reconnect fixes (James Smart)
- fix the qid displace when tracing ioctl command (Keith Busch)
- don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni)
- fix MTDT for passthru (Logan Gunthorpe)
- blacklist Write Same on more devices (Kai-Heng Feng)
- fix an uninitialized work struct (zhenwei pi)"
- lightnvm out-of-bounds fix (Colin)
- SG allocation leak fix (Doug)
- rnbd fixes (Gioh, Guoqing, Jack)
- zone error translation fixes (Keith)
- kerneldoc markup fix (Mauro)
- zram lockdep fix (Peter)
- Kill unused io_context members (Yufen)
- NUMA memory allocation cleanup (Xianting)
- NBD config wakeup fix (Xiubo)
* tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits)
block: blk-mq: fix a kernel-doc markup
nvme-fc: shorten reconnect delay if possible for FC
nvme-fc: wait for queues to freeze before calling update_hr_hw_queues
nvme-fc: fix error loop in create_hw_io_queues
nvme-fc: fix io timeout to abort I/O
null_blk: use zone status for max active/open
nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru
nvmet: cleanup nvmet_passthru_map_sg()
nvmet: limit passthru MTDS by BIO_MAX_PAGES
nvmet: fix uninitialized work for zero kato
nvme-pci: disable Write Zeroes on Sandisk Skyhawk
nvme: use queuedata for nvme_req_qid
nvme-rdma: fix crash due to incorrect cqe
nvme-rdma: fix crash when connect rejected
block: remove unused members for io_context
blk-mq: remove the calling of local_memory_node()
zram: Fix __zram_bvec_{read,write}() locking order
skd_main: remove unused including <linux/version.h>
sgl_alloc_order: fix memory leak
lightnvm: fix out-of-bounds write to array devices->info[]
...
The block layer provides special status codes when requests go beyond
the zone resource limits. Use these codes instead of the generic IOERR
for requests that exceed the max active or open limits the null_blk
device was configured with so that applications know how these special
conditions should be handled.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- a patch that removes crush_workspace_mutex (myself). CRUSH
computations are no longer serialized and can run in parallel.
- a couple new filesystem client metrics for "ceph fs top" command
(Xiubo Li)
- a fix for a very old messenger bug that affected the filesystem,
marked for stable (myself)
- assorted fixups and cleanups throughout the codebase from Jeff
and others.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl+QN8cTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi4iwCACLwUvO0ONTzUUb2D8ftfyXjo/jIod5
eO7RHfCHOP2A83GQpdYdktVDSVeOUIOiPhxAxo1dL4GieI2/saXrnoevam7ogZkA
OmR4drdtRVUqF/aATrtjiDMg2ge0dnx5gfjMxSP/FiPPpXOdrtxng/7tv8yo+03q
AlMqg/YcxO06t1M1qh9SEyfzjcHyPnJU2i0ienngxnGxQ7QiMOR6anF1LNhtN803
4fTBX2tLqDNAa+x5yF1kKSn9OmFNhc0oUsqef+Ck0Vw1LC0/SuxzE2J904iehAwy
/HzJCer0zcp+eZhn3HhoHmp0UZpOC1dzMxa3CHyRJFrxCSTEhY8iqPiJ
=wGr7
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
- a patch that removes crush_workspace_mutex (myself). CRUSH
computations are no longer serialized and can run in parallel.
- a couple new filesystem client metrics for "ceph fs top" command
(Xiubo Li)
- a fix for a very old messenger bug that affected the filesystem,
marked for stable (myself)
- assorted fixups and cleanups throughout the codebase from Jeff and
others.
* tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client: (27 commits)
libceph: clear con->out_msg on Policy::stateful_server faults
libceph: format ceph_entity_addr nonces as unsigned
libceph: fix ENTITY_NAME format suggestion
libceph: move a dout in queue_con_delay()
ceph: comment cleanups and clarifications
ceph: break up send_cap_msg
ceph: drop separate mdsc argument from __send_cap
ceph: promote to unsigned long long before shifting
ceph: don't SetPageError on readpage errors
ceph: mark ceph_fmt_xattr() as printf-like for better type checking
ceph: fold ceph_update_writeable_page into ceph_write_begin
ceph: fold ceph_sync_writepages into writepage_nounlock
ceph: fold ceph_sync_readpages into ceph_readpage
ceph: don't call ceph_update_writeable_page from page_mkwrite
ceph: break out writeback of incompatible snap context to separate function
ceph: add a note explaining session reject error string
libceph: switch to the new "osd blocklist add" command
libceph, rbd, ceph: "blacklist" -> "blocklist"
ceph: have ceph_writepages_start call pagevec_lookup_range_tag
ceph: use kill_anon_super helper
...
Persistent grants feature provides high scalability. On some small
systems, however, it could incur data copy overheads[1] and thus it is
required to be disabled. It can be disabled from blkback side using a
module parameter, 'feature_persistent'. But, it is impossible from
blkfront side. For the reason, this commit adds a blkfront module
parameter for disabling of the feature.
[1] https://wiki.xen.org/wiki/Xen_4.3_Block_Protocol_Scalability
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20200923061841.20531-3-sjpark@amazon.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Persistent grants feature provides high scalability. On some small
systems, however, it could incur data copy overheads[1] and thus it is
required to be disabled. But, there is no option to disable it. For
the reason, this commit adds a module parameter for disabling of the
feature.
[1] https://wiki.xen.org/wiki/Xen_4.3_Block_Protocol_Scalability
Signed-off-by: Anthony Liguori <aliguori@amazon.com>
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20200923061841.20531-2-sjpark@amazon.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCX47RxAAKCRCAXGG7T9hj
vj7WAP94Cj5cQXPGaTD9GAVLVLgF6L2wPxUT1/gqLMulfRgHdAEAqVidg8fLhI6G
vgQiDwYp/svqZco2qA5i4YN+Xnr6Vg0=
=V/7G
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.10b-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
- A single patch to fix the Xen security issue XSA-331 (malicious
guests can DoS dom0 by triggering NULL-pointer dereferences or access
to stale data).
- A larger series to fix the Xen security issue XSA-332 (malicious
guests can DoS dom0 by sending events at high frequency leading to
dom0's vcpus being busy in IRQ handling for elongated times).
* tag 'for-linus-5.10b-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/events: block rogue events for some time
xen/events: defer eoi in case of excessive number of events
xen/events: use a common cpu hotplug hook for event channels
xen/events: switch user event channels to lateeoi model
xen/pciback: use lateeoi irq binding
xen/pvcallsback: use lateeoi irq binding
xen/scsiback: use lateeoi irq binding
xen/netback: use lateeoi irq binding
xen/blkback: use lateeoi irq binding
xen/events: add a new "late EOI" evtchn framework
xen/events: fix race in evtchn_fifo_unmask()
xen/events: add a proper barrier to 2-level uevent unmasking
xen/events: avoid removing an event channel while handling it
In order to reduce the chance for the system becoming unresponsive due
to event storms triggered by a misbehaving blkfront use the lateeoi
irq binding for blkback and unmask the event channel only after
processing all pending requests.
As the thread processing requests is used to do purging work in regular
intervals an EOI may be sent only after having received an event. If
there was no pending I/O request flag the EOI as spurious.
This is part of XSA-332.
Cc: stable@vger.kernel.org
Reported-by: Julien Grall <julien@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Mikhail reported a lockdep spat detailing how __zram_bvec_read() and
__zram_bvec_write() use zstrm->lock and zspage->lock in opposite order.
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge more updates from Andrew Morton:
"155 patches.
Subsystems affected by this patch series: mm (dax, debug, thp,
readahead, page-poison, util, memory-hotplug, zram, cleanups), misc,
core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch,
binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan,
romfs, and fault-injection"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
lib, uaccess: add failure injection to usercopy functions
lib, include/linux: add usercopy failure capability
ROMFS: support inode blocks calculation
ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
sched.h: drop in_ubsan field when UBSAN is in trap mode
scripts/gdb/tasks: add headers and improve spacing format
scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command
kernel/relay.c: drop unneeded initialization
panic: dump registers on panic_on_warn
rapidio: fix the missed put_device() for rio_mport_add_riodev
rapidio: fix error handling path
nilfs2: fix some kernel-doc warnings for nilfs2
autofs: harden ioctl table
ramfs: fix nommu mmap with gaps in the page cache
mm: remove the now-unnecessary mmget_still_valid() hack
mm/gup: take mmap_lock in get_dump_page()
binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
coredump: rework elf/elf_fdpic vma_dump_size() into common helper
coredump: refactor page range dumping into common helper
coredump: let dump_emit() bail out on short writes
...
If we fail to decompress in zram it's a pretty serious problem. We were
entrusted to be able to decompress the old data but we failed. Either
we've got some crazy bug in the compression code or we've got memory
corruption.
At the moment, when this happens the log looks like this:
ERR kernel: [ 1833.099861] zram: Decompression failed! err=-22, page=336112
ERR kernel: [ 1833.099881] zram: Decompression failed! err=-22, page=336112
ALERT kernel: [ 1833.099886] Read-error on swap-device (253:0:2688896)
It is true that we have an "ALERT" level log in there, but (at least to
me) it feels like even this isn't enough to impart the seriousness of this
error. Let's convert to a WARN_ON. Note that WARN_ON is automatically
"unlikely" so we can simply replace the old annotation with the new one.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Link: https://lkml.kernel.org/r/20200917174059.1.If09c882545dbe432268f7a67a4d4cfcb6caace4f@changeid
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add redirect_neigh() BPF packet redirect helper, allowing to limit stack
traversal in common container configs and improving TCP back-pressure.
Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.
Expand netlink policy support and improve policy export to user space.
(Ge)netlink core performs request validation according to declared
policies. Expand the expressiveness of those policies (min/max length
and bitmasks). Allow dumping policies for particular commands.
This is used for feature discovery by user space (instead of kernel
version parsing or trial and error).
Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge.
Allow more than 255 IPv4 multicast interfaces.
Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
packets of TCPv6.
In Multi-patch TCP (MPTCP) support concurrent transmission of data
on multiple subflows in a load balancing scenario. Enhance advertising
addresses via the RM_ADDR/ADD_ADDR options.
Support SMC-Dv2 version of SMC, which enables multi-subnet deployments.
Allow more calls to same peer in RxRPC.
Support two new Controller Area Network (CAN) protocols -
CAN-FD and ISO 15765-2:2016.
Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
kernel problem.
Add TC actions for implementing MPLS L2 VPNs.
Improve nexthop code - e.g. handle various corner cases when nexthop
objects are removed from groups better, skip unnecessary notifications
and make it easier to offload nexthops into HW by converting
to a blocking notifier.
Support adding and consuming TCP header options by BPF programs,
opening the doors for easy experimental and deployment-specific
TCP option use.
Reorganize TCP congestion control (CC) initialization to simplify life
of TCP CC implemented in BPF.
Add support for shipping BPF programs with the kernel and loading them
early on boot via the User Mode Driver mechanism, hence reusing all the
user space infra we have.
Support sleepable BPF programs, initially targeting LSM and tracing.
Add bpf_d_path() helper for returning full path for given 'struct path'.
Make bpf_tail_call compatible with bpf-to-bpf calls.
Allow BPF programs to call map_update_elem on sockmaps.
Add BPF Type Format (BTF) support for type and enum discovery, as
well as support for using BTF within the kernel itself (current use
is for pretty printing structures).
Support listing and getting information about bpf_links via the bpf
syscall.
Enhance kernel interfaces around NIC firmware update. Allow specifying
overwrite mask to control if settings etc. are reset during update;
report expected max time operation may take to users; support firmware
activation without machine reboot incl. limits of how much impact
reset may have (e.g. dropping link or not).
Extend ethtool configuration interface to report IEEE-standard
counters, to limit the need for per-vendor logic in user space.
Adopt or extend devlink use for debug, monitoring, fw update
in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw,
mv88e6xxx, dpaa2-eth).
In mlxsw expose critical and emergency SFP module temperature alarms.
Refactor port buffer handling to make the defaults more suitable and
support setting these values explicitly via the DCBNL interface.
Add XDP support for Intel's igb driver.
Support offloading TC flower classification and filtering rules to
mscc_ocelot switches.
Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
fixed interval period pulse generator and one-step timestamping in
dpaa-eth.
Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
offload.
Add Lynx PHY/PCS MDIO module, and convert various drivers which have
this HW to use it. Convert mvpp2 to split PCS.
Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
7-port Mediatek MT7531 IP.
Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
and wcn3680 support in wcn36xx.
Improve performance for packets which don't require much offloads
on recent Mellanox NICs by 20% by making multiple packets share
a descriptor entry.
Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto
subtree to drivers/net. Move MDIO drivers out of the phy directory.
Clean up a lot of W=1 warnings, reportedly the actively developed
subsections of networking drivers should now build W=1 warning free.
Make sure drivers don't use in_interrupt() to dynamically adapt their
code. Convert tasklets to use new tasklet_setup API (sadly this
conversion is not yet complete).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+ItRwACgkQMUZtbf5S
IrtTMg//UxpdR/MirT1DatBU0K/UGAZY82hV7F/UC8tPgjfHZeHvWlDFxfi3YP81
PtPKbhRZ7DhwBXefUp6nY3UdvjftrJK2lJm8prJUPSsZRye8Wlcb7y65q7/P2y2U
Efucyopg6RUrmrM0DUsIGYGJgylQLHnMYUl/keCsD4t5Bp4ksyi9R2t5eitGoWzh
r3QGdbSa0AuWx4iu0i+tqp6Tj0ekMBMXLVb35dtU1t0joj2KTNEnSgABN3prOa8E
iWYf2erOau68Ogp3yU3miCy0ZU4p/7qGHTtzbcp677692P/ekak6+zmfHLT9/Pjy
2Stq2z6GoKuVxdktr91D9pA3jxG4LxSJmr0TImcGnXbvkMP3Ez3g9RrpV5fn8j6F
mZCH8TKZAoD5aJrAJAMkhZmLYE1pvDa7KolSk8WogXrbCnTEb5Nv8FHTS1Qnk3yl
wSKXuvutFVNLMEHCnWQLtODbTST9DI/aOi6EctPpuOA/ZyL1v3pl+gfp37S+LUTe
owMnT/7TdvKaTD0+gIyU53M6rAWTtr5YyRQorX9awIu/4Ha0F0gYD7BJZQUGtegp
HzKt59NiSrFdbSH7UdyemdBF4LuCgIhS7rgfeoUXMXmuPHq7eHXyHZt5dzPPa/xP
81P0MAvdpFVwg8ij2yp2sHS7sISIRKq17fd1tIewUabxQbjXqPc=
=bc1U
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
- Add redirect_neigh() BPF packet redirect helper, allowing to limit
stack traversal in common container configs and improving TCP
back-pressure.
Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.
- Expand netlink policy support and improve policy export to user
space. (Ge)netlink core performs request validation according to
declared policies. Expand the expressiveness of those policies
(min/max length and bitmasks). Allow dumping policies for particular
commands. This is used for feature discovery by user space (instead
of kernel version parsing or trial and error).
- Support IGMPv3/MLDv2 multicast listener discovery protocols in
bridge.
- Allow more than 255 IPv4 multicast interfaces.
- Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
packets of TCPv6.
- In Multi-patch TCP (MPTCP) support concurrent transmission of data on
multiple subflows in a load balancing scenario. Enhance advertising
addresses via the RM_ADDR/ADD_ADDR options.
- Support SMC-Dv2 version of SMC, which enables multi-subnet
deployments.
- Allow more calls to same peer in RxRPC.
- Support two new Controller Area Network (CAN) protocols - CAN-FD and
ISO 15765-2:2016.
- Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
kernel problem.
- Add TC actions for implementing MPLS L2 VPNs.
- Improve nexthop code - e.g. handle various corner cases when nexthop
objects are removed from groups better, skip unnecessary
notifications and make it easier to offload nexthops into HW by
converting to a blocking notifier.
- Support adding and consuming TCP header options by BPF programs,
opening the doors for easy experimental and deployment-specific TCP
option use.
- Reorganize TCP congestion control (CC) initialization to simplify
life of TCP CC implemented in BPF.
- Add support for shipping BPF programs with the kernel and loading
them early on boot via the User Mode Driver mechanism, hence reusing
all the user space infra we have.
- Support sleepable BPF programs, initially targeting LSM and tracing.
- Add bpf_d_path() helper for returning full path for given 'struct
path'.
- Make bpf_tail_call compatible with bpf-to-bpf calls.
- Allow BPF programs to call map_update_elem on sockmaps.
- Add BPF Type Format (BTF) support for type and enum discovery, as
well as support for using BTF within the kernel itself (current use
is for pretty printing structures).
- Support listing and getting information about bpf_links via the bpf
syscall.
- Enhance kernel interfaces around NIC firmware update. Allow
specifying overwrite mask to control if settings etc. are reset
during update; report expected max time operation may take to users;
support firmware activation without machine reboot incl. limits of
how much impact reset may have (e.g. dropping link or not).
- Extend ethtool configuration interface to report IEEE-standard
counters, to limit the need for per-vendor logic in user space.
- Adopt or extend devlink use for debug, monitoring, fw update in many
drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
dpaa2-eth).
- In mlxsw expose critical and emergency SFP module temperature alarms.
Refactor port buffer handling to make the defaults more suitable and
support setting these values explicitly via the DCBNL interface.
- Add XDP support for Intel's igb driver.
- Support offloading TC flower classification and filtering rules to
mscc_ocelot switches.
- Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
fixed interval period pulse generator and one-step timestamping in
dpaa-eth.
- Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
offload.
- Add Lynx PHY/PCS MDIO module, and convert various drivers which have
this HW to use it. Convert mvpp2 to split PCS.
- Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
7-port Mediatek MT7531 IP.
- Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
and wcn3680 support in wcn36xx.
- Improve performance for packets which don't require much offloads on
recent Mellanox NICs by 20% by making multiple packets share a
descriptor entry.
- Move chelsio inline crypto drivers (for TLS and IPsec) from the
crypto subtree to drivers/net. Move MDIO drivers out of the phy
directory.
- Clean up a lot of W=1 warnings, reportedly the actively developed
subsections of networking drivers should now build W=1 warning free.
- Make sure drivers don't use in_interrupt() to dynamically adapt their
code. Convert tasklets to use new tasklet_setup API (sadly this
conversion is not yet complete).
* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
net, sockmap: Don't call bpf_prog_put() on NULL pointer
bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
bpf, sockmap: Add locking annotations to iterator
netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
net: fix pos incrementment in ipv6_route_seq_next
net/smc: fix invalid return code in smcd_new_buf_create()
net/smc: fix valid DMBE buffer sizes
net/smc: fix use-after-free of delayed events
bpfilter: Fix build error with CONFIG_BPFILTER_UMH
cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
bpf: Fix register equivalence tracking.
rxrpc: Fix loss of final ack on shutdown
rxrpc: Fix bundle counting for exclusive connections
netfilter: restore NF_INET_NUMHOOKS
ibmveth: Identify ingress large send packets.
ibmveth: Switch order of ibmveth_helper calls.
cxgb4: handle 4-tuple PEDIT to NAT mode translation
selftests: Add VRF route leaking tests
...
There has one race case for ceph's rbd-nbd tool. When do mapping
it may fail with EBUSY from ioctl(nbd, NBD_DO_IT), but actually
the nbd device has already unmaped.
It dues to if just after the wake_up(), the recv_work() is scheduled
out and defers calling the nbd_config_put(), though the map process
has exited the "nbd->recv_task" is not cleared.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After send_msg_open is done, send_msg_close should be done
if any error occurs and it is necessary to recover
what has been done.
Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The max_hw_secotrs is only limited by the transport, not remote device,
block layer on server side will split to the device limit if it's too
big.
The max_segments, similar, and rtrs server will submit single buffer, so
no need to cap.
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The argument is not needed since all callers pass 1 for it.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EYWYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsCgD/9Izy/mbiQMmcBPBuQFds2b2SwPAoB4RVcU
NU7pcI3EbAlcj7xDF08Z74Sr6MKyg+JhGid15iw47o+qFq6cxDKiESYLIrFmb70R
lUDkPr9J4OLNDSZ6hpM4sE6Qg9bzDPhRbAceDQRtVlqjuQdaOS2qZAjNG4qjO8by
3PDO7XHCW+X4HhXiu2PDCKuwyDlHxggYzhBIFZNf58US2BU8+tLn2gvTSvmTb27F
w0s5WU1Q5Q0W9RLrp4YTQi4SIIOq03BTSqpRjqhomIzhSQMieH95XNKGRitLjdap
2mFNJ+5I+DTB/TW2BDBrBRXnoV/QNBJsR0DDFnUZsHEejjXKEVt5BRCpSQC9A0WW
XUyVE1K+3GwgIxSI8tjPtyPEGzzhnqJjzHPq4LJLGlQje95v9JZ6bpODB7HHtZQt
rbNp8IoVQ0n01nIvkkt/vnzCE9VFbWFFQiiu5/+x26iKZXW0pAF9Dnw46nFHoYZi
llYvbKDcAUhSdZI8JuqnSnKhi7sLRNPnApBxs52mSX8qaE91sM2iRFDewYXzaaZG
NjijYCcUtopUvojwxYZaLnIpnKWG4OZqGTNw1IdgzUtfdxoazpg6+4wAF9vo7FEP
AePAUTKrfkGBm95uAP4bRvXBzS9UhXJvBrFW3grzRZybMj617F01yAR4N0xlMXeN
jMLrGe7sWA==
=xE9E
-----END PGP SIGNATURE-----
Merge tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"Here are the driver updates for 5.10.
A few SCSI updates in here too, in coordination with Martin as they
depend on core block changes for the shared tag bitmap.
This contains:
- NVMe pull requests via Christoph:
- fix keep alive timer modification (Amit Engel)
- order the PCI ID list more sensibly (Andy Shevchenko)
- cleanup the open by controller helper (Chaitanya Kulkarni)
- use an xarray for the CSE log lookup (Chaitanya Kulkarni)
- support ZNS in nvmet passthrough mode (Chaitanya Kulkarni)
- fix nvme_ns_report_zones (Christoph Hellwig)
- add a sanity check to nvmet-fc (James Smart)
- fix interrupt allocation when too many polled queues are
specified (Jeffle Xu)
- small nvmet-tcp optimization (Mark Wunderlich)
- fix a controller refcount leak on init failure (Chaitanya
Kulkarni)
- misc cleanups (Chaitanya Kulkarni)
- major refactoring of the scanning code (Christoph Hellwig)
- MD updates via Song:
- Bug fixes in bitmap code, from Zhao Heming
- Fix a work queue check, from Guoqing Jiang
- Fix raid5 oops with reshape, from Song Liu
- Clean up unused code, from Jason Yan
- Discard improvements, from Xiao Ni
- raid5/6 page offset support, from Yufen Yu
- Shared tag bitmap for SCSI/hisi_sas/null_blk (John, Kashyap,
Hannes)
- null_blk open/active zone limit support (Niklas)
- Set of bcache updates (Coly, Dongsheng, Qinglang)"
* tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (78 commits)
md/raid5: fix oops during stripe resizing
md/bitmap: fix memory leak of temporary bitmap
md: fix the checking of wrong work queue
md/bitmap: md_bitmap_get_counter returns wrong blocks
md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks
md/raid0: remove unused function is_io_in_chunk_boundary()
nvme-core: remove extra condition for vwc
nvme-core: remove extra variable
nvme: remove nvme_identify_ns_list
nvme: refactor nvme_validate_ns
nvme: move nvme_validate_ns
nvme: query namespace identifiers before adding the namespace
nvme: revalidate zone bitmaps in nvme_update_ns_info
nvme: remove nvme_update_formats
nvme: update the known admin effects
nvme: set the queue limits in nvme_update_ns_info
nvme: remove the 0 lba_shift check in nvme_update_ns_info
nvme: clean up the check for too large logic block sizes
nvme: freeze the queue over ->lba_shift updates
nvme: factor out a nvme_configure_metadata helper
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
ZwfpDh2+Tg==
=LzyE
-----END PGP SIGNATURE-----
Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Series of merge handling cleanups (Baolin, Christoph)
- Series of blk-throttle fixes and cleanups (Baolin)
- Series cleaning up BDI, seperating the block device from the
backing_dev_info (Christoph)
- Removal of bdget() as a generic API (Christoph)
- Removal of blkdev_get() as a generic API (Christoph)
- Cleanup of is-partition checks (Christoph)
- Series reworking disk revalidation (Christoph)
- Series cleaning up bio flags (Christoph)
- bio crypt fixes (Eric)
- IO stats inflight tweak (Gabriel)
- blk-mq tags fixes (Hannes)
- Buffer invalidation fixes (Jan)
- Allow soft limits for zone append (Johannes)
- Shared tag set improvements (John, Kashyap)
- Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)
- DM no-wait support (Mike, Konstantin)
- Request allocation improvements (Ming)
- Allow md/dm/bcache to use IO stat helpers (Song)
- Series improving blk-iocost (Tejun)
- Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
Xianting, Yang, Yufen, yangerkun)
* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
block: fix uapi blkzoned.h comments
blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
blk-mq: get rid of the dead flush handle code path
block: get rid of unnecessary local variable
block: fix comment and add lockdep assert
blk-mq: use helper function to test hw stopped
block: use helper function to test queue register
block: remove redundant mq check
block: invoke blk_mq_exit_sched no matter whether have .exit_sched
percpu_ref: don't refer to ref->data if it isn't allocated
block: ratelimit handle_bad_sector() message
blk-throttle: Re-use the throtl_set_slice_end()
blk-throttle: Open code __throtl_de/enqueue_tg()
blk-throttle: Move service tree validation out of the throtl_rb_first()
blk-throttle: Move the list operation after list validation
blk-throttle: Fix IO hang for a corner case
blk-throttle: Avoid tracking latency if low limit is invalid
blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
block: Remove redundant 'return' statement
...
Rejecting non-native endian BTF overlapped with the addition
of support for it.
The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.
Signed-off-by: David S. Miller <davem@davemloft.net>
DRBD keeps a block device open just to get and set the capacity from
it. Switch to primarily using the disk capacity as intended by the
block layer, and sync it to the bdev using revalidate_disk_size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bulk of the genetlink users can use smaller ops, move them.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In _drbd_send_page() a page is checked by following code before sending
it by kernel_sendpage(),
(page_count(page) < 1) || PageSlab(page)
If the check is true, this page won't be send by kernel_sendpage() and
handled by sock_no_sendpage().
This kind of check is exactly what macro sendpage_ok() does, which is
introduced into include/linux/net.h to solve a similar send page issue
in nvme-tcp code.
This patch uses macro sendpage_ok() to replace the open coded checks to
page type and refcount in _drbd_send_page(), as a code cleanup.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for user space to set a max open zone and a max active zone
limit via configfs. By default, the default values are 0 == no limit.
Call the block layer API functions used for exposing the configured
limits to sysfs.
Add accounting in null_blk_zoned so that these new limits are respected.
Performing an operation that would exceed these limits results in a
standard I/O error.
A max open zone limit exists in the ZBC standard.
While null_blk_zoned is used to test the Zoned Block Device model in
Linux, when it comes to differences between ZBC and ZNS, null_blk_zoned
mostly follows ZBC.
Therefore, implement the manage open zone resources function from ZBC,
but additionally add support for max active zones.
This enables user space not only to test against a device with an open
zone limit, but also to test against a device with an active zone limit.
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ->bd_contains field is set by __blkdev_get and drivers have no
business manipulating it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bd_disk is set on all block devices, including those for partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Support a shared tag bitmap, whereby request tags are unique over all
submission queues, and not just per submission queue.
As such, per device total queue depth is normally hw_queue_depth *
submit_queues, but hw_queue_depth when set. And a similar story for when
shared_tags is set, where that is the queue depth over all null blk
devices.
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-5.10/block: (140 commits)
bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
bdi: invert BDI_CAP_NO_ACCT_WB
bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
mm: use SWP_SYNCHRONOUS_IO more intelligently
bdi: remove BDI_CAP_SYNCHRONOUS_IO
bdi: remove BDI_CAP_CGROUP_WRITEBACK
block: lift setting the readahead size into the block layer
md: update the optimal I/O size on reshape
bdi: initialize ->ra_pages and ->io_pages in bdi_init
aoe: set an optimal I/O size
bcache: inherit the optimal I/O size
drbd: remove dead code in device_to_statistics
fs: remove the unused SB_I_MULTIROOT flag
block: mark blkdev_get static
PM: mm: cleanup swsusp_swap_check
mm: split swap_type_of
PM: rewrite is_hibernate_resume_dev to not require an inode
mm: cleanup claim_swapfile
ocfs2: cleanup o2hb_region_dev_store
dasd: cleanup dasd_scan_partitions
...
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it. This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.
One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore. It is replaced with a queue attribute which
also is writable for easier testing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
decided if ->rw_page can be used on a block device. Just check up for
the method instead. The only complication is that zram needs a second
set of block_device_operations as it can switch between modes that
actually support ->rw_page and those who don't.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drivers shouldn't really mess with the readahead size, as that is a VM
concept. Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk. Also set
bdi->io_pages there as well by applying the same scheme based on
max_sectors. To ensure the limits work well for stacking drivers a
new helper is added to update the readahead limits from the block
limits, which is also called from disk_stack_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
aoe forces a larger readahead size, but any reason to do larger I/O
is not limited to readahead. Also set the optimal I/O size, and
remove the local constants in favor of just using SZ_2G.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ever since the switch to blk-mq, a lower device not used for VM
writeback will not be marked congested, so the check will never
trigger.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can only scan for partitions on the whole disk, so move the flag
from struct block_device to struct gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The pcd driver does not have a ->revalidate_disk method, so it can just
use bdev_check_media_change without any additional changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass a struct ace_device to ace_revalidate_disk, move the media changed
check into the one caller that needs it, and give the routine a better
name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>