In line with the distinction between io_put_req and io_put_req_find_next,
io_free_req has been split the same way. No functional changes.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We already have io_put_req_find_next to find the next req in a link;
we should not use io_put_req to do that. The two should be functions
at the same level.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In many of these functions the core object is req, and req->ctx is
already set at initialization time, so there is no need to pass ctx
in from the caller.
Cleanup, no functional change.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the recent flurry of additions and changes to io_uring, the
layout of io_ring_ctx has become a bit stale. We're right now at
704 bytes in size on my x86-64 build, or 11 cachelines. This
patch does two things:
- We have two completion structs embedded, that we only use for
quiesce of the ctx (or shutdown) and for sqthread init cases.
That's 2x32 bytes right there; let's dynamically allocate them.
- Reorder the struct a bit with an eye on cachelines, use cases,
and holes.
With this patch, we're down to 512 bytes, or 8 cachelines.
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'configfs-for-5.4-2' of git://git.infradead.org/users/hch/configfs
Pull configfs regression fix from Christoph Hellwig:
"Fix a regression from this merge window in the configfs symlink
handling (Honggang Li)"
* tag 'configfs-for-5.4-2' of git://git.infradead.org/users/hch/configfs:
configfs: calculate the depth of parent item
We need to get the underlying dentry of parent; sure, absent the races
it is the parent of underlying dentry, but there's nothing to prevent
losing a timeslice to preemption in the middle of evaluation of
lower_dentry->d_parent->d_inode, having another process move lower_dentry
around and have its (ex)parent not pinned anymore and freed on memory
pressure. Then we regain CPU and try to fetch ->d_inode from memory
that is freed by that point.
dentry->d_parent *is* stable here - it's an argument of ->lookup() and
we are guaranteed that it won't be moved anywhere until we feed it
to d_add/d_splice_alias. So we safely go that way to get to its
underlying dentry.
Cc: stable@vger.kernel.org # since 2009 or so
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
lower_dentry can't go from positive to negative (we have it pinned),
but it *can* go from negative to positive. So fetching ->d_inode
into a local variable, doing a blocking allocation, checking that
now ->d_inode is non-NULL and feeding the value we'd fetched
earlier to a function that won't accept NULL is not a good idea.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A problem similar to the one caught in commit 74dd7c97ea ("ecryptfs_rename():
verify that lower dentries are still OK after lock_rename()") exists for
unlink/rmdir as well.
Instead of playing with dget_parent() of underlying dentry of victim
and hoping it's the same as underlying dentry of our directory,
do the following:
* find the underlying dentry of victim
* find the underlying directory of victim's parent (stable
since the victim is ecryptfs dentry and inode of its parent is
held exclusive by the caller).
* lock the inode of dentry underlying the victim's parent
* check that underlying dentry of victim is still hashed and
has the right parent - it can be moved, but it can't be moved to/from
the directory we are holding exclusive. So while ->d_parent itself
might not be stable, the result of comparison is.
If the check passes, everything is fine - underlying directory is locked,
underlying victim is still a child of that directory and we can go ahead
and feed them to vfs_unlink(). As in the current mainline we need to
pin the underlying dentry of victim, so that it wouldn't go negative under
us, but that's the only temporary reference that needs to be grabbed there.
Underlying dentry of parent won't go away (it's pinned by the parent,
which is held by caller), so there's no need to grab it.
The same problem (with the same solution) exists for rmdir. Moreover,
rename gets simpler and more robust with the same "don't bother with
dget_parent()" approach.
Fixes: 74dd7c97ea "ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
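For illustration, a hedged sketch of the check sequence described above
(ecryptfs_dentry_to_lower() and the VFS helpers are real, but the exact
shape of the diff is assumed here):
	/* Sketch only: lock the underlying directory, verify the underlying
	 * victim is still hashed and still a child of it, then unlink. */
	struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
	struct dentry *lower_dir_dentry = ecryptfs_dentry_to_lower(dentry->d_parent);
	struct inode *lower_dir = d_inode(lower_dir_dentry);
	int rc;

	dget(lower_dentry);		/* keep it from going negative under us */
	inode_lock_nested(lower_dir, I_MUTEX_PARENT);
	if (d_unhashed(lower_dentry) ||
	    lower_dentry->d_parent != lower_dir_dentry)
		rc = -EINVAL;		/* moved out of the locked directory */
	else
		rc = vfs_unlink(lower_dir, lower_dentry, NULL);
	inode_unlock(lower_dir);
	dput(lower_dentry);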
Merge tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few regressions and fixes for stable.
Regressions:
- fix a race leading to metadata space leak after task received a
signal
- un-deprecate 2 ioctls, marked as deprecated by mistake
Fixes:
- fix limit check for number of devices during chunk allocation
- fix a race due to double evaluation of i_size_read inside max()
macro, can cause a crash
- remove wrong device id check in tree-checker"
* tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
Btrfs: fix race leading to metadata space leak after task received signal
btrfs: tree-checker: Fix wrong check on max devid
btrfs: Consider system chunk array size for new SYSTEM chunks
Merge tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Two NVMe device removal crash fixes, and a compat fixup for an
ioctl that was introduced in this release (Anton, Charles, Max - via
Keith)
- Missing error path mutex unlock for drbd (Dan)
- cgroup writeback fixup on dead memcg (Tejun)
- blkcg online stats print fix (Tejun)
* tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block:
cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
block: drbd: remove a stray unlock in __drbd_send_protocol()
blkcg: make blkcg_print_stat() print stats only for online blkgs
nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
nvme-rdma: fix a segmentation fault during module unload
cgroup writeback tries to refresh the associated wb immediately if the
current wb is dead. This is to avoid keeping issuing IOs on the stale
wb after memcg - blkcg association has changed (i.e. when blkcg got
disabled / enabled higher up in the hierarchy).
Unfortunately, the logic gets triggered spuriously on inodes which are
associated with dead cgroups. When the logic is triggered on dead
cgroups, the attempt fails only after doing quite a bit of work
allocating and initializing a new wb.
c3aab9a0bd ("mm/filemap.c: don't initiate writeback if mapping
has no dirty pages") alleviated the issue significantly, as it now only
triggers when the inode has dirty pages. However, the condition can
still be triggered before the inode is switched to a different cgroup,
and the logic simply doesn't make sense.
Skip the immediate switching if the associated memcg is dying.
This is a simplified version of the following two patches:
* https://lore.kernel.org/linux-mm/20190513183053.GA73423@dennisz-mbp/
* http://lkml.kernel.org/r/156355839560.2063.5265687291430814589.stgit@buzz
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: e8a7abf5a5 ("writeback: disassociate inodes from dying bdi_writebacks")
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
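Schematically, the check amounts to something like the following
(a sketch based on the description; wb_dying() and css_is_dying() are
existing helpers, the exact call site is assumed):
	/* Only attempt the immediate wb switch if the memcg is alive. */
	if (unlikely(wb_dying(wbc->wb) && !css_is_dying(wbc->wb->memcg_css)))
		inode_switch_wbs(inode, wbc->wb_id);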
Merge tag 'ceph-for-5.4-rc7' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Some late-breaking dentry handling fixes from Al and Jeff, a patch to
further restrict copy_file_range() to avoid potential data corruption
from Luis and a fix for !CONFIG_CEPH_FSCACHE kernels.
Everything but the fscache fix is marked for stable"
* tag 'ceph-for-5.4-rc7' of git://github.com/ceph/ceph-client:
ceph: return -EINVAL if given fsc mount option on kernel w/o support
ceph: don't allow copy_file_range when stripe_count != 1
ceph: don't try to handle hashed dentries in non-O_CREAT atomic_open
ceph: add missing check in d_revalidate snapdir handling
ceph: fix RCU case handling in ceph_d_revalidate()
ceph: fix use-after-free in __ceph_remove_cap()
Pull on for-linus to resolve what otherwise would have been a conflict
with the cgroups rstat patchset from Tejun.
* for-linus: (942 commits)
blkcg: make blkcg_print_stat() print stats only for online blkgs
nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
nvme-rdma: fix a segmentation fault during module unload
iocost: don't nest spin_lock_irq in ioc_weight_write()
io_uring: ensure we clear io_kiocb->result before each issue
um-ubd: Entrust re-queue to the upper layers
nvme-multipath: remove unused groups_only mode in ana log
nvme-multipath: fix possible io hang after ctrl reconnect
io_uring: don't touch ctx in setup after ring fd install
io_uring: Fix leaked shadow_req
Linux 5.4-rc5
riscv: cleanup do_trap_break
nbd: verify socket is supported during setup
ata: libahci_platform: Fix regulator_get_optional() misuse
nbd: handle racing with error'ed out commands
nbd: protect cmd->status with cmd->lock
io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
io_uring: used cached copies of sq->dropped and cq->overflow
ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157
...
Now that io-wq supports separating the two request lifetime types, mark
the following IO as having unbounded runtimes:
- Any read/write to a non-regular file
- Any specific networked IO
- Any poll command
Signed-off-by: Jens Axboe <axboe@kernel.dk>
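A rough sketch of that classification (the IO_WQ_WORK_UNBOUND flag comes
from the io-wq series; the opcode list is abridged and the exact marking
site is assumed):
	switch (req->submit.sqe->opcode) {
	case IORING_OP_POLL_ADD:
	case IORING_OP_SENDMSG:
	case IORING_OP_RECVMSG:
	case IORING_OP_READV:
	case IORING_OP_WRITEV:
		/* poll, networked IO, or read/write to a non-regular file */
		if (!(req->flags & REQ_F_ISREG))
			req->work.flags |= IO_WQ_WORK_UNBOUND;
		break;
	}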
io_uring supports request types that basically have two different
lifetimes:
1) Bounded completion time. These are requests like disk reads or writes,
which we know will finish in a finite amount of time.
2) Unbounded completion time. These are generally networked IO, where we
have no idea how long they will take to complete. Another example is
POLL commands.
This patch provides support for io-wq to handle these differently, so we
don't starve bounded requests by tying up workers for too long. By default
all work is bounded, unless otherwise specified in the work item.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We hold the wqe lock at this point (which is also annotated), so there's
no need to use the careful variant of list_empty().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we drop completion events, if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.
After this patch, we never overflow the ring, we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.
Note that if we do return -EBUSY, we will have flushed whatever
backlogged events fit into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them; they are already available in the CQ ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
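From the application side, the resulting submit loop could look roughly
like this (a sketch assuming liburing's io_uring_submit() and
io_uring_peek_cqe(); error handling trimmed, handle_completion() is
application-defined):
	int ret;

	for (;;) {
		ret = io_uring_submit(&ring);
		if (ret != -EBUSY)
			break;
		/* The kernel flushed what fit into the CQ ring; reap those
		 * CQEs without entering the kernel, then submit again. */
		struct io_uring_cqe *cqe;
		while (io_uring_peek_cqe(&ring, &cqe) == 0) {
			handle_completion(cqe);
			io_uring_cqe_seen(&ring, cqe);
		}
	}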
This is in preparation for handling CQ ring overflow a bit smarter. We
should not have any functional changes in this patch. Most of the
changes are fairly straightforward; the only ones that stick out a bit
are the ones that change __io_free_req() to take the reference count
into account. If the request hasn't been submitted yet, we know it's
safe to simply ignore references and free it. But let's clean these up
too, as later patches will depend on the caller doing the right thing if
the completion logging grabs a reference to the request.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The rings can be derived from the ctx, and we need the ctx there for
a future change.
No functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While we have support for generic timeouts, we don't have a way to tie
a timeout to a specific SQE. The generic timeouts simply trigger wakeups
on the CQ ring.
This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
as a link to a previous command. The timeout specified can be either
relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
the timeout triggers before the dependent command completes, it will
attempt to cancel that command. Likewise, if the dependent command
completes before the timeout triggers, it will cancel the timeout.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
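For example, pairing a read with a 1-second linked timeout might look
like this (sketch assuming liburing's prep helpers; ring, fd and buf are
set up elsewhere):
	struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(&ring);			/* dependent command */
	io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
	sqe->flags |= IOSQE_IO_LINK;			/* link the next SQE */

	sqe = io_uring_get_sqe(&ring);			/* IORING_OP_LINK_TIMEOUT */
	io_uring_prep_link_timeout(sqe, &ts, 0);	/* relative, 1 second */

	io_uring_submit(&ring);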
We're going to need this helper in a future patch, so move it out
of io_async_cancel() and into its own separate function.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If someone requests fscache on the mount, and the kernel doesn't
support it, it should fail the mount.
[ Drop ceph prefix -- it's provided by pr_err. ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Zoned block devices (ZBC and ZAC devices) allow an explicit control
over the condition (state) of zones. The operations allowed are:
* Open a zone: Transition to open condition to indicate that a zone will
actively be written
* Close a zone: Transition to closed condition to release the drive
resources used for writing to a zone
* Finish a zone: Transition an open or closed zone to the full
condition to prevent write operations
To enable this control for in-kernel zoned block device users, define
the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE
and REQ_OP_ZONE_FINISH as well as the generic function
blkdev_zone_mgmt() for submitting these operations on a range of zones.
This results in blkdev_reset_zones() removal and replacement with this
new zone management function. Users of blkdev_reset_zones() (f2fs and
dm-zoned) are updated accordingly.
Contains contributions from Matias Bjorling, Hans Holmberg,
Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
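An in-kernel caller would use the new helper roughly as follows
(a sketch; blkdev_zone_mgmt() and the REQ_OP_ZONE_* opcodes are the ones
added here, the surrounding variables are assumed):
	/* Explicitly open a zone before writing it, finish it when done. */
	ret = blkdev_zone_mgmt(bdev, REQ_OP_ZONE_OPEN, zone_start_sector,
			       zone_nr_sectors, GFP_NOFS);
	if (ret)
		return ret;
	/* ... submit writes to the zone ... */
	ret = blkdev_zone_mgmt(bdev, REQ_OP_ZONE_FINISH, zone_start_sector,
			       zone_nr_sectors, GFP_NOFS);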
When the client hits a network reconnect, it re-opens every open
file with a create context to reconnect a persistent handle. All
create context types should be 8-byte aligned, but the padding
was missed for that one. As a result, some servers don't allow
us to reconnect handles and return an error. The problem occurs
when the problematic context is not at the end of the create
request packet. Fix this by adding a proper padding at the end
of the reconnect persistent handle context.
Cc: Stable <stable@vger.kernel.org> # 4.19.x
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
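The alignment requirement boils down to padding each context out to a
multiple of 8 bytes, e.g. (illustrative only, not the actual diff):
	/* Round a create context length up to the required 8-byte alignment. */
	static inline unsigned int cctx_size_aligned(unsigned int len)
	{
		return (len + 7) & ~7U;
	}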
A stack-allocated struct sqe_submit is passed down to the submission path
along with a request (a.k.a. struct io_kiocb), and will be copied into
req->submit for async requests.
As space for it is already allocated, fill req->submit in the first
place instead of using the on-stack one. As a result:
1. req->submit is the only place for sqe_submit and is always valid,
so we don't need to track which one to use.
2. there is no need to copy it for the async case
3. the code is simplified by not carrying it as an argument all
the way down
4. the number of function arguments is reduced / spilling is
potentially improved
The downside is that the stack is most probably cached, which isn't true
for the just-allocated memory of a request. Another concern is cache
pollution. Though a request will be touched and fetched along with
req->submit at some point anyway, so it shouldn't be a problem.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Let io_submit_sqes() allocate the io_kiocb before fetching an sqe.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
f2fs inode numbers are stable across filesystem resizing, and f2fs inode
and file logical block numbers are always 32-bit. So f2fs can always
support IV_INO_LBLK_64 encryption policies. Wire up the needed
fscrypt_operations to declare support.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
IV_INO_LBLK_64 encryption policies have special requirements from the
filesystem beyond those of the existing encryption policies:
- Inode numbers must never change, even if the filesystem is resized.
- Inode numbers must be <= 32 bits.
- File logical block numbers must be <= 32 bits.
ext4 has 32-bit inode and file logical block numbers. However,
resize2fs can re-number inodes when shrinking an ext4 filesystem.
Typically, though, the people who would want to use this format don't
care about filesystem shrinking; they'd be fine with a solution that
just prevents the filesystem from being shrunk.
Therefore, add a new feature flag EXT4_FEATURE_COMPAT_STABLE_INODES that
will do exactly that. Then wire up the fscrypt_operations to expose
this flag to fs/crypto/, so that it allows IV_INO_LBLK_64 policies when
this flag is set.
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Inline encryption hardware compliant with the UFS v2.1 standard or with
the upcoming version of the eMMC standard has the following properties:
(1) Per I/O request, the encryption key is specified by a previously
loaded keyslot. There might be only a small number of keyslots.
(2) Per I/O request, the starting IV is specified by a 64-bit "data unit
number" (DUN). IV bits 64-127 are assumed to be 0. The hardware
automatically increments the DUN for each "data unit" of
configurable size in the request, e.g. for each filesystem block.
Property (1) makes it inefficient to use the traditional fscrypt
per-file keys. Property (2) precludes the use of the existing
DIRECT_KEY fscrypt policy flag, which needs at least 192 IV bits.
Therefore, add a new fscrypt policy flag IV_INO_LBLK_64 which causes the
encryption to be modified as follows:
- The encryption keys are derived from the master key, encryption mode
number, and filesystem UUID.
- The IVs are chosen as (inode_number << 32) | file_logical_block_num.
For filenames encryption, file_logical_block_num is 0.
Since the file nonces aren't used in the key derivation, many files may
share the same encryption key. This is much more efficient on the
target hardware. Including the inode number in the IVs and mixing the
filesystem UUID into the keys ensures that data in different files is
nevertheless still encrypted differently.
Additionally, limiting the inode and block numbers to 32 bits and
placing the block number in the low bits maintains compatibility with
the 64-bit DUN convention (property (2) above).
Since this scheme assumes that inode numbers are stable (which may
preclude filesystem shrinking) and that inode and file logical block
numbers are at most 32-bit, IV_INO_LBLK_64 will only be allowed on
filesystems that meet these constraints. These are acceptable
limitations for the cases where this format would actually be used.
Note that IV_INO_LBLK_64 is an on-disk format, not an implementation.
This patch just adds support for it using the existing filesystem layer
encryption. A later patch will add support for inline encryption.
Reviewed-by: Paul Crowley <paulcrowley@google.com>
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
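Written out, the IV/DUN construction described above is simply (plain C,
purely illustrative):
	#include <stdint.h>

	/* IV for IV_INO_LBLK_64: inode number in bits 32-63, file logical
	 * block number in bits 0-31 (0 for filenames encryption). */
	static uint64_t iv_ino_lblk_64(uint32_t inode_number, uint32_t lblk_num)
	{
		return ((uint64_t)inode_number << 32) | lblk_num;
	}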
The access to logged_impl_name is technically a data race, which tools
like KCSAN could complain about in the future. See:
https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE
Fix by using xchg(), which also ensures that only one thread does the
logging.
This also required switching from bool to int, to avoid a build error on
the RISC-V architecture which doesn't implement xchg on bytes.
Signed-off-by: Eric Biggers <ebiggers@google.com>
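The resulting pattern is roughly the following (a sketch; the variable
and message are placeholders for the real crypto-implementation logging):
	static int logged_impl_name;	/* int, not bool: some arches lack byte xchg */

	/* xchg() atomically returns the old value, so exactly one caller
	 * sees 0 and does the logging, and the racy plain read/write is gone. */
	if (!xchg(&logged_impl_name, 1))
		pr_info("using crypto implementation %s\n", impl_name);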
After a call to io_submit_sqe(), it's already known whether it needs
to queue a link or not. Do it there, as it's simpler and doesn't keep
an extra variable across the loop.
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_submit_sqes() and io_ring_submit() do essentially the same thing,
with only minor differences. Deduplicate them.
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the extent tree is modified, it should be protected by inode
cluster lock and ip_alloc_sem.
The extent tree is accessed and modified in the
ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem.
The following is a case. The function ocfs2_fiemap is accessing the
extent tree, which is modified at the same time.
kernel BUG at fs/ocfs2/extent_map.c:475!
invalid opcode: 0000 [#1] SMP
Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
Call Trace:
ocfs2_fiemap+0x1e3/0x430 [ocfs2]
do_vfs_ioctl+0x155/0x510
SyS_ioctl+0x81/0xa0
system_call_fastpath+0x18/0xd8
Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
RIP ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
---[ end trace c8aa0c8180e869dc ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
This issue can be reproduced every week in a production environment.
It is tied to this particular usage pattern; anyone using ocfs2 in the
same way will see the kernel panic frequently.
[akpm@linux-foundation.org: coding style fixes]
[Fix new warning due to unused function by removing said function - Linus ]
Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.com
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We had no more use for this flag after the conversion to io-wq, kill it
off.
Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a request fails, we need to ensure we set REQ_F_FAIL_LINK on it if
REQ_F_LINK is set. Any failure in the chain should break the chain.
We were missing a few spots where this should be done. It might be nice
to generalize this somewhat at some point, as long as we factor in the
fact that failure looks different for each request type.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Using the facility introduced by commit
ba816ad61f ("io_uring: run dependent links inline if possible"),
enable inline dependent link running for poll commands.
io_poll_complete_work() is the most important change, as it allows a
linked sequence of { POLL, READ } (for example) to proceed inline
instead of needing to get punted to another async context. The
submission side only potentially matters for sqthread, but may as well
include that bit.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't know what context we'll be called in for cancel, it could very
well be with IRQs disabled already. Use the IRQ saving variants of the
locking primitives.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
copy_file_range tries to use the OSD 'copy-from' operation, which simply
performs a full object copy. Unfortunately, the implementation of this
system call assumes that stripe_count is always set to 1 and doesn't take
into account that the data may be striped across an object set. If the
file layout has stripe_count different from 1, then the destination file
data will be corrupted.
For example:
Consider a 8 MiB file with 4 MiB object size, stripe_count of 2 and
stripe_size of 2 MiB; the first half of the file will be filled with 'A's
and the second half will be filled with 'B's:
         0      4M     8M             Obj1     Obj2
         +------+------+             +----+   +----+
  file:  | AAAA | BBBB |             | AA |   | AA |
         +------+------+             |----|   |----|
                                     | BB |   | BB |
                                     +----+   +----+
If we copy_file_range this file into a new file (which needs to have the
same file layout!), then it will start by copying the object starting at
file offset 0 (Obj1). And then it will copy the object starting at file
offset 4M -- which is Obj1 again.
The solution for this is, unfortunately, to not allow remote object
copies to be performed when the file layout stripe_count is not 1, and
to simply fall back to the default (VFS) copy_file_range implementation.
Cc: stable@vger.kernel.org
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If ceph_atomic_open is handed a !d_in_lookup dentry, then that means
that it already passed d_revalidate so we *know* that it's negative (or
at least was very recently). Just return -ENOENT in that case.
This also addresses a subtle bug in dentry handling. Non-O_CREAT opens
call atomic_open with the parent's i_rwsem shared, but calling
d_splice_alias on a hashed dentry requires the exclusive lock.
If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT
open, and another client were to race in and create the file before we
issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on
the dentry with the new inode with insufficient locks.
Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
deprecated and scheduled for removal, but we actually do use them for
'btrfs subvolume delete -C/-c'. The only thing in ebc87351e5 that
should have been deprecated is the async flag for subvolume creation.
The deprecation was added in this development cycle; remove it until
it's really time.
Fixes: ebc87351e5 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
Signed-off-by: David Sterba <dsterba@suse.com>
We hit a regression while rolling out 5.2 internally where we were
hitting the following panic
kernel BUG at mm/page-writeback.c:2659!
RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
Call Trace:
__process_pages_contig+0x25a/0x350
? extent_clear_unlock_delalloc+0x43/0x70
submit_compressed_extents+0x359/0x4d0
normal_work_helper+0x15a/0x330
process_one_work+0x1f5/0x3f0
worker_thread+0x2d/0x3d0
? rescuer_thread+0x340/0x340
kthread+0x111/0x130
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x1f/0x30
This is happening because the page is not locked when doing
clear_page_dirty_for_io. Looking at the core dump it was because our
async_extent had a ram_size of 24576 but our async_chunk range only
spanned 20480, so we had a whole extra page in our ram_size for our
async_extent.
This happened because we try not to compress pages outside of our
i_size, however a cleanup patch changed us to do
actual_end = min_t(u64, i_size_read(inode), end + 1);
which is problematic because i_size_read() can evaluate to different
values in between checking and assigning. So either an expanding
truncate or a fallocate could increase our i_size while we're doing
writeout and actual_end would end up being past the range we have
locked.
I confirmed this was what was happening by installing a debug kernel
that had
actual_end = min_t(u64, i_size_read(inode), end + 1);
if (actual_end > end + 1) {
	printk(KERN_ERR "KABOOM\n");
	actual_end = end + 1;
}
and installing it onto 500 boxes of the tier that had been seeing the
problem regularly. Last night I got my debug message and no panic,
confirming what I expected.
[ dsterba: the assembly confirms a tiny race window:
mov 0x20(%rsp),%rax
cmp %rax,0x48(%r15) # read
movl $0x0,0x18(%rsp)
mov %rax,%r12
mov %r14,%rax
cmovbe 0x48(%r15),%r12 # eval
Where r15 is inode and 0x48 is offset of i_size.
The original fix was to revert 62b3762271, which would do an
intermediate assignment and thus also avoid the double evaluation, but
that is not future-proof should the compiler merge the stores and call
i_size_read anyway.
There's a patch adding READ_ONCE to i_size_read, but that's not being
applied at the moment and we need to fix the bug. Instead, emulate
READ_ONCE with two barrier()s, which is effectively what happens. The
assembly confirms a single evaluation:
mov 0x48(%rbp),%rax # read once
mov 0x20(%rsp),%rcx
mov $0x20,%edx
cmp %rax,%rcx
cmovbe %rcx,%rax
mov %rax,(%rsp)
mov %rax,%rcx
mov %r14,%rax
Where 0x48(%rbp) is inode->i_size, loaded into %rax.
]
Fixes: 62b3762271 ("btrfs: Remove isize local variable in compress_file_range")
CC: stable@vger.kernel.org # v5.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
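The resulting code pattern is approximately (a sketch of the
barrier()-based READ_ONCE emulation described above):
	barrier();
	const u64 i_size = i_size_read(inode);	/* evaluated exactly once */
	barrier();
	actual_end = min_t(u64, i_size, end + 1);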
We currently don't have a completion event trace, add one of those. And
to better be able to match up submissions and completions, add user_data
to the submission trace as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, the block device size is not updated on second and further
opens for block devices where partition scan is disabled. This is particularly
annoying for example for DVD drives as that means block device size does
not get updated once the media is inserted into a drive if the device is
already open when inserting the media. This is actually always the case
for example when pktcdvd is in use.
Fix the problem by revalidating block device size on every open even for
devices with partition scan disabled.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out code handling revalidation of bdev on disk change into a
common helper.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag '5.4-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fix from Steve French:
"A small smb3 memleak fix"
* tag '5.4-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
fix memory leak in large read decrypt offload
The callback function of call_rcu() just calls kfree(), so we can use
kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
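The transformation, in the abstract (struct and field names here are
illustrative, not the actual ones from the patch):
	/* Before: a callback whose only job is to kfree() the object. */
	static void foo_free_rcu(struct rcu_head *head)
	{
		kfree(container_of(head, struct foo, rcu));
	}
		...
		call_rcu(&f->rcu, foo_free_rcu);

	/* After: kfree_rcu() does the same without the callback. */
		kfree_rcu(f, rcu);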
Merge tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"This contains two delegation fixes (with the RCU lock leak fix marked
for stable), and three patches to fix destroying the sunrpc back
channel.
Stable bugfixes:
- Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
Other fixes:
- The TCP back channel mustn't disappear while requests are
outstanding
- The RDMA back channel mustn't disappear while requests are
outstanding
- Destroy the back channel when we destroy the host transport
- Don't allow a cached open with a revoked delegation"
* tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
NFSv4: Don't allow a cached open with a revoked delegation
SUNRPC: Destroy the back channel when we destroy the host transport
SUNRPC: The RDMA back channel mustn't disappear while requests are outstanding
SUNRPC: The TCP back channel mustn't disappear while requests are outstanding
Merge tag 'for-linus-20191101' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Two small nvme fixes, one is a fabrics connection fix, the other one
a cleanup made possible by that fix (Anton, via Keith)
- Fix requeue handling in um ubd (Anton)
- Fix spin_lock_irq() nesting in blk-iocost (Dan)
- Three small io_uring fixes:
- Install io_uring fd after done with ctx (me)
- Clear ->result before every poll issue (me)
- Fix leak of shadow request on error (Pavel)
* tag 'for-linus-20191101' of git://git.kernel.dk/linux-block:
iocost: don't nest spin_lock_irq in ioc_weight_write()
io_uring: ensure we clear io_kiocb->result before each issue
um-ubd: Entrust re-queue to the upper layers
nvme-multipath: remove unused groups_only mode in ana log
nvme-multipath: fix possible io hang after ctrl reconnect
io_uring: don't touch ctx in setup after ring fd install
io_uring: Fix leaked shadow_req
A typo in nfs4_refresh_delegation_stateid() means we're leaking an
RCU lock, and always returning a value of 'false'. As the function
description states, we were always supposed to return 'true' if a
matching delegation was found.
Fixes: 12f275cdd1 ("NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the delegation is marked as being revoked, we must not use it
for cached opens.
Fixes: 869f9dfa4d ("NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We didn't use -ERESTARTSYS to tell the application layer to restart the
system call, but instead returned -EINTR. We can set -EINTR directly
when woken up by a signal, which saves us an assignment and a
comparison.
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
cancel requests that have been punted to async context and are now
in-flight. This works for regular read/write requests to files, as
long as they haven't been started yet. For socket based IO (or things
like accept4(2)), we can cancel work that is already running as well.
To cancel a request, the sqe must have ->addr set to the user_data of
the request it wishes to cancel. If the request is cancelled
successfully, the original request is completed with -ECANCELED
and the cancel request is completed with a result of 0. If the
request was already running, the original may or may not complete
in error. The cancel request will complete with -EALREADY for that
case. And finally, if the request to cancel wasn't found, the cancel
request is completed with -ENOENT.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
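In raw SQE terms, issuing a cancel looks like this (a sketch based only
on the description above; ring setup and the user_data values are
assumed):
	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_ASYNC_CANCEL;
	sqe->addr = target_user_data;	/* user_data of the request to cancel */
	sqe->user_data = cancel_tag;	/* identifies this cancel request */
	io_uring_submit(&ring);
	/* The cancel's CQE res is 0, -EALREADY or -ENOENT as described above. */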
We use io_kiocb->result == -EAGAIN as a way to know if we need to
re-submit a polled request, as -EAGAIN reporting happens out-of-line
for IO submission failures. This field is cleared when we originally
allocate the request, but it isn't reset when we retry the submission
from async context. This can cause issues where we think something
needs a re-issue, but we're really just reading stale data.
Reset ->result whenever we re-prep a request for polled submission.
Cc: stable@vger.kernel.org
Fixes: 9e645e1105 ("io_uring: add support for sqe links")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reported an issue where we crash at setup time if failslab is
used. The issue is that io_wq_create() returns an error pointer on
failure, not NULL. Hence io_uring thought the io-wq was set up just
fine, but in reality it's a garbage error pointer.
Use IS_ERR() instead of a NULL check, and assign ret appropriately.
Reported-by: syzbot+221cc24572a2fed23b6b@syzkaller.appspotmail.com
Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When gfs2 was converted to use fs_context, the initialisation of the
mount args structure to the currently active args was lost with the
removal of gfs2_remount_fs(), so the checks of the new args on remount
became checks against the default values instead of the current ones.
This caused unexpected remount behaviour and test failures (xfstests
generic/294, generic/306 and generic/452).
Reinstate the args initialisation, this time in gfs2_init_fs_context()
and conditional upon fc->purpose, as that's the only time we get control
before the mount args are parsed in the remount process.
Fixes: 1f52aa08d1 ("gfs2: Convert gfs2 to fs_context")
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
If we get -1 from hrtimer_try_to_cancel(), we know that the timer
is running. Hence leave all completion to the timeout handler. If
we don't, we can corrupt the list and miss a completion.
Fixes: 11365043e5 ("io_uring: add support for canceling timeout requests")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
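Schematically (a sketch of the pattern; the timer field name is
assumed):
	ret = hrtimer_try_to_cancel(&req->timeout.timer);
	if (ret == -1) {
		/* The timeout handler is running right now; it owns list
		 * removal and completion, so don't touch the request here. */
		return;
	}
	/* ret is 0 or 1: the handler will not run, safe to complete here. */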
We should not play with dcache without parent locked...
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For RCU case ->d_revalidate() is called with rcu_read_lock() and
without pinning the dentry passed to it. Which means that it
can't rely upon ->d_inode remaining stable; that's the reason
for d_inode_rcu(), actually.
Make sure we don't reload ->d_inode there.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
KASAN reports a use-after-free when running xfstest generic/531, with the
following trace:
[ 293.903362] kasan_report+0xe/0x20
[ 293.903365] rb_erase+0x1f/0x790
[ 293.903370] __ceph_remove_cap+0x201/0x370
[ 293.903375] __ceph_remove_caps+0x4b/0x70
[ 293.903380] ceph_evict_inode+0x4e/0x360
[ 293.903386] evict+0x169/0x290
[ 293.903390] __dentry_kill+0x16f/0x250
[ 293.903394] dput+0x1c6/0x440
[ 293.903398] __fput+0x184/0x330
[ 293.903404] task_work_run+0xb9/0xe0
[ 293.903410] exit_to_usermode_loop+0xd3/0xe0
[ 293.903413] do_syscall_64+0x1a0/0x1c0
[ 293.903417] entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens because __ceph_remove_cap() may queue a cap release
(__ceph_queue_cap_release) which can be scheduled before that cap is
removed from the inode list with
rb_erase(&cap->ci_node, &ci->i_caps);
And, when this finally happens, the use-after-free will occur.
This can be fixed by removing the cap from the inode list before it is
removed from the session list, thus eliminating the risk of a UAF.
Cc: stable@vger.kernel.org
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
There's been a few requests for supporting more fixed files than 1024.
This isn't really tricky to do, we just need to split up the file table
into multiple tables and index appropriately. As we do so, reduce the
max single file table to 512. This enables us to do single page allocs
always for the tables, which is an improvement over the situation prior.
This patch adds support for up to 64K files, which should be enough for
everyone.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
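The lookup implied by the split is a simple two-level index, roughly
(constants from the description; struct and helper names are
assumptions):
	#define IORING_MAX_FILES_TABLE	512	/* one page worth of pointers */

	struct fixed_file_table {
		struct file **files;
	};

	static struct file *io_file_from_index(struct fixed_file_table *tables,
					       unsigned int index)
	{
		/* which table, then the slot within that table */
		struct fixed_file_table *table = &tables[index / IORING_MAX_FILES_TABLE];

		return table->files[index & (IORING_MAX_FILES_TABLE - 1)];
	}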
We index the file tables with a user given value. After we check
it's within our limits, use array_index_nospec() to prevent any
spectre attacks here.
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
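The pattern is the standard one from <linux/nospec.h> (field names here
are illustrative):
	if (index >= ctx->nr_user_files)
		return -EBADF;
	/* Clamp the index under speculation too, so a mispredicted bounds
	 * check can't be used to read beyond the table. */
	index = array_index_nospec(index, ctx->nr_user_files);
	file = ctx->user_files[index];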
This allows an application to call accept4() in an async fashion. Like
other opcodes, we first try a non-blocking accept, then punt to async
context if we have to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
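From userspace this looks like any other SQE (a sketch assuming
liburing's io_uring_prep_accept(); listen_fd and ring are set up
elsewhere):
	struct sockaddr_in peer;
	socklen_t peer_len = sizeof(peer);
	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	/* Completes with the accepted fd (or -errno) in cqe->res. */
	io_uring_prep_accept(sqe, listen_fd, (struct sockaddr *)&peer,
			     &peer_len, 0);
	sqe->user_data = 42;
	io_uring_submit(&ring);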
This is in preparation for adding opcodes that need to add new files
to a process file table, for system calls like open(2) or accept4(2).
If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
item. If work that needs to get punted to async context has this
set, the async worker will assume the original task's file table before
executing the work.
Note that opcodes that need access to the current files of an
application cannot be done through IORING_SETUP_SQPOLL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop various work-arounds we have for workqueues:
- We no longer need the async_list for tracking sequential IO.
- We don't have to maintain our own mm tracking/setting.
- We don't need a separate workqueue for buffered writes. This didn't
even work that well to begin with, as it was suboptimal for multiple
buffered writers on multiple files.
- We can properly cancel pending interruptible work. This fixes
deadlocks particularly with socket IO, where we cannot cancel requests
when the io_uring is closed. Hence the ring will wait forever for
these requests to complete, which may never happen. This is different
from disk IO where we know requests will complete in a finite amount
of time.
- Since we are able to cancel interruptible work that is already
running, we can implement file table support for work. We need that
for supporting system calls that add to a process file table.
- It gets us one step closer to adding async support for any system
call.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for io-wq, a smaller and specialized thread pool
implementation. This is meant to replace workqueues for io_uring. Among
the reasons for this addition are:
- We can assign memory context smarter and more persistently if we
manage the lifetime of threads.
- We can drop various work-arounds we have in io_uring, like the
async_list.
- We can implement hashed work insertion, to manage concurrency of
buffered writes without needing a) an extra workqueue, or b)
needlessly making the concurrency of said workqueue very low
which hurts performance of multiple buffered file writers.
- We can implement cancel through signals, for cancelling
interruptible work like read/write (or send/recv) to/from sockets.
- We need the above cancel for being able to assign and use file tables
from a process.
- We can implement a more thorough cancel operation in general.
- We need it to move towards a syslet/threadlet model for even faster
async execution. For that we need to take ownership of the used
threads.
This list is just off the top of my head. Performance should be the
same, or better, at least that's what I've seen in my testing. io-wq
supports basic NUMA functionality, setting up a pool per node.
io-wq hooks up to the scheduler schedule in/out just like workqueue
and uses that to drive the need for more/less workers.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'fuse-fixes-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"Mostly virtiofs fixes, but also fixes a regression and couple of
longstanding data/metadata writeback ordering issues"
* tag 'fuse-fixes-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: redundant get_fuse_inode() calls in fuse_writepages_fill()
fuse: Add changelog entries for protocols 7.1 - 7.8
fuse: truncate pending writes on O_TRUNC
fuse: flush dirty data/metadata before non-truncate setattr
virtiofs: Remove set but not used variable 'fc'
virtiofs: Retry request submission from worker context
virtiofs: Count pending forgets as in_flight forgets
virtiofs: Set FR_SENT flag only after request has been sent
virtiofs: No need to check fpq->connected state
virtiofs: Do not end request in submission context
fuse: don't advise readdirplus for negative lookup
fuse: don't dereference req->args on finished request
virtio-fs: don't show mount options
virtio-fs: Change module name to virtiofs.ko
Commit fb5ccc9878 ("io_uring: Fix broken links with offloading")
introduced a potential performance regression with unconditionally
taking mm even for READ/WRITE_FIXED operations.
Bring the logic handling it back. mm-faulted requests will go through
the generic submission path, thus honoring links and drains, but will
fail further on the req->has_user check.
Fixes: fb5ccc9878 ("io_uring: Fix broken links with offloading")
Cc: stable@vger.kernel.org # v5.4
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
submit->index is used only for the inbound check in the submission path
(i.e. head < ctx->sq_entries). However, it will always be true, as
1. it's already validated by io_get_sqring()
2. ctx->sq_entries can't be changed in between, because of the held
ctx->uring_lock and ctx->refs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To trace io_uring activity one can get information from the workqueue
and io trace events, but it looks like some parts could be hard to
identify via this approach.
important to be able to reason about many aspects of it, hence introduce
the set of tracing events.
All such events could be roughly divided into two categories:
* those that help to understand correctness (from both the kernel
and an application point of view). E.g. a ring creation, file
registration, or waiting for an available CQE. The proposed approach is to
get a pointer to the original structure of interest (ring context, or
request), and then find the relevant events. io_uring_queue_async_work
also exposes a pointer to the work_struct, to be able to track down
corresponding workqueue events.
* those that provide performance-related information. Mostly it's about
events that change the flow of requests, e.g. whether an async work
was queued, or delayed due to some dependencies. Another important
case is how io_uring optimizations (e.g. registered files) are
utilized.
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We might have cases where the need for a specific timeout is gone, add
support for canceling an existing timeout operation. This works like the
POLL_REMOVE command, where the application passes in the user_data of
the timeout it wishes to cancel in the sqe->addr field.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a pretty trivial addition on top of the relative timeouts
we have now, but it's handy for ensuring tighter timing for those
that are building scheduling primitives on top of io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no functional change, just a code cleanup: use s->in_async
so the code knows what context it is in.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently size the CQ ring as twice the SQ ring, to allow some
flexibility in not overflowing the CQ ring. This is done because the
SQE life time is different than that of the IO request itself, the SQE
is consumed as soon as the kernel has seen the entry.
Certain applications don't need a huge SQ ring size, since they just
submit IO in batches. But they may have a lot of requests pending, and
hence need a big CQ ring to hold them all. By allowing the application
to control the CQ ring size multiplier, we can cater to those
applications more efficiently.
If an application wants to define its own CQ ring size, it must set
IORING_SETUP_CQSIZE in the setup flags, and fill out
io_uring_params->cq_entries. The value must be a power of two.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
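For example (liburing sketch; sizes arbitrary):
	struct io_uring ring;
	struct io_uring_params p = { 0 };

	p.flags = IORING_SETUP_CQSIZE;
	p.cq_entries = 4096;		/* must be a power of two */

	/* Small SQ ring (64 entries), much larger CQ ring. */
	ret = io_uring_queue_init_params(64, &ring, &p);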
Allows the application to remove/replace/add files to/from a file set.
Passes in a struct:
struct io_uring_files_update {
	__u32 offset;
	__s32 *fds;
};
that holds an array of fds, size of array passed in through the usual
nr_args part of the io_uring_register() system call. The logic is as
follows:
1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
the set.
2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
replaced with ->fds[i].
For case #2, if the existing file is currently empty (fd == -1), the
new fd is simply added to the array.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
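In userspace terms, an update is a single io_uring_register() call
(raw-syscall sketch; slot numbers and fds are illustrative):
	int fds[2] = { -1, new_fd };		/* remove slot 3, replace slot 4 */
	struct io_uring_files_update up = {
		.offset = 3,
		.fds	= fds,
	};

	ret = syscall(__NR_io_uring_register, ring_fd,
		      IORING_REGISTER_FILES_UPDATE, &up, 2);	/* nr_args = 2 */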
This is in preparation for allowing updates to fixed file sets without
requiring a full unregister+register.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently any dependent link is executed from a new workqueue context,
which means that we'll be doing a context switch per link in the chain.
If we are running the completion of the current request from our async
workqueue and find that the next request is a link, then run it directly
from the workqueue context instead of forcing another switch.
This improves the performance of linked SQEs, and reduces the CPU
overhead.
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzkaller reported an issue where it looks like a malicious app can
trigger a use-after-free by reading the ctx ->sq_array and ->rings
values right after having installed the ring fd in the process file
table.
Defer ring fd installation until after we're done reading those
values.
Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_queue_link_head() owns shadow_req after taking it as an argument.
By not freeing it in case of an error, it can leak the request along
with taken ctx->refs.
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Seven cifs/smb3 fixes, including three for stable"
* tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs
CIFS: Fix use after free of file info structures
CIFS: Fix retry mid list corruption on reconnects
cifs: Fix missed free operations
CIFS: avoid using MID 0xFFFF
cifs: clarify comment about timestamp granularity for old servers
cifs: Handle -EINPROGRESS only when noblockcnt is set
Merge tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-block
Pull block and io_uring fixes from Jens Axboe:
"A bit bigger than usual at this point in time, mostly due to some good
bug hunting work by Pavel that resulted in three io_uring fixes from
him and two from me. Anyway, this pull request contains:
- Revert of the submit-and-wait optimization for io_uring, it can't
always be done safely. It depends on commands always making
progress on their own, which isn't necessarily the case outside of
strict file IO. (me)
- Series of two patches from me and three from Pavel, fixing issues
with shared data and sequencing for io_uring.
- Lastly, two timeout sequence fixes for io_uring (zhangyi)
- Two nbd patches fixing races (Josef)
- libahci regulator_get_optional() fix (Mark)"
* tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-block:
nbd: verify socket is supported during setup
ata: libahci_platform: Fix regulator_get_optional() misuse
nbd: handle racing with error'ed out commands
nbd: protect cmd->status with cmd->lock
io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
io_uring: used cached copies of sq->dropped and cq->overflow
io_uring: Fix race for sqes with userspace
io_uring: Fix broken links with offloading
io_uring: Fix corrupted user_data
io_uring: correct timeout req sequence when inserting a new entry
io_uring : correct timeout req sequence when waiting timeout
io_uring: revert "io_uring: optimize submit_and_wait API"
- Fix a performance regression that followed from a fix to the
conversion of the fsdax implementation to the xarray. v5.3 users
report that they stop seeing huge page mappings on an application +
filesystem layout that was seeing huge pages previously on v5.2.
Merge tag 'dax-fix-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fix from Dan Williams:
"Fix a performance regression that followed from a fix to the
conversion of the fsdax implementation to the xarray. v5.3 users
report that they stop seeing huge page mappings on an application +
filesystem layout that was seeing huge pages previously on v5.2"
* tag 'dax-fix-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
fs/dax: Fix pmd vs pte conflict detection
According to the commit message of the original commit c75b1d9421 ("fs:
add fcntl() interface for setting/getting write life time hints"),
as well as the userspace library[1] and man page update[2], the R/W hint
constants are intended to have an RWH_* prefix. However,
RWF_WRITE_LIFE_NOT_SET retained the "RWF_*" prefix used in the early
versions of the proposed patch set[3].
Rename it and provide the old name as a synonym for the new one
for backward compatibility.
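A sketch of what the rename with a backward-compatibility synonym can
look like in the UAPI header (value and exact placement assumed, not
quoted from the patch):

/* include/uapi/linux/fcntl.h (sketch) */
#define RWH_WRITE_LIFE_NOT_SET	0

/* old name, kept as a synonym for backward compatibility */
#define RWF_WRITE_LIFE_NOT_SET	RWH_WRITE_LIFE_NOT_SET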
[1] https://github.com/axboe/fio/commit/bd553af6c849
[2] https://github.com/mkerrisk/man-pages/commit/580082a186fd
[3] https://www.mail-archive.com/linux-block@vger.kernel.org/msg09638.html
Fixes: c75b1d9421 ("fs: add fcntl() interface for setting/getting write life time hints")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a task that is allocating metadata needs to wait for the async
reclaim job to process its ticket and gets a signal (because it was killed
for example) before doing the wait, the task ends up erroring out but
with space reserved for its ticket, which never gets released, resulting
in a metadata space leak (more specifically a leak in the bytes_may_use
counter of the metadata space_info object).
Here's the sequence of steps leading to the space leak:
1) A task tries to create a file for example, so it ends up trying to
start a transaction at btrfs_create();
2) The filesystem is currently in a state where there is not enough
metadata free space to satisfy the transaction's needs. So at
space-info.c:__reserve_metadata_bytes() we create a ticket and
add it to the list of tickets of the space info object. Also,
because the metadata async reclaim job is not running, we queue
a job to run metadata reclaim;
3) In the meanwhile the task receives a signal (like SIGTERM from
a kill command for example);
4) After queueing the async reclaim job, at __reserve_metadata_bytes(),
we unlock the metadata space info and call handle_reserve_ticket();
5) That last function calls wait_reserve_ticket(), which acquires the
lock from the metadata space info. Then in the first iteration of
its while loop, it calls prepare_to_wait_event(), which returns
-ERESTARTSYS because the task has a pending signal. As a result,
we set the error field of the ticket to -EINTR and exit the while
loop without deleting the ticket from the list of tickets (in the
space info object). After exiting the loop we unlock the space info;
6) The async reclaim job is able to release enough metadata, acquires
the metadata space info's lock and then reserves space for the ticket,
since the ticket is still in the list of (non-priority) tickets. The
space reservation happens at btrfs_try_granting_tickets(), called from
maybe_fail_all_tickets(). This increments the bytes_may_use counter
from the metadata space info object, sets the ticket's bytes field to
zero (meaning success, that space was reserved) and removes it from
the list of tickets;
7) wait_reserve_ticket() returns, with the error field of the ticket
set to -EINTR. Then handle_reserve_ticket() just propagates that error
to the caller. Because an error was returned, the caller does not
release the reserved space, since the expectation is that any error
means no space was reserved.
Fix this by removing the ticket from the list, while holding the space
info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
an error.
Also add some comments and an assertion to guarantee we never end up with
a ticket that has an error set and a bytes counter field set to zero, to
more easily detect regressions in the future.
This issue could be triggered sporadically by some test cases from fstests
such as generic/269 for example, which tries to fill a filesystem and then
kills fsstress processes running in the background.
When this issue happens, we get a warning in syslog/dmesg when unmounting
the filesystem, like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
CPU: 0 PID: 13240 Comm: umount Tainted: G W L 5.3.0-rc8-btrfs-next-48+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
FS: 00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x1ad/0x390 [btrfs]
generic_shutdown_super+0x6c/0x110
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x70
cleanup_mnt+0xb4/0x160
task_work_run+0x7e/0xc0
exit_to_usermode_loop+0xfa/0x100
do_syscall_64+0x1cb/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4274d2cb37
(...)
RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bcf4b235461b26f6 ]---
BTRFS info (device sdb): space_info 4 has 19116032 free, is full
BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
Fixes: 374bf9c5cd ("btrfs: unify error handling for ticket flushing")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script will cause a false alert in the devid check.
#!/bin/bash
dev1=/dev/test/test
dev2=/dev/test/scratch1
mnt=/mnt/btrfs
umount $dev1 &> /dev/null
umount $dev2 &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev1
mount $dev1 $mnt
_fail()
{
	echo "!!! FAILED !!!"
	exit 1
}
for ((i = 0; i < 4096; i++)); do
	btrfs dev add -f $dev2 $mnt || _fail
	btrfs dev del $dev1 $mnt || _fail
	dev_tmp=$dev1
	dev1=$dev2
	dev2=$dev_tmp
done
[CAUSE]
Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as the
upper limit for devid. But we can have devid holes, just like in the above
script.
So the check for devid is incorrect and could cause a false alert.
[FIX]
Just remove the whole devid check. We don't have any hard requirement
for devid assignment.
Furthermore, even if a devid gets corrupted by a bitflip, we still have
dev extent verification at mount time, so corrupted data won't sneak
in.
This fixes fstests btrfs/194.
Reported-by: Anand Jain <anand.jain@oracle.com>
Fixes: ab4ba2e133 ("btrfs: tree-checker: Verify dev item")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For SYSTEM chunks, despite the regular chunk item size limit, there is
another limit due to system chunk array size.
The extra limit was removed in a refactoring, so add it back.
Fixes: e3ecdb3fde ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently assume that submissions from the sqthread are successful,
and if IO polling is enabled, we use that value for knowing how many
completions to look for. But if we overflowed the CQ ring or some
requests simply got errored and already completed, they won't be
available for polling.
For the case of IO polling and SQTHREAD usage, look at the pending
poll list. If it ever hits empty, then we know that we don't have
any more pollable requests in flight. For that case, simply reset
the inflight count to zero.
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently use the ring values directly, but that can lead to issues
if the application is malicious and changes these values on our behalf.
Create in-kernel cached versions of them, and just overwrite the user
side when we update them. This is similar to how we treat the sq/cq
ring tail/head updates.
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_ring_submit() finalises with
1. io_commit_sqring(), which releases the sqes to userspace
2. then a call to io_queue_link_head(), which accesses the released head's sqe
Reorder them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_sq_thread() processes sqes in batches of 8 without considering links. As a
result, links will be randomly subdivided.
The easiest way to fix it is to call io_get_sqring() inside
io_submit_sqes(), as io_ring_submit() does.
Downsides:
1. This removes the optimisation of not grabbing mm_struct for fixed files
2. It submits all sqes in one go, without finer-grained scheduling
interleaved with cq processing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a bug where failed linked requests are returned not with the
specified @user_data, but with garbage from the kernel stack.
The reason is that io_fail_links() uses req->user_data, which is
uninitialised when called from io_queue_sqe() on fail path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the second call of should_expire() in there ends up
grabbing and returning a new reference to the dentry, we need
to drop it before continuing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There's a deadlock that is possible and can easily be seen with
a test where multiple readers open/read/close the same file
and a disruption occurs causing reconnect. The deadlock is due to
a reader thread inside cifs_strict_readv calling down_read and
obtaining lock_sem, and then, after reconnect, calling down_read
a second time inside cifs_reopen_file. If in
between the two down_read calls a down_write comes from
another process, deadlock occurs.
                CPU0                           CPU1
                ----                           ----
cifs_strict_readv()
 down_read(&cifsi->lock_sem);
                               _cifsFileInfo_put
                                  OR
                               cifs_new_fileinfo
                                down_write(&cifsi->lock_sem);
cifs_reopen_file()
 down_read(&cifsi->lock_sem);
Fix the above by changing all down_write(lock_sem) calls to a
down_write_trylock(lock_sem)/msleep() loop, which in turn
makes the second down_read call benign since it will never
block behind the writer while holding lock_sem.
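The loop described above can be wrapped in a small helper, roughly like
this (helper name and sleep interval are assumptions, not quoted from the
patch):

/* sketch: take lock_sem for write without blocking reconnect readers */
static void cifs_down_write_sketch(struct rw_semaphore *sem)
{
	while (!down_write_trylock(sem))
		msleep(10);
}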
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed--by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Currently the code assumes that if a file info entry belongs
to the lists of open file handles of an inode and a tcon, then
it has a non-zero reference count. The recent changes broke that
assumption when putting the last reference of the file info.
There may be a situation where a file is being deleted but
nothing prevents another thread from referencing it again
and starting to use it. This happens because we do not hold
the inode list lock while checking the number of references
of the file info structure. Fix this by doing the proper
locking when doing the check.
Fixes: 487317c994 ("cifs: add spinlock for the openFileList to cifsInodeInfo")
Fixes: cb248819d2 ("cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic")
Cc: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When the client hits reconnect it iterates over the mid
pending queue marking entries for retry and moving them
to a temporary list to issue callbacks later without holding
GlobalMid_Lock. At the same time there is no guarantee that
mids can't be removed from the temporary list or even
freed completely by another thread. It may cause a temporary
list corruption:
[ 430.454897] list_del corruption. prev->next should be ffff98d3a8f316c0, but was 2e885cb266355469
[ 430.464668] ------------[ cut here ]------------
[ 430.466569] kernel BUG at lib/list_debug.c:51!
[ 430.468476] invalid opcode: 0000 [#1] SMP PTI
[ 430.470286] CPU: 0 PID: 13267 Comm: cifsd Kdump: loaded Not tainted 5.4.0-rc3+ #19
[ 430.473472] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 430.475872] RIP: 0010:__list_del_entry_valid.cold+0x31/0x55
...
[ 430.510426] Call Trace:
[ 430.511500] cifs_reconnect+0x25e/0x610 [cifs]
[ 430.513350] cifs_readv_from_socket+0x220/0x250 [cifs]
[ 430.515464] cifs_read_from_socket+0x4a/0x70 [cifs]
[ 430.517452] ? try_to_wake_up+0x212/0x650
[ 430.519122] ? cifs_small_buf_get+0x16/0x30 [cifs]
[ 430.521086] ? allocate_buffers+0x66/0x120 [cifs]
[ 430.523019] cifs_demultiplex_thread+0xdc/0xc30 [cifs]
[ 430.525116] kthread+0xfb/0x130
[ 430.526421] ? cifs_handle_standard+0x190/0x190 [cifs]
[ 430.528514] ? kthread_park+0x90/0x90
[ 430.530019] ret_from_fork+0x35/0x40
Fix this by obtaining extra references for mids being retried
and marking them as MID_DELETED which indicates that such a mid
has been dequeued from the pending list.
Also move mid cleanup logic from DeleteMidQEntry to
_cifs_mid_q_entry_release which is called when the last reference
to a particular mid is put. This allows us to avoid any use-after-free
of response buffers.
The patch needs to be backported to stable kernels. A stable tag
is not mentioned below because the patch doesn't apply cleanly
to any actively maintained stable kernel.
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-and-tested-by: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
gfs2 and gfs2meta share an ->init_fs_context function which allocates an
args structure stored in fc->fs_private. gfs2 registers a ->free
function to free this memory when the fs_context is cleaned up, but
there was not one registered for gfs2meta, causing a leak.
Register a ->free function for gfs2meta. The existing gfs2_fc_free
function does what we need.
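A minimal sketch of the hook-up described, assuming gfs2meta has its own
fs_context_operations table (the ->get_tree name here is an assumption):

/* sketch: give the gfs2meta context the same ->free as gfs2 */
static const struct fs_context_operations gfs2_meta_context_ops = {
	.free     = gfs2_fc_free,	/* frees the args struct in fc->fs_private */
	.get_tree = gfs2_meta_get_tree,	/* name assumed */
};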
Reported-by: syzbot+c2fdfd2b783754878fb6@syzkaller.appspotmail.com
Fixes: 1f52aa08d1 ("gfs2: Convert gfs2 to fs_context")
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The sequence number of a timeout req (req->sequence) indicates the
expected completion request. Because each timeout req consumes a
sequence number, the sequence of each timeout req on the timeout
list shouldn't be the same. But now, we may get the same (also
incorrect) number if we insert a new entry before the last one, for
example when submitting the two timeout reqs below on a new ring instance.
req->sequence
req_1 (count = 2): 2
req_2 (count = 1): 2
Then, if we submit a nop req, req_2 will still time out even though the nop
req finished. This patch fixes the problem by adjusting the sequence number
of each reordered req when inserting a new entry.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The sequence numbers of reqs on the timeout_list before the current timeout
req should be adjusted in io_timeout_fn(), because the current timeout req
consumes a slot in the cq_ring and the cq_tail pointer is
increased; otherwise other timeout reqs may return early without
waiting for enough wait_nr.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are cases where it isn't always safe to block for submission,
even if the caller asked to wait for events as well. Revert the
previous optimization of doing that.
This reverts two commits:
bf7ec93c64
c576666863
Fixes: c576666863 ("io_uring: optimize submit_and_wait API")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
KUnit tests for decoding extended 64 bit timestamps that verify the
seconds part of [a/c/m] timestamps in ext4 inode structs are decoded
correctly.
Test data is derived from the table in the Inode Timestamps section of
Documentation/filesystems/ext4/inodes.rst.
KUnit tests run during boot and output the results to the debug log
in TAP format (http://testanything.org/). They are only useful for kernel devs
running the KUnit test harness and are not for inclusion into a production
build.
Signed-off-by: Iurii Zaikin <yzaikin@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Tested-by: Brendan Higgins <brendanhiggins@google.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Currently fuse_writepages_fill() calls get_fuse_inode() a few times with
the same argument.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Make sure cached writes are not reordered around open(..., O_TRUNC), with
the obvious wrong results.
Fixes: 4d99ff8f12 ("fuse: Turn writeback cache on")
Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If writeback cache is enabled, then writes might get reordered with
chmod/chown/utimes. The problem with this is that performing the write in
the fuse daemon might itself change some of these attributes. In such a case
the following sequence of operations will result in the file ending up with
the wrong mode, for example:
int fd = open("suid", O_WRONLY|O_CREAT|O_EXCL, 0755);
write(fd, "1", 1);
fchown(fd, 0, 0);
fchmod(fd, 04755);
close(fd);
This patch fixes this by flushing pending writes before performing
chown/chmod/utimes.
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Fixes: 4d99ff8f12 ("fuse: Turn writeback cache on")
Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Merge tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fixes of error handling cleanup of metadata accounting with qgroups
enabled
- fix swapped values for qgroup tracepoints
- fix race when handling full sync flag
- don't start unused worker thread, functionality removed already
* tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: check for the full sync flag while holding the inode lock during fsync
Btrfs: fix qgroup double free after failure to reserve metadata for delalloc
btrfs: tracepoints: Fix bad entry members of qgroup events
btrfs: tracepoints: Fix wrong parameter order for qgroup events
btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
btrfs: don't needlessly create extent-refs kernel thread
btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group()
Btrfs: add missing extents release on file extent cluster relocation error
Fixes gcc '-Wunused-but-set-variable' warning:
fs/fuse/virtio_fs.c: In function virtio_fs_wake_pending_and_unlock:
fs/fuse/virtio_fs.c:983:20: warning: variable fc set but not used [-Wunused-but-set-variable]
It is not used since commit 7ee1e2e631 ("virtiofs: No need to check
fpq->connected state")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Users reported a v5.3 performance regression and inability to establish
huge page mappings. A revised version of the ndctl "dax.sh" huge page
unit test identifies commit 23c84eb783 "dax: Fix missed wakeup with
PMD faults" as the source.
Update get_unlocked_entry() to check for NULL entries before checking
the entry order, otherwise NULL is misinterpreted as a present pte
conflict. The 'order' check needs to happen before the locked check, as
an unlocked entry at the wrong order must fall back to looking up the
correct order.
Reported-by: Jeff Smits <jeff.smits@intel.com>
Reported-by: Doug Nelson <doug.nelson@intel.com>
Cc: <stable@vger.kernel.org>
Fixes: 23c84eb783 ("dax: Fix missed wakeup with PMD faults")
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Link: https://lore.kernel.org/r/157167532455.3945484.11971474077040503994.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This type is used to pass the sigset_t from userland to the kernel,
but it was using the kernel native pointer type for the member
representing the compat userland pointer to the userland sigset_t.
This messes up the layout: the kernel consumes both the
userland pointer and the size members into the kernel-sized pointer, and
then reads garbage into the kernel sigsetsize. That makes the sigset_t
size consistency check fail, and consequently the syscall always
returns -EINVAL.
This breaks both libaio and strace on 32-bit userland running on 64-bit
kernels. And there are apparently no users in the wild of the current
broken layout (at least according to codesearch.debian.org and a brief
check over github.com search). So it looks safe to fix this directly
in the kernel, instead of either letting userland deal with this
permanently with the additional overhead or trying to make the syscall
infer what layout userland used, even though this is also being worked
around in libaio to temporarily cope with kernels that have not yet
been fixed.
We use a proper compat_uptr_t instead of a compat_sigset_t pointer.
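A sketch of the corrected compat structure, with both members now having a
fixed, compat-sized layout (struct and field names assumed):

/* sketch */
struct __compat_aio_sigset {
	compat_uptr_t	sigmask;	/* userland pointer to the compat sigset_t */
	compat_size_t	sigsetsize;
};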
Fixes: 7a074e96de ("aio: implement io_pgetevents")
Signed-off-by: Guillem Jover <guillem@hadrons.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
memset the struct fscrypt_info to zero before freeing. This isn't
really needed currently, since there's no secret key directly in the
fscrypt_info. But there's a decent chance that someone will add such a
field in the future, e.g. in order to use an API that takes a raw key
such as siphash(). So it's good to do this as a hardening measure.
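A minimal sketch of the hardening step, assuming the usual slab-cache
teardown path (function and cache names assumed):

/* sketch: zero the object before returning it to the slab cache */
static void put_crypt_info_sketch(struct fscrypt_info *ci)
{
	memset(ci, 0, sizeof(*ci));
	kmem_cache_free(fscrypt_info_cachep, ci);
}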
Signed-off-by: Eric Biggers <ebiggers@google.com>
Now that ext4 and f2fs implement their own post-read workflow that
supports both fscrypt and fsverity, the fscrypt-only workflow based
around struct fscrypt_ctx is no longer used. So remove the unused code.
This is based on a patch from Chandan Rajendra's "Consolidate FS read
I/O callbacks code" patchset, but rebased onto the latest kernel, folded
__fscrypt_decrypt_bio() into fscrypt_decrypt_bio(), cleaned up
fscrypt_initialize(), and updated the commit message.
Originally-from: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Instead of open-coding the calculations for ESSIV handling, use an ESSIV
skcipher which does all of this under the hood. ESSIV was added to the
crypto API in v5.4.
This is based on a patch from Ard Biesheuvel, but reworked to apply
after all the fscrypt changes that went into v5.4.
Tested with 'kvm-xfstests -c ext4,f2fs -g encrypt', including the
ciphertext verification tests for v1 and v2 encryption policies.
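Under the hood this boils down to allocating an ESSIV-wrapped transform; a
sketch, where the exact algorithm string is an assumption rather than a
quote from the patch:

#include <crypto/skcipher.h>

/* sketch: the crypto API derives and applies the ESSIV IV internally */
static struct crypto_skcipher *alloc_essiv_cbc_tfm(void)
{
	return crypto_alloc_skcipher("essiv(cbc(aes),sha256)", 0, 0);
}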
Originally-from: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
If the regular request queue gets full, we currently sleep for a bit and
retry submission in the submitter's context. This assumes the submitter is not
holding any spin lock. But this assumption is not true for background
requests. For background requests, we are called with fc->bg_lock held.
This can lead to deadlock where one thread is trying submission with
fc->bg_lock held while request completion thread has called
fuse_request_end() which tries to acquire fc->bg_lock and gets blocked. As
request completion thread gets blocked, it does not make further progress
and that means queue does not get empty and submitter can't submit more
requests.
To solve this issue, retry submission with the help of a worker, instead of
retrying in submitter's context. We already do this for hiprio/forget
requests.
Reported-by: Chirantan Ekbote <chirantan@chromium.org>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If virtqueue is full, we put forget requests on a list and these forgets
are dispatched later using a worker. As of now we don't count these forgets
in the fsvq->in_flight variable. This means that when the queue is being
drained, we need special logic to first drain these pending requests and then
wait for fsvq->in_flight to go to zero.
By counting pending forgets in fsvq->in_flight, we can get rid of special
logic and just wait for in_flight to go to zero. The worker thread will kick in
and drain all the forgets anyway, bringing in_flight down to zero.
I also need similar logic for the normal request queue in the next patch,
where I am about to defer request submission to the worker context if the
queue is full. This simplifies the code a bit.
Also add two helper functions to inc/dec in_flight. The decrement helper
will later be used to trigger a completion when in_flight reaches zero, as
sketched below.
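The two helpers can be as small as the following sketch (names and locking
expectations are assumptions):

/* sketch; callers are expected to hold the fsvq lock */
static void inc_in_flight_req(struct virtio_fs_vq *fsvq)
{
	fsvq->in_flight++;
}

static void dec_in_flight_req(struct virtio_fs_vq *fsvq)
{
	WARN_ON(fsvq->in_flight <= 0);
	fsvq->in_flight--;
	/* a later patch can complete() a waiter here once in_flight hits zero */
}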
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The FR_SENT flag should be set when a request has been successfully sent
over the virtqueue. This is used by the interrupt logic to figure out if an
interrupt request should be sent or not.
Also add the request to the fpq->processing list after sending it successfully.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In virtiofs we keep per-queue connected state in virtio_fs_vq->connected
and use that to end requests if the queue is not connected. And virtiofs does
not even touch the fpq->connected state.
We probably need to merge these two at some point. For now,
simplify the code a bit and do not worry about checking the state of
fpq->connected.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The submission context can hold locks which the end-request code tries to take
again, and deadlock can occur. For example, fc->bg_lock: if a background
request is being submitted, it might hold fc->bg_lock, and if we could not
submit the request (because the device went away) and tried to end it, then
deadlock happens. During testing, I also got a warning from the deadlock
detection code.
So put requests on a list and end requests from a worker thread.
I got the following warning from the deadlock detector.
[ 603.137138] WARNING: possible recursive locking detected
[ 603.137142] --------------------------------------------
[ 603.137144] blogbench/2036 is trying to acquire lock:
[ 603.137149] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_request_end+0xdf/0x1c0 [fuse]
[ 603.140701]
[ 603.140701] but task is already holding lock:
[ 603.140703] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_simple_background+0x92/0x1d0 [fuse]
[ 603.140713]
[ 603.140713] other info that might help us debug this:
[ 603.140714] Possible unsafe locking scenario:
[ 603.140714]
[ 603.140715] CPU0
[ 603.140716] ----
[ 603.140716] lock(&(&fc->bg_lock)->rlock);
[ 603.140718] lock(&(&fc->bg_lock)->rlock);
[ 603.140719]
[ 603.140719] *** DEADLOCK ***
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If the FUSE_READDIRPLUS_AUTO feature is enabled, then lookups on a
directory before/during readdir are used as an indication that READDIRPLUS
should be used instead of READDIR. However if the lookup turns out to be
negative, then selecting READDIRPLUS makes no sense.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Move the check for an async request to after the check for the request
already being finished and done with.
Reported-by: syzbot+ae0bb7aae3de6b4594e2@syzkaller.appspotmail.com
Fixes: d49937749f ("fuse: stop copying args to fuse_req")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
cifs_setattr_nounix has two paths which miss the free operations
for xid and fullpath.
Use goto cifs_setattr_exit like the other paths to fix them.
CC: Stable <stable@vger.kernel.org>
Fixes: aa081859b1 ("cifs: flush before set-info if we have writeable handles")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
According to the MS-CIFS specification, MID 0xFFFF should not be used by the
CIFS client, but we actually do use it. Besides, this has proven to cause races
leading to oopses between SendReceive2/cifs_demultiplex_thread. On SMB1, the
MID is a 2-byte value that is easy to reach in CurrentMid, which may conflict
with an oplock break notification request coming from the server.
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
It could be confusing why we set granularity to 1 second rather
than 2 seconds (1 second is the max the VFS allows) for these
mounts to very old servers ...
Signed-off-by: Steve French <stfrench@microsoft.com>
We only want to avoid blocking in connect when mounting SMB root
filesystems, otherwise bail out from generic_ip_connect() so cifs.ko
can perform any reconnect failover appropriately.
This fixes DFS failover/reconnection tests in upstream buildbot.
Fixes: 8eecd1c2e5 ("cifs: Add support for root file systems")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge misc fixes from Andrew Morton:
"Rather a lot of fixes, almost all affecting mm/"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
scripts/gdb: fix debugging modules on s390
kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
mm/thp: allow dropping THP from page cache
mm/vmscan.c: support removing arbitrary sized pages from mapping
mm/thp: fix node page state in split_huge_page_to_list()
proc/meminfo: fix output alignment
mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch
mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition
mm: include <linux/huge_mm.h> for is_vma_temporary_stack
zram: fix race between backing_dev_show and backing_dev_store
mm/memcontrol: update lruvec counters in mem_cgroup_move_account
ocfs2: fix panic due to ocfs2_wq is null
hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()
mm: memblock: do not enforce current limit for memblock_phys* family
mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
mm/gup: fix a misnamed "write" argument, and a related bug
mm/gup_benchmark: add a missing "w" to getopt string
ocfs2: fix error handling in ocfs2_setattr()
mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
mm/memunmap: don't access uninitialized memmap in memunmap_pages()
...
Patch series "Fixes for THP in page cache", v2.
This patch (of 5):
Add extra space for FileHugePages and FilePmdMapped, so the output is
aligned with other rows.
Link: http://lkml.kernel.org/r/20191017164223.2762148-2-songliubraving@fb.com
Fixes: 60fbf0ab5d ("mm,thp: stats for file backed THP")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should set transfer_to[USRQUOTA/GRPQUOTA] to NULL in the error case before
jumping to do dqput().
Link: http://lkml.kernel.org/r/20191010082349.1134-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are three places where we access uninitialized memmaps, namely:
- /proc/kpagecount
- /proc/kpageflags
- /proc/kpagecgroup
We have initialized memmaps either when the section is online or when the
page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
garbage and in the worst case trigger kernel BUGs, especially with
CONFIG_PAGE_POISONING.
For example, not onlining a DIMM during boot and calling /proc/kpagecount
with CONFIG_PAGE_POISONING:
:/# cat /proc/kpagecount > tmp.test
BUG: unable to handle page fault for address: fffffffffffffffe
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
Oops: 0000 [#1] SMP NOPTI
CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
RIP: 0010:kpagecount_read+0xce/0x1e0
Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
Call Trace:
proc_reg_read+0x3c/0x60
vfs_read+0xc5/0x180
ksys_read+0x68/0xe0
do_syscall_64+0x5c/0xa0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
For now, let's drop support for ZONE_DEVICE from the three pseudo files
in order to fix this. To distinguish offline memory (with garbage
memmap) from ZONE_DEVICE memory with properly initialized memmaps, we
would have to check get_dev_pagemap() and pfn_zone_device_reserved()
right now. The usage of both (especially, special casing devmem) is
frowned upon and needs to be reworked.
The fundamental issue we have is:
if (pfn_to_online_page(pfn)) {
/* memmap initialized */
} else if (pfn_valid(pfn)) {
/*
* ???
* a) offline memory. memmap garbage.
* b) devmem: memmap initialized to ZONE_DEVICE.
* c) devmem: reserved for driver. memmap garbage.
* (d) devmem: memmap currently initializing - garbage)
*/
}
We'll leave the pfn_zone_device_reserved() check in stable_page_flags()
in place as that function is also used from memory failure. We now no
longer dump information about pages that are not in use anymore -
offline.
Link: http://lkml.kernel.org/r/20191009142435.3975-2-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
Cc: Pankaj gupta <pagupta@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request from Keith that address deadlocks, double resets,
memory leaks, and other regression.
- Fixup elv_support_iosched() for bio based devices (Damien)
- Fixup for the ahci PCS quirk (Dan)
- Socket O_NONBLOCK handling fix for io_uring (me)
- Timeout sequence io_uring fixes (yangerkun)
- MD warning fix for parameter default_layout (Song)
- blkcg activation fixes (Tejun)
- blk-rq-qos node deletion fix (Tejun)
* tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block:
nvme-pci: Set the prp2 correctly when using more than 4k page
io_uring: fix logic error in io_timeout
io_uring: fix up O_NONBLOCK handling for sockets
md/raid0: fix warning message for parameter default_layout
libata/ahci: Fix PCS quirk application
blk-rq-qos: fix first node deletion of rq_qos_del()
blkcg: Fix multiple bugs in blkcg_activate_policy()
io_uring: consider the overflow of sequence for timeout req
nvme-tcp: fix possible leakage during error flow
nvmet-loop: fix possible leakage during error flow
block: Fix elv_support_iosched()
nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
nvme: Wait for reset state when required
nvme: Prevent resets during paused controller state
nvme: Restart request timers in resetting state
nvme: Remove ADMIN_ONLY state
nvme-pci: Free tagset if no IO queues
nvme: retain split access workaround for capability reads
nvme: fix possible deadlock when nvme_update_formats fails
This was always meant to be a temporary thing, just for testing and to
see if it actually ever triggered.
The only thing that reported it was syzbot doing disk image fuzzing, and
then that warning is expected. So let's just remove it before -rc4,
because the extra sanity testing should probably go to -stable, but we
don't want the warning to do so.
Reported-by: syzbot+3031f712c7ad5dd4d926@syzkaller.appspotmail.com
Fixes: 8a23eb804c ("Make filldir[64]() verify the directory entry filename is valid")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A future-proofing decoding fix from Jeff intended for stable and
a patch for a mostly benign race from Dongsheng.
Merge tag 'ceph-for-5.4-rc4' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A future-proofing decoding fix from Jeff intended for stable and a
patch for a mostly benign race from Dongsheng"
* tag 'ceph-for-5.4-rc4' of git://github.com/ceph/ceph-client:
rbd: cancel lock_dwork if the wait is interrupted
ceph: just skip unrecognized info in ceph_reply_info_extra
If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not
tmp_nxt.
Fixes: 5da0fb1ab3 ("io_uring: consider the overflow of sequence for timeout req")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We've got two issues with the non-regular file handling for non-blocking
IO:
1) We don't want to re-do a short read in full for a non-regular file,
as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts,
we need to punt to async context even if the file is opened as
non-blocking. Otherwise the caller always gets -EAGAIN.
Add two new request flags to handle these cases. One is just a cache
of the inode S_ISREG() status, the other tells io_uring that we always
need to punt this request to async context, even if REQ_F_NOWAIT is set.
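A sketch of the two flags as described (names and bit positions are
assumptions, not quoted from the patch):

/* sketch */
#define REQ_F_ISREG	(1U << 10)	/* cached S_ISREG() result for the file */
#define REQ_F_MUST_PUNT	(1U << 11)	/* always punt to async, even with REQ_F_NOWAIT */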
Cc: stable@vger.kernel.org
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Fix a timestamp signedness problem in the new bulkstat ioctl.
Merge tag 'xfs-5.4-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
"The single fix converts the seconds field in the recently added XFS
bulkstat structure to a signed 64-bit quantity.
The structure layout doesn't change and so far there are no users of
the ioctl to break because we only publish xfs ioctl interfaces
through the XFS userspace development libraries, and we're still
working on a 5.3 release"
* tag 'xfs-5.4-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: change the seconds fields in xfs_bulkstat to signed
We were checking for the full fsync flag in the inode before locking the
inode, which is racy, since at that time it might not be set but after we
acquire the inode lock some other task may have set it. One case where
this can happen is on a system low on memory where some concurrent task
failed to allocate an extent map and therefore set the full sync flag on
the inode, to force the next fsync to work in full mode.
A consequence of missing the full fsync flag being set is hitting the problems
fixed by commit 0c713cbab6 ("Btrfs: fix race between ranged fsync and
writeback of adjacent ranges"): a BUG_ON() when dropping extents from a log
tree, assertion failures at tree-log.c:copy_items(), or all sorts
of weird inconsistencies after replaying a log due to file extent items
representing ranges that overlap.
So just move the check such that it's done after locking the inode and
before starting writeback again.
Fixes: 0c713cbab6 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to reserve metadata for delalloc operations we end up releasing
the previously reserved qgroup amount twice, once explicitly under the
'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once
again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a
value of 'true' for its 'qgroup_free' argument, which results in
btrfs_qgroup_free_meta_prealloc() being called again, so we end up having
a double free.
Also if we fail to reserve the necessary qgroup amount, we jump to the
label 'out_fail', which calls btrfs_inode_rsv_release() and that in turn
calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to
reserve any qgroup amount. So we freed some amount we never reserved.
So fix this by removing the call to btrfs_inode_rsv_release() in the
failure path, since it's not necessary at all as we haven't changed the
inode's block reserve in any way at this point.
Fixes: c8eaeac7b7 ("btrfs: reserve delalloc metadata differently")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For the btrfs:qgroup_meta_reserve event, the trace event can output garbage:
qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=DATA diff=2
The diff should always be aligned to the sector size (4k), so there is
definitely something wrong.
[CAUSE]
For the wrong @diff, it's caused by wrong parameter order.
The correct parameters are:
struct btrfs_root, s64 diff, int type.
However the parameters used are:
struct btrfs_root, int type, s64 diff.
Fixes: 4ee0d8832c ("btrfs: qgroup: Update trace events for metadata reservation")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After do_add_mount() returns success, the caller doesn't hold a
reference to the 'struct mount' anymore. So it's invalid to access it
in mnt_warn_timestamp_expiry().
Fix it by calling mnt_warn_timestamp_expiry() before do_add_mount()
rather than after, and adjusting the warning message accordingly.
Reported-by: syzbot+da4f525235510683d855@syzkaller.appspotmail.com
Fixes: f8b92ba67c ("mount: Add mount warning for impending timestamp expiry")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.
[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even if it's not necessary
In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */
/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */
/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)
This means that if btrfs_delalloc_release_extents() determines to free some
space, then that space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() rather
than btrfs_qgroup_convert_reserved_meta().
The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e3 ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug
Since commit f5fef45936 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.
[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().
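In terms of the interface, the change boils down to dropping the flag from
the prototype (prototypes sketched here, not quoted from the patch):

/* before (sketch of the assumed prototype) */
void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes,
				    bool qgroup_free);

/* after: callers no longer choose; PREALLOC meta space is always freed */
void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);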
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
64-bit time is a signed quantity in the kernel, so the bulkstat
structure should reflect that. Note that the structure size stays
the same and that we have not yet published userspace headers for this
new ioctl so there are no users to break.
Fixes: 7035f9724f ("xfs: introduce new v5 bulkstat structure")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In the future, we're going to want to extend the ceph_reply_info_extra
for create replies. Currently though, the kernel code doesn't accept an
extra blob that is larger than the expected data.
Change the code to skip over any unrecognized fields at the end of the
extra blob, rather than returning -EIO.
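The tolerant decode can be sketched like this (the function and field names
here are illustrative, not the real ceph_reply_info_extra parser):
static int decode_extra_blob(void **p, void *end, u32 declared_len)
{
	void *blob_end = *p + declared_len;

	if (blob_end > end)
		return -EIO;		/* a truncated blob is still an error */

	/* ... decode the known fields, advancing *p ... */

	*p = blob_end;			/* skip unrecognized trailing fields */
	return 0;
}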
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently we recalculate the sequence of a timeout with 'req->sequence =
ctx->cached_sq_head + count - 1', and judge the right place to insert
into timeout_list by comparing the number of requests we still expect to
complete. But we have not considered overflow:
1. ctx->cached_sq_head + count - 1 may overflow, so a bigger count for
the new timeout req can end up with a smaller req->sequence.
2. The current cached_sq_head may have overflowed compared with an
earlier req, which also leaves the timeout req with a smaller req->sequence.
This overflow misorders timeout_list, which leads to the wrong completion
order of the timeouts. Fix it by reusing req->submit.sequence to store the
count, and change the insertion-sort logic in io_timeout.
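For reference, a wraparound-safe comparison of such 32-bit sequences looks
like the standalone sketch below (illustrative only, not the actual io_uring
code):
#include <stdbool.h>
#include <stdint.h>

/*
 * With 32-bit sequence counters that are allowed to wrap, "a < b" is wrong
 * near the wrap point. Comparing the signed difference keeps the ordering
 * correct as long as the two sequences are less than 2^31 apart.
 */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* e.g. seq_before(0xfffffffe, 0x00000001) is true despite the wraparound */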
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Virtio-fs does not accept any mount options, so it's confusing and wrong to
show any in /proc/mounts.
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The patch 32b593bfcb ("Btrfs: remove no longer used function to run
delayed refs asynchronously") removed the async delayed refs, but the
thread is still being created, without any use. Remove it to avoid resource
consumption.
Fixes: 32b593bfcb ("Btrfs: remove no longer used function to run delayed refs asynchronously")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix kernel-doc warning in fs/libfs.c:
fs/libfs.c:496: warning: Excess function parameter 'available' description in 'simple_write_end'
Link: http://lkml.kernel.org/r/5fc9d70b-e377-0ec9-066a-970d49579041@infradead.org
Fixes: ad2a722f19 ("libfs: Open code simple_commit_write into only user")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Boaz Harrosh <boazh@netapp.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warning in fs/direct-io.c:
fs/direct-io.c:258: warning: Excess function parameter 'offset' description in 'dio_complete'
Also, don't mark this function as having kernel-doc notation since it is
not exported.
Link: http://lkml.kernel.org/r/97908511-4328-4a56-17fe-f43a1d7aa470@infradead.org
Fixes: 6d544bb4d9 ("dio: centralize completion in dio_complete()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Zach Brown <zab@zabbo.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have been calling it virtio_fs, and even the file name is virtio_fs.c. The
module name is virtio_fs.ko, but when registering the filesystem the user is
supposed to specify the filesystem type as "virtiofs".
Masayoshi Mizuma reported that he specified the filesystem type as "virtio_fs"
and got this warning on the console.
------------[ cut here ]------------
request_module fs-virtio_fs succeeded, but still no fs?
WARNING: CPU: 1 PID: 1234 at fs/filesystems.c:274 get_fs_type+0x12c/0x138
Modules linked in: ... virtio_fs fuse virtio_net net_failover ...
CPU: 1 PID: 1234 Comm: mount Not tainted 5.4.0-rc1 #1
So it looks like the kernel could find the module virtio_fs.ko but could not
find the filesystem type after that.
It is probably better to rename the module to virtiofs.ko so that the above
warning goes away in case the user ends up specifying the wrong fs name.
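The connection between the type string and module loading can be sketched
like this (only the relevant registration pieces, not the complete virtiofs
code): mount -t virtiofs makes the VFS request the module "fs-virtiofs", so
the registered type name, the MODULE_ALIAS_FS() string and, ideally, the
module name itself should all agree on "virtiofs".
static struct file_system_type virtio_fs_type = {
	.owner	= THIS_MODULE,
	.name	= "virtiofs",
	/* .init_fs_context / .kill_sb etc. omitted in this sketch */
};

MODULE_ALIAS_FS("virtiofs");	/* expands to MODULE_ALIAS("fs-virtiofs") */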
Reported-by: Masayoshi Mizuma <msys.mizuma@gmail.com>
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
- Removed lockdown from tracefs itself and moved it to the trace
directory, having the open functions there do the lockdown checks.
- Fixed a few races with opening an instance file and the instance being
deleted (Discovered during the locked down updates). Kept separate
from the clean up code such that they can be backported to stable
easier.
- Cleaned up and consolidated the checks done when opening a trace
file, as there were multiple checks that need to be done, and it
did not make sense having them done in each open instance.
- Fixed a regression in the record mcount code.
- Small hw_lat detector tracer fixes.
- A trace_pipe read fix due to not initializing trace_seq.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXaNhphQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quDIAP4v08ARNdIh+r+c4AOBm3xsOuE/d9GB
I56ydnskm+x2JQD6Ap9ivXe9yDBIErFeHNtCoq7pM8YDI4YoYIB30N0GfwM=
=7oAu
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"A few tracing fixes:
- Remove lockdown from tracefs itself and moved it to the trace
directory. Have the open functions there do the lockdown checks.
- Fix a few races with opening an instance file and the instance
being deleted (Discovered during the lockdown updates). Kept
separate from the clean up code such that they can be backported to
stable easier.
- Clean up and consolidated the checks done when opening a trace
file, as there were multiple checks that need to be done, and it
did not make sense having them done in each open instance.
- Fix a regression in the record mcount code.
- Small hw_lat detector tracer fixes.
- A trace_pipe read fix due to not initializing trace_seq"
* tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
tracing/hwlat: Report total time spent in all NMIs during the sample
recordmcount: Fix nop_mcount() function
tracing: Do not create tracefs files if tracefs lockdown is in effect
tracing: Add locked_down checks to the open calls of files created for tracefs
tracing: Add tracing_check_open_get_tr()
tracing: Have trace events system open call tracing_open_generic_tr()
tracing: Get trace_array reference for available_tracers files
ftrace: Get a reference counter for the trace_array on filter files
tracefs: Revert ccbd54ff54 ("tracefs: Restrict tracefs when the kernel is locked down")
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2iBvAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvokEACaUozc0Y2L99CqJcRtoH8cwIAimjKTFSmi
UKijO/w11uc9xCIOospkM8iseRbpP/oCIwfYyVROi4LniPB1TSyQJoLF/MELeqIi
lOSg8RrXzC22h5+QG3BR5VO55me3MwIXmJLMMWXbh7FAuUg+wepVwHwzZBurd1Pz
C1wG8NU758AmqJdSGclUbdLLwTBkR+fXoi2EHfdbsBMqGmdZ5VRzCLx70grDcxq2
S9gt+2+diME6o6p1wko+m9pdhUdEy7epPE50K4SA+DU06I77OZpTdSAPilPyQHei
+3HDslzjCyos5/Ay+7hWl42jvyZ+x6GI5VcDLMTcMZsmxLy/WObDB4zcAixJrDV0
c7XWnwCCs2wMXYMsu0ASbkU32mcQ/Ik8CrX0W3/tP1aZBW2LZnOHvXYbybzpbgbo
QY47t3jz21uzF5kplBGxqaY1sPmLxp5V0vQmj9eN+Y7T+hiR/UgWKO/5B4CGNJe0
BQkXNl5cZAtqxmSE8AxlpOla7z6beZUljdhFc7K8cMDdhoatRgm2MBh2G9BB1ayo
VwHc1qLcvInpQrGiC4l5H/F8jrEs/PglknFSs1wlsav1gXA69J0cHA7LY6Czweaw
mkiKr7jA6duBBn6MRUThF5sBMFWYuxp3B/BA4wfDJsSfCmyhWN+OZlsDaEVrzk7o
/XK2Ew9R2Q==
=PUeL
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20191012' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Single small fix for a regression in the sequence logic for linked
commands"
* tag 'for-linus-20191012' of git://git.kernel.dk/linux-block:
io_uring: fix sequence logic for timeout requests
If on boot up, lockdown is activated for tracefs, don't even bother creating
the files. This can also prevent instances from being created if lockdown is
in effect.
Link: http://lkml.kernel.org/r/CAHk-=whC6Ji=fWnjh2+eS4b15TnbsS4VPVtvBOwCy1jjEG_JHQ@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Running the latest kernel through my "make instances" stress tests, I
triggered the following bug (with KASAN and kmemleak enabled):
mkdir invoked oom-killer:
gfp_mask=0x40cd0(GFP_KERNEL|__GFP_COMP|__GFP_RECLAIMABLE), order=0,
oom_score_adj=0
CPU: 1 PID: 2229 Comm: mkdir Not tainted 5.4.0-rc2-test #325
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
Call Trace:
dump_stack+0x64/0x8c
dump_header+0x43/0x3b7
? trace_hardirqs_on+0x48/0x4a
oom_kill_process+0x68/0x2d5
out_of_memory+0x2aa/0x2d0
__alloc_pages_nodemask+0x96d/0xb67
__alloc_pages_node+0x19/0x1e
alloc_slab_page+0x17/0x45
new_slab+0xd0/0x234
___slab_alloc.constprop.86+0x18f/0x336
? alloc_inode+0x2c/0x74
? irq_trace+0x12/0x1e
? tracer_hardirqs_off+0x1d/0xd7
? __slab_alloc.constprop.85+0x21/0x53
__slab_alloc.constprop.85+0x31/0x53
? __slab_alloc.constprop.85+0x31/0x53
? alloc_inode+0x2c/0x74
kmem_cache_alloc+0x50/0x179
? alloc_inode+0x2c/0x74
alloc_inode+0x2c/0x74
new_inode_pseudo+0xf/0x48
new_inode+0x15/0x25
tracefs_get_inode+0x23/0x7c
? lookup_one_len+0x54/0x6c
tracefs_create_file+0x53/0x11d
trace_create_file+0x15/0x33
event_create_dir+0x2a3/0x34b
__trace_add_new_event+0x1c/0x26
event_trace_add_tracer+0x56/0x86
trace_array_create+0x13e/0x1e1
instance_mkdir+0x8/0x17
tracefs_syscall_mkdir+0x39/0x50
? get_dname+0x31/0x31
vfs_mkdir+0x78/0xa3
do_mkdirat+0x71/0xb0
sys_mkdir+0x19/0x1b
do_fast_syscall_32+0xb0/0xed
I bisected this down to the addition of the proxy_ops into tracefs for
lockdown. It appears that the allocation of the proxy_ops and then freeing
it in the destroy_inode callback, is causing havoc with the memory system.
Reading the documentation about destroy_inode and talking with Linus about
this, this is buggy and wrong. When defining the destroy_inode() method, it
is expected that the destroy_inode() will also free the inode, and not just
the extra allocations done in the creation of the inode. The faulty commit
causes a memory leak of the inode data structure when they are deleted.
Instead of allocating the proxy_ops (and then having to free it) the checks
should be done by the open functions themselves, and not hack into the
tracefs directory. First revert the tracefs updates for locked_down and then
later we can add the locked_down checks in the kernel/trace files.
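A sketch of what "do the checks in the open functions" looks like, using the
lockdown LSM hook; the surrounding function here is illustrative, not one of
the actual tracing open routines:
static int example_trace_open(struct inode *inode, struct file *filp)
{
	int ret;

	/* refuse the open when the kernel is locked down for tracefs */
	ret = security_locked_down(LOCKDOWN_TRACEFS);
	if (ret)
		return ret;

	/* ... normal open work: grab the trace_array reference, etc. ... */
	return 0;
}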
Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home
Fixes: ccbd54ff54 ("tracefs: Restrict tracefs when the kernel is locked down")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Stable bugfixes:
- Fix O_DIRECT accounting of number of bytes read/written # v4.1+
Other fixes:
- Fix nfsi->nrequests count error on nfs_inode_remove_request()
- Remove redundant mirror tracking in O_DIRECT
- Fix leak of clp->cl_acceptor string
- Fix race to sk_err after xs_error_report
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl2g7gAACgkQ18tUv7Cl
QOtMSxAAyQj2BjzUviR+clFe4X6betnqse6wHr4gnA3EEyo5OiwYk/U/bevejOXW
mYevz+VkYVVaN4l8SLT/ZJEH7cJCbJUjy/cf6vvHdL0E+sE+41s0Qrl0wrX2NOyT
cgm90VTzwFCZe9e9i2jOehgyJ0zkc0+9H2YySYIsiw0MPHVOzR+t8lgoD9mVsYAn
d4L5VLOo0hP/dhYfS2e/SkESR75rZFR2tZbL2ClKmGTYVHoLpliAtCUepqV9kUpw
FkAA5PXosa0xewXCUg6Lvhac7Urh37OrnLtedIe4fa7qGGqB2U3CavB6W6ojQhKJ
Brgk1/wSVhag3vVCCplwscB5jpOly6azUbs2mcYhdKZ5zWzTkPL1F/KZkZSnkZU6
LpZPk2/Lltko2TUviSCwDJwVzWqqRMlvz7OyXv1tVw53yFP1Fr7tNTxEe4XOnbxG
8pbLqBjwHp8Iyerh0JSJ21quPJSVIfgTKjWbMTjrH4yh/FdzUkeAktvwZT4LDEMx
uKFH8FQPrp/oqQ4wc49gpxLGqNCYjK51Hk3ceym47d1xcDww8yFeaana5D3VERmF
CCuJfqkUxFmeFle8TGBHlmvVrb8K/W1WZPC/dcmMAEQSq07fIlZYvUHTHO8pGOkp
ZZqNtbLyH2yIRR9FuzlrpmEqJPZMGNYySthcEXrYzVKDDDbVaII=
=XDan
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"Stable bugfixes:
- Fix O_DIRECT accounting of number of bytes read/written # v4.1+
Other fixes:
- Fix nfsi->nrequests count error on nfs_inode_remove_request()
- Remove redundant mirror tracking in O_DIRECT
- Fix leak of clp->cl_acceptor string
- Fix race to sk_err after xs_error_report"
* tag 'nfs-for-5.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: fix race to sk_err after xs_error_report
NFSv4: Fix leak of clp->cl_acceptor string
NFS: Remove redundant mirror tracking in O_DIRECT
NFS: Fix O_DIRECT accounting of number of bytes read/written
nfs: Fix nfsi->nrequests count error on nfs_inode_remove_request
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl2gzCMACgkQiiy9cAdy
T1F5aAv+M5OLmFX6OOuCQNYWqku8N9K78C3+bvdpDINYceozhPgKX1/+RO3z8rcR
v+cK0b+R6tXVmpTbyo1t7QW3UVzHiJLEaGKjafbHNBjOM87wYkW7D9Uc6u6GSYDp
7cgXqmMDM+26FftBwxaxLjbjY6YAN/3V66j5Zt6I+lTb+8YC2DO1ihprwaMmGAm2
hDGYpvrW9KPjTB23+dC5Z9/YfG6EGXOvXl4z20JajCVoKD/70KdpDwVLNo2JuPsL
hsgV38xmUd7PtFXeFtStPjN9VpNOf4sp29xmWnjJVGaVx4hnhzcyiKmiW2o5VkKM
14PEUd/N26/+E6VXuY1eP6L8O9EwkZISNZUXWBI/zyyiVedBfSJVeSXTU4NyqB0g
B7WMnWYbJ73AeOsq9iVrcribrRPDtAcyK5UKMWEkrYhB/BIylg1vWUhn6B5bPqv0
3Q9TR+pSmEifkAk3ZSu5CjKNEgpNj6lpFoIcmkdp3Gg/qzeWvcFUTJTnxZZbxSqP
qtmZ3d8z
=mXpP
-----END PGP SIGNATURE-----
Merge tag '5.4-rc2-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Eight small SMB3 fixes, four for stable, and important fix for the
recent regression introduced by filesystem timestamp range patches"
* tag '5.4-rc2-smb3' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Force reval dentry if LOOKUP_REVAL flag is set
CIFS: Force revalidate inode when dentry is stale
smb3: Fix regression in time handling
smb3: remove noisy debug message and minor cleanup
CIFS: Gracefully handle QueryInfo errors during open
cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic
fs: cifs: mute -Wunused-const-variable message
smb3: cleanup some recent endian errors spotted by updated sparse
In btrfs_read_block_groups(), if we have an invalid block group which
has mixed type (DATA|METADATA) while the fs doesn't have MIXED_GROUPS
feature, we error out without freeing the block group cache.
This patch will add the missing btrfs_put_block_group() to prevent
memory leak.
Note for stable backports: the file to patch in versions <= 5.3 is
fs/btrfs/extent-tree.c
Fixes: 49303381f1 ("Btrfs: bail out if block group has different mixed flag")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we error out when finding a page at relocate_file_extent_cluster(), we
need to release the outstanding extents counter on the relocation inode,
set by the previous call to btrfs_delalloc_reserve_metadata(), otherwise
the inode's block reserve size can never decrease to zero and metadata
space is leaked. Therefore add a call to btrfs_delalloc_release_extents()
in case we can't find the target page.
Fixes: 8b62f87bad ("Btrfs: rework outstanding_extents")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two ways a request can be deferred:
1) It's a regular request that depends on another one
2) It's a timeout that tracks completions
We have a shared helper to determine whether to defer, and that
attempts to make the right decision based on the request. But we
only have some of this information in the caller. Un-share the
two timeout/defer helpers so the caller can use the right one.
Fixes: 5262f56798 ("io_uring: IORING_OP_TIMEOUT support")
Reported-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Fix a rounding error in the fallocate code
- Minor code cleanups
- Make sure to zero memory buffers before formatting metadata blocks
- Fix a few places where we forgot to log an inode metadata update
- Remove broken error handling that tried to clean up after a failure
but still got it wrong
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl2eA4EACgkQ+H93GTRK
tOsosxAAgOgEvFrIZQDnzio+yQ9OysiFWCiiT3b+91f8v1v3MNBXP2EEyLA4zZEK
TlgaFvNCoTO2Q16r+PKCSDFVBwEw/8t2YEylS6X6/GKBUGc/BDBZ5ROZdzQH3Wng
a98M7GaQektTUqQuKlv3fHiL1Fucp0FonCG3wiuNw6/vzH6RxxwOTTkd2F67Zkn6
9N7bRoEPh7d6h1Ah7myf/ONA1f3mGEiFuvyb3/KCZPr1WKDFFflzhUboJnMq8Br5
N55Bq3uoc4+btS4eVxr3XjEXD1zImBiXq7gcEoCRDZNfv+/2VcaDYRkXQu40NbOI
psH+1xy9lQvLSSbCm6zI1enpfVAm3qxUVktp9G4i4dKtjJQL7HzuvW7ckdzcRaav
QKbkTmgGQFFI5yLpNHVN+zs+exdkwG6whNyY3Frt5yNh8bH8yRZxrb7YMzfIkNOl
8O1Tl0ZFGzJtjggXPhAKCoGz2yz8oO2G2JzejfWvdyrku2heyGLktnm98Uk/Sx0g
90hHTC8KYfLm8r0PuH+lv5c/AwTDr4prJO2wQkU2ZDCCLxXjNsHa3vuxMDmK6PRV
Y+ryOP6OcyQkB+BkY3/HOlxjen71Z91hdMFNVajRmVwNuFzHCK3yHhsMitBPDEnx
PnotQQihRmVKP8hw7/3zPfLoDEwO+hPa19JbqUFLXp7xl/OzF6A=
=MFMd
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.4-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"A couple of small code cleanups and bug fixes for rounding errors,
metadata logging errors, and an extra layer of safeguards against
leaking memory contents.
- Fix a rounding error in the fallocate code
- Minor code cleanups
- Make sure to zero memory buffers before formatting metadata blocks
- Fix a few places where we forgot to log an inode metadata update
- Remove broken error handling that tried to clean up after a failure
but still got it wrong"
* tag 'xfs-5.4-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: move local to extent inode logging into bmap helper
xfs: remove broken error handling on failed attr sf to leaf change
xfs: log the inode on directory sf to block format change
xfs: assure zeroed memory buffers for certain kmem allocations
xfs: removed unused error variable from xchk_refcountbt_rec
xfs: remove unused flags arg from xfs_get_aghdr_buf()
xfs: Fix tail rounding in xfs_alloc_file_space()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2fQkoACgkQxWXV+ddt
WDsXgA/+OpORcroqswa5V/AB5NgSMv08QfBtL7en7cUA2cGbrTt0lAoixIesCeGA
7y4umlx7zWDhrG0NFv8E21xxkMzK72/YQnN0C6ZwRCi+ZbPSDlpMCwSNV9b6oVt4
t9mJQ2kNZ80wj9W7jtoyiiWZ2OBccWywKBYr+BXybha5BliSd1XbpWMv90lODQHc
1oH1FOphPAf+nSEtW4g0BV8UHy1n1+7TqjEHDilj0TuHlO0MHsR2FQcsRnMBv5H/
CcEH3+870pUOjvm/l1OAdIrzrnH6UsXKHnOmJpYZIAQdG4xtHq/d6O9sfQMZK/Af
VMXuju558kGOvrYdgffYa0HuW3P6c8q5BeEtGAZwjGsQS5GxmW63et/x+HIEl4RU
4SWMfD592GK1/yQn31NWGav7Olkr4L0Eh9iqfKk2nWayTV+pqs8NgkGumkoNB66n
WJs2m2WwKvZzj1Ys9ilRzo17qt2U/m4eLQtbut3m5eYCpzvo38PDrqOVqKTZG/dc
nG1JBJNk+yjV+RkOO0gnQ8/zzooEGvMhVIx5iq52tWjvxnvOGSVv9QAAkKrbMoOX
kn/b3heL57Z+kilUU3Zd0GvVif+awIgOj+PmAevR+w3MLoJ2gLfWb7qcONdHyFPk
jK7rsuP737fyHpZ1tvJyfTt4Lk/u1UbApKrO4QIku1r9IJmmu20=
=zpDj
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more stabitly fixes, one build warning fix.
- fix inode allocation under NOFS context
- fix leak in fiemap due to concurrent append writes
- fix log-root tree updates
- fix balance convert of single profile on 32bit architectures
- silence false positive warning on old GCCs (code moved in rc1)"
* tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: silence maybe-uninitialized warning in clone_range
btrfs: fix uninitialized ret in ref-verify
btrfs: allocate new inode in NOFS context
btrfs: fix balance convert to single on 32-bit host CPUs
btrfs: fix incorrect updating of log root tree
Btrfs: fix memory leak due to concurrent append writes with fiemap
Pull dcache_readdir() fixes from Al Viro:
"The couple of patches you'd been OK with; no hlist conversion yet, and
cursors are still in the list of children"
[ Al is referring to future work to avoid some nasty O(n**2) behavior
with the readdir cursors when you have lots of concurrent readdirs.
This is just a fix for a race with a trivial cleanup - Linus ]
* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
libfs: take cursors out of list when moving past the end of directory
Fix the locking in dcache_readdir() and friends
Pull mount fixes from Al Viro:
"A couple of regressions from the mount series"
* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: add missing blkdev_put() in get_tree_bdev()
shmem: fix LSM options parsing
We should not remove the workqueue, we just need to ensure that the
workqueues are synced. The workqueues are torn down on ctx removal.
Cc: stable@vger.kernel.org
Fixes: 6b06314c47 ("io_uring: add file set registration")
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The callers of xfs_bmap_local_to_extents_empty() log the inode
external to the function, yet this function is where the on-disk
format value is updated. Push the inode logging down into the
function itself to help prevent future mistakes.
Note that internal bmap callers track the inode logging flags
independently and thus may log the inode core twice due to this
change. This is harmless, so leave this code around for consistency
with the other attr fork conversion functions.
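The shape of the change can be sketched as follows (not verbatim kernel code;
xfs_trans_log_inode()/XFS_ILOG_CORE are the standard inode logging calls):
void
xfs_bmap_local_to_extents_empty(
	struct xfs_trans	*tp,	/* passed in so the helper can log */
	struct xfs_inode	*ip,
	int			whichfork)
{
	/* ... assertions and fork bookkeeping elided ... */

	/* flip the fork's on-disk format from local to extents here */

	/* log the core so the format change can never escape the transaction */
	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
}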
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_attr_shortform_to_leaf() attempts to put the shortform fork back
together after a failed attempt to convert from shortform to leaf
format. While this code reallocates and copies back the shortform
attr fork data, it never resets the inode format field back to local
format. Further, now that the inode is properly logged after the
initial switch from local format, any error that triggers the
recovery code will eventually abort the transaction and shutdown the
fs. Therefore, remove the broken and unnecessary error handling
code.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When a directory changes from shortform (sf) to block format, the sf
format is copied to a temporary buffer, the inode format is modified
and the updated format filled with the dentries from the temporary
buffer. If the inode format is modified and the attempt to grow the
inode fails (due to an I/O error, for example), it is possible to
return an error while leaving the directory in an inconsistent state
and with an otherwise clean transaction. This results in corruption
of the associated directory and leads to xfs_dabuf_map() errors as
subsequent lookups cannot accurately determine the format of the
directory. This problem is reproduced occasionally by generic/475.
The fundamental problem is that xfs_dir2_sf_to_block() changes the
on-disk inode format without logging the inode. The inode is
eventually logged by the bmapi layer in the common case, but error
checking introduces the possibility of failing the high level
request before this happens.
Update both of the dir2 and attr callers of
xfs_bmap_local_to_extents_empty() to log the inode core as
consistent with the bmap local to extent format change codepath.
This ensures that any subsequent errors after the format has changed
cause the transaction to abort.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We no longer need the extra mirror length tracking in the O_DIRECT code,
as we are able to track the maximum contiguous length in dreq->max_count.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When a series of O_DIRECT reads or writes are truncated, either due to
eof or due to an error, then we should return the number of contiguous
bytes that were received/sent starting at the offset specified by the
application.
Currently, we are failing to correctly check contiguity, and so we're
failing the generic/465 in xfstests when the race between the read
and write RPCs causes the file to get extended while the 2 reads are
outstanding. If the first read RPC call wins the race and returns with
eof set, we should treat the second read RPC as being truncated.
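The accounting idea can be shown with a simplified standalone model (not the
NFS code itself, and ignoring out-of-order gaps): a short reply caps the
maximum contiguous run, and later replies cannot grow the count past it.
#include <stdint.h>

struct dio_acct {
	uint64_t start;		/* offset the application asked for */
	uint64_t max_count;	/* bytes we may still claim as contiguous */
	uint64_t count;		/* contiguous bytes accounted so far */
};

/* off is assumed to be >= a->start (replies for this request only) */
static void dio_acct_reply(struct dio_acct *a, uint64_t off, uint64_t bytes,
			   int eof_or_error)
{
	uint64_t end = off - a->start + bytes;

	if (eof_or_error && end < a->max_count)
		a->max_count = end;	/* truncate the run at the short reply */
	if (end > a->max_count)
		end = a->max_count;
	if (end > a->count)
		a->count = end;
}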
Reported-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
Fixes: 1ccbad9f9f ("nfs: fix DIO good bytes calculation")
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Mark inode for force revalidation if LOOKUP_REVAL flag is set.
This tells the client to actually send a QueryInfo request to
the server to obtain the latest metadata in case a directory
or a file were changed remotely. Only do that if the client
doesn't have a lease for the file to avoid unneeded round
trips to the server.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently the client indicates that a dentry is stale when inode
numbers or types of a local inode and a remote file don't match.
If this is the case, attributes are not copied from remote to local,
so it is already known that the local copy has stale metadata.
That's why the inode needs to be marked for revalidation, in order
to tell the VFS to look up the dentry again before opening a file.
This prevents unexpected stale errors from being returned to user
space when opening a file.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fixes: cb7a69e605 ("cifs: Initialize filesystem timestamp ranges")
Only very old servers (e.g. OS/2 and DOS) did not support
DCE TIME (100 nanosecond granularity). Fix the checks used
to set minimum and maximum times.
Fixes xfstest generic/258 (on 5.4-rc1 and later)
CC: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
The message was intended only for a temporary developer build.
In addition, clean up two minor warnings noticed by Coverity
and make a trivial change to work around a sparse warning.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
GCC throws warning message as below:
‘clone_src_i_size’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
^
fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here
u64 clone_src_i_size;
^
The clone_src_i_size is only used as call-by-reference
in a call to get_inode_info().
Silence the warning by initializing clone_src_i_size to 0.
Note that the warning is a false positive, reported by older versions
of GCC (e.g. 7.x) but not by e.g. 9.x. As numerous people have reported it,
the patch is applied. Setting clone_src_i_size to 0 does not otherwise make
sense and would have no effect should the code change in the future.
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
Any changes interesting to tasks waiting in io_cqring_wait() are
committed with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs()
also tries to do that, but for no reason, which means spurious wakeups
on every io_free_req() and io_uring_enter().
Just use percpu_ref_put() instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge misc fixes from Andrew Morton:
"The usual shower of hotfixes.
Chris's memcg patches aren't actually fixes - they're mature but a few
niggling review issues were late to arrive.
The ocfs2 fixes are quite old - those took some time to get reviewer
attention.
Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg,
mm/slab-generic"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
mm, sl[ou]b: improve memory accounting
mm, memcg: make scan aggression always exclude protection
mm, memcg: make memory.emin the baseline for utilisation determination
mm, memcg: proportional memory.{low,min} reclaim
mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
mm/page_alloc.c: fix a crash in free_pages_prepare()
mm/z3fold.c: claim page in the beginning of free
kernel/sysctl.c: do not override max_threads provided by userspace
memcg: only record foreign writebacks with dirty pages when memcg is not disabled
mm: fix -Wmissing-prototypes warnings
writeback: fix use-after-free in finish_writeback_work()
mm/memremap: drop unused SECTION_SIZE and SECTION_MASK
panic: ensure preemption is disabled during panic()
fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
ocfs2: clear zero in unaligned direct IO
In ocfs2_info_scan_inode_alloc(), there is an if statement on line 283
to check whether inode_alloc is NULL:
if (inode_alloc)
When inode_alloc is NULL, it is used on line 287:
ocfs2_inode_lock(inode_alloc, &bh, 0);
ocfs2_inode_lock_full_nested(inode, ...)
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
Thus, a possible null-pointer dereference may occur.
To fix this bug, inode_alloc is checked on line 286.
This bug is found by a static analysis tool STCheck written by us.
Link: http://lkml.kernel.org/r/20190726033717.32359-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_write_end_nolock(), there are if statements on lines 1976,
2047 and 2058 to check whether handle is NULL:
if (handle)
When handle is NULL, it is used on line 2045:
ocfs2_update_inode_fsync_trans(handle, inode, 1);
oi->i_sync_tid = handle->h_transaction->t_tid;
Thus, a possible null-pointer dereference may occur.
To fix this bug, handle is checked before calling
ocfs2_update_inode_fsync_trans().
This bug is found by a static analysis tool STCheck written by us.
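The guarded call described above is simply:
/* only touch the transaction bookkeeping when a handle exists */
if (handle)
	ocfs2_update_inode_fsync_trans(handle, inode, 1);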
Link: http://lkml.kernel.org/r/20190726033705.32307-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_xa_prepare_entry(), there is an if statement on line 2136 to
check whether loc->xl_entry is NULL:
if (loc->xl_entry)
When loc->xl_entry is NULL, it is used on line 2158:
ocfs2_xa_add_entry(loc, name_hash);
loc->xl_entry->xe_name_hash = cpu_to_le32(name_hash);
loc->xl_entry->xe_name_offset = cpu_to_le16(loc->xl_size);
and line 2164:
ocfs2_xa_add_namevalue(loc, xi);
loc->xl_entry->xe_value_size = cpu_to_le64(xi->xi_value_len);
loc->xl_entry->xe_name_len = xi->xi_name_len;
Thus, possible null-pointer dereferences may occur.
To fix these bugs, if loc->xl_entry is NULL, ocfs2_xa_prepare_entry()
bails out and returns -EINVAL.
These bugs are found by a static analysis tool STCheck written by us.
[akpm@linux-foundation.org: remove now-unused ocfs2_xa_add_entry()]
Link: http://lkml.kernel.org/r/20190726101447.9153-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unused portion of a partially-written fs-block-sized block is not set to
zero in an unaligned append direct write. This can lead to serious data
inconsistencies.
Ocfs2 manages the disk with a cluster size (for example, 1M); a partial write
in one cluster changes the cluster state from UNWRITTEN to WRITTEN. The VFS
(function dio_zero_block) doesn't do the zeroing, because the bh's state is
not set to NEW in ocfs2_dio_wr_get_block when we write a WRITTEN cluster.
For example, if the cluster size is 1M, the file size is 8k and we direct
write from 14k to 15k, then 12k~14k and 15k~16k will contain dirty data.
We have to deal with two cases:
1. The starting position of the direct write is outside the file.
2. The starting position of the direct write is located within the file.
We need to set the bh's state to NEW in the first case. In the second case,
we need to map twice, because the bh's state for the area outside the file
should be set to NEW, while the area inside the file should not.
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com
Signed-off-by: Jia Guo <guojia12@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 9f79b78ef7 ("Convert filldir[64]() from __put_user() to
unsafe_put_user()") I made filldir() use unsafe_put_user(), which
improves code generation on x86 enormously.
But because we didn't have a "unsafe_copy_to_user()", the dirent name
copy was also done by hand with unsafe_put_user() in a loop, and it
turns out that a lot of other architectures didn't like that, because
unlike x86, they have various alignment issues.
Most non-x86 architectures trap and fix it up, and some (like xtensa)
will just fail unaligned put_user() accesses unconditionally. Which
makes that "copy using put_user() in a loop" not work for them at all.
I could make that code do explicit alignment etc, but the architectures
that don't like unaligned accesses also don't really use the fancy
"user_access_begin/end()" model, so they might just use the regular old
__copy_to_user() interface.
So this commit takes that looping implementation, turns it into the x86
version of "unsafe_copy_to_user()", and makes other architectures
implement the unsafe copy version as __copy_to_user() (the same way they
do for the other unsafe_xyz() accessor functions).
Note that it only does this for the copying _to_ user space, and we
still don't have a unsafe version of copy_from_user().
That's partly because we have no current users of it, but also partly
because the copy_from_user() case is slightly different and cannot
efficiently be implemented in terms of a unsafe_get_user() loop (because
gcc can't do asm goto with outputs).
It would be trivial to do this using "rep movsb", which would work
really nicely on newer x86 cores, but really badly on some older ones.
Al Viro is looking at cleaning up all our user copy routines to make
this all a non-issue, but for now we have this simple-but-stupid version
for x86 that works fine for the dirent name copy case because those
names are short strings and we simply don't need anything fancier.
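The generic fallback described above has roughly this shape (a sketch in the
style of include/linux/uaccess.h, using the unsafe_op_wrap() pattern the
other unsafe_*() accessors already follow):
#ifndef unsafe_copy_to_user
#define unsafe_copy_to_user(dst, src, len, label)			\
	unsafe_op_wrap(__copy_to_user(dst, src, len), label)
#endif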
Fixes: 9f79b78ef7 ("Convert filldir[64]() from __put_user() to unsafe_put_user()")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-and-tested-by: Tony Luck <tony.luck@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, if the client identifies problems when processing
metadata returned in the CREATE response, the open handle is
leaked. This causes multiple problems, like the file missing a lease
break by that client, which causes high latencies for other clients
accessing the file. Another side-effect of this is that the file
can't be deleted.
Fix this by closing the file after the client hits an error after
the file was opened and the open descriptor wasn't returned to
the user space. Also convert -ESTALE to -EOPENSTALE to allow
the VFS to revalidate a dentry and retry the open.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Since the 'Initial git repository build' commit, the
'mapping_table_ERRHRD' variable has not been used.
So the 'mapping_table_ERRHRD' const variable can be removed
to mute the warning message below:
fs/cifs/netmisc.c:120:40: warning: unused variable 'mapping_table_ERRHRD' [-Wunused-const-variable]
static const struct smb_to_posix_error mapping_table_ERRHRD[] = {
^
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Now that sparse has been fixed, it spotted a couple of recent minor
endian errors (and removed one additional sparse warning).
Thanks to Luc Van Oostenryck for his help fixing sparse.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Guarantee zeroed memory buffers for cases where potential memory
leak to disk can occur. In these cases, kmem_alloc is used and
doesn't zero the buffer, opening the possibility of information
leakage to disk.
Use existing infrastructure (xfs_buf_allocate_memory) to obtain
the already zeroed buffer from kernel memory.
This solution avoids the performance issue that would occur if a
wholesale change to replace kmem_alloc with kmem_zalloc was done.
Signed-off-by: Bill O'Donnell <billodo@redhat.com>
[darrick: fix bitwise complaint about kmflag_mask]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove the unused error variable. Instead of using the error variable,
return the value directly, as it was never updated.
Signed-off-by: Aliasgar Surti <aliasgar.surti500@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The flags arg is always passed as zero, so remove it.
(xfs_buf_get_uncached takes flags to support XBF_NO_IOACCT for
the sb, but that should never be relevant for xfs_get_aghdr_buf)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
To ensure that all blocks touched by the range [offset, offset + count)
are allocated, we need to calculate the block count from the difference
of the range end (rounded up) and the range start (rounded down).
Before this patch, we just round up the byte count, which may lead to
unaligned ranges not being fully allocated:
$ touch test_file
$ block_size=$(stat -fc '%S' test_file)
$ fallocate -o $((block_size / 2)) -l $block_size test_file
$ xfs_bmap test_file
test_file:
0: [0..7]: 1396264..1396271
1: [8..15]: hole
There should not be a hole there. Instead, the first two blocks should
be fully allocated.
With this patch applied, the result is something like this:
$ touch test_file
$ block_size=$(stat -fc '%S' test_file)
$ fallocate -o $((block_size / 2)) -l $block_size test_file
$ xfs_bmap test_file
test_file:
0: [0..15]: 11024..11039
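For reference, the corrected block count calculation is roughly the following
sketch, using the usual XFS byte-to-filesystem-block helpers (round the end
up, the start down, and allocate the difference):
xfs_fileoff_t	startoffset_fsb = XFS_B_TO_FSBT(mp, offset);	  /* round down */
xfs_fileoff_t	endoffset_fsb   = XFS_B_TO_FSB(mp, offset + len); /* round up  */
xfs_filblks_t	allocatesize_fsb = endoffset_fsb - startoffset_fsb;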
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In commit 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map") we
changed elf to use MAP_FIXED_NOREPLACE instead of MAP_FIXED for the
executable mappings.
Then, people reported that it broke some binaries that had overlapping
segments from the same file, and commit ad55eac74f ("elf: enforce
MAP_FIXED on overlaying elf segments") re-instated MAP_FIXED for some
overlaying elf segment cases. But only some - despite the summary line
of that commit, it only did it when it also does a temporary brk vma for
one obvious overlapping case.
Now Russell King reports another overlapping case with old 32-bit x86
binaries, which doesn't trigger that limited case. End result: we had
better just drop MAP_FIXED_NOREPLACE entirely, and go back to MAP_FIXED.
Yes, it's a sign of old binaries generated with old tool-chains, but we
do pride ourselves on not breaking existing setups.
This still leaves MAP_FIXED_NOREPLACE in place for the load_elf_interp()
and the old load_elf_library() use-cases, because nobody has reported
breakage for those. Yet.
Note that in all the cases seen so far, the overlapping elf sections
seem to be just re-mapping of the same executable with different section
attributes. We could possibly introduce a new MAP_FIXED_NOFILECHANGE
flag or similar, which acts like NOREPLACE, but allows just remapping
the same executable file using different protection flags.
It's not clear that would make a huge difference to anything, but if
people really hate that "elf remaps over previous maps" behavior, maybe
at least a more limited form of remapping would alleviate some concerns.
Alternatively, we should take a look at our elf_map() logic to see if we
end up not mapping things properly the first time.
In the meantime, this is the minimal "don't do that then" patch while
people hopefully think about it more.
Reported-by: Russell King <linux@armlinux.org.uk>
Fixes: 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map")
Fixes: ad55eac74f ("elf: enforce MAP_FIXED on overlaying elf segments")
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes getdents() and getdents64() do sanity checking on the
pathname that it gives to user space. And to mitigate the performance
impact of that, it first cleans up the way it does the user copying, so
that the code avoids doing the SMAP/PAN updates between each part of the
dirent structure write.
I really wanted to do this during the merge window, but didn't have
time. The conversion of filldir to unsafe_put_user() is something I've
had around for years now in a private branch, but the extra pathname
checking finally made me clean it up to the point where it is mergable.
It's worth noting that the filename validity checking really should be a
bit smarter: it would be much better to delay the error reporting until
the end of the readdir, so that non-corrupted filenames are still
returned. But that involves bigger changes, so let's see if anybody
actually hits the corrupt directory entry case before worrying about it
further.
* branch 'readdir':
Make filldir[64]() verify the directory entry filename is valid
Convert filldir[64]() from __put_user() to unsafe_put_user()
This has been discussed several times, and now filesystem people are
talking about doing it individually at the filesystem layer, so head
that off at the pass and just do it in getdents{64}().
This is partially based on a patch by Jann Horn, but checks for NUL
bytes as well, and somewhat simplified.
There's also commentary about how it might be better if invalid names
due to filesystem corruption don't cause an immediate failure, but only
an error at the end of the readdir(), so that people can still see the
filenames that are ok.
There's also been discussion about just how much POSIX strictly speaking
requires this since it's about filesystem corruption. It's really more
"protect user space from bad behavior" as pointed out by Jann. But
since Eric Biederman looked up the POSIX wording, here it is for context:
"From readdir:
The readdir() function shall return a pointer to a structure
representing the directory entry at the current position in the
directory stream specified by the argument dirp, and position the
directory stream at the next entry. It shall return a null pointer
upon reaching the end of the directory stream. The structure dirent
defined in the <dirent.h> header describes a directory entry.
From definitions:
3.129 Directory Entry (or Link)
An object that associates a filename with a file. Several directory
entries can associate names with the same file.
...
3.169 Filename
A name consisting of 1 to {NAME_MAX} bytes used to name a file. The
characters composing the name may be selected from the set of all
character values excluding the slash character and the null byte. The
filenames dot and dot-dot have special meaning. A filename is
sometimes referred to as a 'pathname component'."
Note that I didn't bother adding the checks to any legacy interfaces
that nobody uses.
Also note that if this ends up being noticeable as a performance
regression, we can fix that to do a much more optimized model that
checks for both NUL and '/' at the same time one word at a time.
We haven't really tended to optimize 'memchr()', and it only checks for
one pattern at a time anyway, and we really _should_ check for NUL too
(but see the comment about "soft errors" in the code about why it
currently only checks for '/')
See the CONFIG_DCACHE_WORD_ACCESS case of hash_name() for how the name
lookup code looks for pathname terminating characters in parallel.
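The check itself is small; a sketch close in spirit to the fs/readdir.c
helper (rejecting NUL as well is the "soft error" question discussed above):
static int verify_dirent_name(const char *name, int len)
{
	if (len <= 0 || len >= PATH_MAX)
		return -EIO;			/* impossible length */
	if (memchr(name, '/', len))
		return -EIO;			/* '/' can never appear in a filename */
	return 0;
}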
Link: https://lore.kernel.org/lkml/20190118161440.220134-2-jannh@google.com/
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We really should avoid the "__{get,put}_user()" functions entirely,
because they can easily be mis-used and the original intent of being
used for simple direct user accesses no longer holds in a post-SMAP/PAN
world.
Manually optimizing away the user access range check makes no sense any
more, when the range check is generally much cheaper than the "enable
user accesses" code that the __{get,put}_user() functions still need.
So instead of __put_user(), use the unsafe_put_user() interface with
user_access_{begin,end}() that really does generate better code these
days, and which is generally a nicer interface. Under some loads, the
multiple user writes that filldir() does are actually quite noticeable.
This also makes the dirent name copy use unsafe_put_user() with a couple
of macros. We do not want to make function calls with SMAP/PAN
disabled, and the code this generates is quite good when the
architecture uses "asm goto" for unsafe_put_user() like x86 does.
Note that this doesn't bother with the legacy cases. Nobody should use
them anyway, so performance doesn't really matter there.
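The resulting pattern looks roughly like this fragment (schematic, not the
full filldir64(); field order and the fault labels are illustrative): one
user_access_begin()/user_access_end() bracket around all the stores, with
unsafe_put_user() jumping to a local label on fault instead of toggling
SMAP/PAN per field.
if (!user_access_begin(dirent, reclen))
	goto efault;
unsafe_put_user(ino, &dirent->d_ino, efault_end);
unsafe_put_user(reclen, &dirent->d_reclen, efault_end);
unsafe_put_user(d_type, &dirent->d_type, efault_end);
/* the name bytes are stored the same way, via unsafe_put_user() helpers */
user_access_end();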
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2WrkYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjf1D/9wy2L/yAXA/KLQkwDZ2kn3tMtzMrsv6bOJ
4UTdlLWKH2ihGg69iAE5f19iQSeNmqqXxBz0VvIBpFR7TqcRbGe0d9iVtoOsDhpZ
c1OvQ2Ey35Xx1T2w6uMh0llgeWY2J/gJkY64unUxwUBUZwNOPA8ZjxqeXcMQmyAt
sYmXpNLCT/f4YTYOYDgzoh5960TsCB/H+m/bLEVRvr0MaonvlaBUKRcysQikDFCQ
yobzDmlSJqGKqlFJ2fnRVSkJC0BmBE5p8Ric9HHiUOT8BO31079IHUGbkbSh/csH
0yPipNaYNMv+Hr0t9pgfcNbAt2weMK5HFgtpQwv8Frl4xjvBSWDS5fQesCVDjkZt
+ROeOvQtjfeKtLy5PCu6BJwYpu8iYG9eGF8zxBQ4FBHM3tghcVhqssaNbfrVOW+u
YXYbAuLMkLwKlmJ+6WBiVIMefyF59ue3+UJGECiCrj/BrgxUyw8HcGKwpKEAZSok
VFGDukL0Y3flnoO/gyOf0GFaD5Uovr1sx82DCz05B/XEMfkqFMJRGkbyZBarJL69
9QrnyGpF4rwtfg+usR1PmJ+9/oY/ypSk8N9MAIkoK9e1YIexxvBiXAf0k8AxuDyC
uPuOiQgKcqUr3aF+ivao8dQB9NiK1bJGc4pqBPPN4ZYRSjMSfBT/cms4IeUyj0K6
sokcB1p+CQ==
=vKVl
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2019-10-03' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Mandate timespec64 for the io_uring timeout ABI (Arnd)
- Set of NVMe changes via Sagi:
- controller removal race fix from Balbir
- quirk additions from Gabriel and Jian-Hong
- nvme-pci power state save fix from Mario
- Add 64bit user commands (for 64bit registers) from Marta
- nvme-rdma/nvme-tcp fixes from Max, Mark and Me
- Minor cleanups and nits from James, Dan and John
- Two s390 dasd fixes (Jan, Stefan)
- Have loop change block size in DIO mode (Martijn)
- paride pg header ifdef guard (Masahiro)
- Two blk-mq queue scheduler tweaks, fixing an ordering issue on zoned
devices and suboptimal performance on others (Ming)
* tag 'for-linus-2019-10-03' of git://git.kernel.dk/linux-block: (22 commits)
block: sed-opal: fix sparse warning: convert __be64 data
block: sed-opal: fix sparse warning: obsolete array init.
block: pg: add header include guard
Revert "s390/dasd: Add discard support for ESE volumes"
s390/dasd: Fix error handling during online processing
io_uring: use __kernel_timespec in timeout ABI
loop: change queue block size to match when using DIO
blk-mq: apply normal plugging for HDD
blk-mq: honor IO scheduler for multiqueue devices
nvme-rdma: fix possible use-after-free in connect timeout
nvme: Move ctrl sqsize to generic space
nvme: Add ctrl attributes for queue_count and sqsize
nvme: allow 64-bit results in passthru commands
nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
nvmet-tcp: remove superflous check on request sgl
Added QUIRKs for ADATA XPG SX8200 Pro 512GB
nvme-rdma: Fix max_hw_sectors calculation
nvme: fix an error code in nvme_init_subsystem()
nvme-pci: Save PCI state before putting drive into deepest state
nvme-tcp: fix wrong stop condition in io_work
...
Today, put_compat_statfs64() disallows nearly any field value over
2^32 if f_bsize is only 32 bits, but that makes no sense.
compat_statfs64 is there for the explicit purpose of providing 64-bit
fields for f_files, f_ffree, etc. And f_bsize is always only 32 bits.
As a result, 32-bit userspace gets -EOVERFLOW for e.g. large file
counts even with -D_FILE_OFFSET_BITS=64 set.
In reality, only f_bsize and f_frsize can legitimately overflow
(fields like f_type and f_namelen should never be large), so test
only those fields.
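A sketch of the reduced check (only the two fields that are 32-bit in
compat_statfs64 and can legitimately exceed 32 bits are tested; the rest of
the copy-out is elided):
static int put_compat_statfs64(struct compat_statfs64 __user *ubuf,
			       struct kstatfs *kbuf)
{
	if ((kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
		return -EOVERFLOW;

	/* ... copy the (now known-safe) fields to userspace ... */
	return 0;
}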
This bug was discussed at length some time ago, and this is the proposal
Al suggested at https://lkml.org/lkml/2018/8/6/640. It seemed to get
dropped amid the discussion of other related changes, but this
part seems obviously correct on its own, so I've picked it up and
sent it, for expediency.
Fixes: 64d2ab32ef ("vfs: fix put_compat_statfs64() does not handle errors")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Coverity caught a case where we could return with an uninitialized value
in ret in process_leaf. This is actually pretty likely, because we could
very easily run into a block group item key, have a garbage value in
ret and think there was an error. Fix this by initializing ret to 0.
Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: fd708b81d9 ("Btrfs: add a extent ref verify tool")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the command:
btrfs balance start -dconvert=single,soft .
on a Raspberry Pi produces the following kernel message:
BTRFS error (device mmcblk0p2): balance: invalid convert data profile single
This fails because we use is_power_of_2(unsigned long) to validate
the new data profile, the constant for 'single' profile uses bit 48,
and there are only 32 bits in a long on ARM.
Fix by open-coding the check using u64 variables.
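The open-coded check boils down to the classic bit trick on a full u64, so
the bit-48 'single' constant mentioned above is not truncated on 32-bit
longs (standalone sketch):
#include <stdbool.h>
#include <stdint.h>

static inline bool u64_is_power_of_two(uint64_t n)
{
	/* true only when exactly one bit is set */
	return n != 0 && (n & (n - 1)) == 0;
}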
Tested by completing the original balance command on several Raspberry
Pis.
Fixes: 818255feec ("btrfs: use common helper instead of open coding a bit test")
CC: stable@vger.kernel.org # 4.20+
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've historically had reports of being unable to mount file systems
because the tree log root couldn't be read. Usually this is the "parent
transid failure", but could be any of the related errors, including
"fsid mismatch" or "bad tree block", depending on which block got
allocated.
The modification of the individual log root items are serialized on the
per-log root root_mutex. This means that any modification to the
per-subvol log root_item is completely protected.
However we update the root item in the log root tree outside of the log
root tree log_mutex. We do this in order to allow multiple subvolumes
to be updated in each log transaction.
This is problematic however because when we are writing the log root
tree out we update the super block with the _current_ log root node
information. Since these two operations happen independently of each
other, you can end up updating the log root tree in between writing out
the dirty blocks and setting the super block to point at the current
root.
This means we'll point at the new root node that hasn't been written
out, instead of the one we should be pointing at. Thus whatever garbage
or old block we end up pointing at complains when we mount the file
system later and try to replay the log.
Fix this by copying the log's root item into a local root item copy.
Then once we're safely under the log_root_tree->log_mutex we update the
root item in the log_root_tree. This way we do not modify the
log_root_tree while we're committing it, fixing the problem.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have a buffered write that starts at an offset greater than or
equal to the file's size happening concurrently with a full ranged
fiemap, we can end up leaking an extent state structure.
Suppose we have a file with a size of 1Mb, and before the buffered write
and fiemap are performed, it has a single extent state in its io tree
representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
The following sequence diagram shows how the memory leak happens if a
fiemap and a buffered write, starting at offset 1Mb and with a length of
4Kb, are performed concurrently.
CPU 1 CPU 2
extent_fiemap()
--> it's a full ranged fiemap
range from 0 to LLONG_MAX - 1
(9223372036854775807)
--> locks range in the inode's
io tree
--> after this we have 2 extent
states in the io tree:
--> 1 for range [0, 1Mb[ with
the bits EXTENT_LOCKED and
EXTENT_DELALLOC_BITS set
--> 1 for the range
[1Mb, LLONG_MAX[ with
the EXTENT_LOCKED bit set
--> start buffered write at offset
1Mb with a length of 4Kb
btrfs_file_write_iter()
btrfs_buffered_write()
--> cached_state is NULL
lock_and_cleanup_extent_if_need()
--> returns 0 and does not lock
range because it starts
at current i_size / eof
--> cached_state remains NULL
btrfs_dirty_pages()
btrfs_set_extent_delalloc()
(...)
__set_extent_bit()
--> splits extent state for range
[1Mb, LLONG_MAX[ and now we
have 2 extent states:
--> one for the range
[1Mb, 1Mb + 4Kb[ with
EXTENT_LOCKED set
--> another one for the range
[1Mb + 4Kb, LLONG_MAX[ with
EXTENT_LOCKED set as well
--> sets EXTENT_DELALLOC on the
extent state for the range
[1Mb, 1Mb + 4Kb[
--> caches extent state
[1Mb, 1Mb + 4Kb[ into
@cached_state because it has
the bit EXTENT_LOCKED set
--> btrfs_buffered_write() ends up
with a non-NULL cached_state and
never calls anything to release its
reference on it, resulting in a
memory leak
Fix this by calling free_extent_state() on cached_state if the range was
not locked by lock_and_cleanup_extent_if_need().
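The cleanup branch in the buffered write loop then has roughly this shape
(a fragment sketching the fix, assuming free_extent_state() tolerates NULL):
if (extents_locked)
	unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
			     lockend, &cached_state);
else
	free_extent_state(cached_state);	/* drop the cached reference */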
The same issue can happen if anything else other than fiemap locks a range
that covers eof and beyond.
This could be triggered, sporadically, by test case generic/561 from the
fstests suite, which makes duperemove run concurrently with fsstress, and
duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
the leak is reported in dmesg/syslog when removing the btrfs module with
a message like the following:
[77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
CC: stable@vger.kernel.org # 4.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All system calls use struct __kernel_timespec instead of the old struct
timespec, but this one was just added with the old-style ABI. Change it
now to enforce the use of __kernel_timespec, avoiding ABI confusion and
the need for compat handlers on 32-bit architectures.
Any user space caller will have to use __kernel_timespec now, but this
is unambiguous and works for any C library regardless of the time_t
definition. A nicer way to specify the timeout would have been a less
ambiguous 64-bit nanosecond value, but I suppose it's too late now to
change that as this would impact both 32-bit and 64-bit users.
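For illustration, a hedged userspace sketch of queueing a timeout with
the new type, assuming liburing's io_uring_prep_timeout() helper (the
one-second value is arbitrary):

#include <liburing.h>

static void queue_timeout(struct io_uring *ring)
{
        struct __kernel_timespec ts = {
                .tv_sec  = 1,   /* 64-bit seconds regardless of time_t */
                .tv_nsec = 0,
        };
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        /* wake after 1 completion or the 1s timeout, whichever is first */
        io_uring_prep_timeout(sqe, &ts, 1, 0);
        io_uring_submit(ring);
}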
Fixes: 5262f56798 ("io_uring: IORING_OP_TIMEOUT support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a recent cleanup patch: the noio (bypass) chain is
handled asynchronously with respect to the submit chain, therefore
in-place I/O or pagevec cannot be applied to such pages.
Add a detailed comment for this as well.
Fixes: 97e86a858b ("staging: erofs: tidy up decompression frontend")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20190922100434.229340-1-gaoxiang25@huawei.com
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
After doing more drop_caches stress tests on
our products, I found a mistake introduced by
a very recent cleanup [1].
The current rule is that erofs_get_meta_page()
should return the page locked (although that is
mostly unnecessary for a read-only fs once pages
are PG_uptodate), and the cleanup broke that
rule, so fix that here.
[1] https://lore.kernel.org/r/20190904020912.63925-26-gaoxiang25@huawei.com
Fixes: 618f40ea02 ("erofs: use read_cache_page_gfp for erofs_get_meta_page")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20190921184355.149928-1-gaoxiang25@huawei.com
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
In case of error, the function read_mapping_page() returns
ERR_PTR() not NULL. The NULL test in the return value check
should be replaced with IS_ERR().
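The corrected pattern looks roughly like this (a sketch; the variable
names are illustrative rather than the exact erofs code):

        page = read_mapping_page(mapping, blkaddr, NULL);
        if (IS_ERR(page))       /* error is encoded in the pointer, never NULL */
                return page;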
Fixes: fe7c242357 ("erofs: use read_mapping_page instead of sb_bread")
Reviewed-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Link: https://lore.kernel.org/r/20190918083033.47780-1-weiyongjun1@huawei.com
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Merge tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A bunch of fixes that accumulated in recent weeks, mostly material for
stable.
Summary:
- fix for a regression from 5.3 that prevents using balance convert
with a single profile
- qgroup fixes: rescan race, accounting leak with multiple writers,
potential leak after io failure recovery
- fix for use after free in relocation (reported by KASAN)
- other error handling fixups"
* tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls
btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
btrfs: Fix a regression which we can't convert to SINGLE profile
btrfs: relocation: fix use-after-free on dead relocation roots
Btrfs: fix race setting up and completing qgroup rescan workers
Btrfs: fix missing error return if writeback for extent buffer never started
btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer
Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
Pull more vfs updates from Al Viro:
"A couple of misc patches"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
afs dynroot: switch to simple_dir_operations
fs/handle.c - fix up kerneldoc
Merge tag '5.4-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull more cifs updates from Steve French:
"Fixes from the recent SMB3 Test events and Storage Developer
Conference (held the last two weeks).
Here are nine smb3 patches including an important patch for debugging
traces with wireshark, with three patches marked for stable.
Additional fixes from last week to better handle some newly discovered
reparse points, and a fix to the create/mkdir path for setting the mode
more atomically (in SMB3 Create security descriptor context), and one
for path name processing are still being tested so are not included
here"
* tag '5.4-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix oplock handling for SMB 2.1+ protocols
smb3: missing ACL related flags
smb3: pass mode bits into create calls
smb3: Add missing reparse tags
CIFS: fix max ea value size
fs/cifs/sess.c: Remove set but not used variable 'capabilities'
fs/cifs/smb2pdu.c: Make SMB2_notify_init static
smb3: fix leak in "open on server" perf counter
smb3: allow decryption keys to be dumped by admin for debugging
Merge active entropy generation updates.
This is admittedly partly "for discussion". We need to have a way
forward for the boot time deadlocks where user space ends up waiting for
more entropy, but no entropy is forthcoming because the system is
entirely idle just waiting for something to happen.
While this was triggered by what is arguably a user space bug with
GDM/gnome-session asking for secure randomness during early boot, when
they didn't even need any such truly secure thing, the issue ends up
being that our "getrandom()" interface is prone to that kind of
confusion, because people don't think very hard about whether they want
to block for sufficient amounts of entropy.
The approach taken here is to not just passively wait for entropy
to happen, but to start actively collecting it if it is missing. This
is not necessarily always possible, but if the architecture has a CPU
cycle counter, there is a fair amount of noise in the exact timings of
reasonably complex loads.
We may end up tweaking the load and the entropy estimates, but this
should be at least a reasonable starting point.
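As a rough illustration of the idea (a sketch, not necessarily the final
code), the loop below mixes a cycle-counter sample into the input pool on
every iteration while an on-stack timer keeps interrupting the CPU, until
the CRNG is ready; it assumes the internal helpers of drivers/char/random.c
(random_get_entropy(), mix_pool_bytes(), credit_entropy_bits(), crng_ready()):

static void entropy_timer(struct timer_list *t)
{
        /* each timer firing is credited as one bit of entropy */
        credit_entropy_bits(&input_pool, 1);
}

static void try_to_generate_entropy(void)
{
        struct {
                unsigned long now;
                struct timer_list timer;
        } stack;

        stack.now = random_get_entropy();

        /* Slow counter - or none. Don't even bother */
        if (stack.now == random_get_entropy())
                return;

        timer_setup_on_stack(&stack.timer, entropy_timer, 0);
        while (!crng_ready()) {
                /* keep a timer pending so IRQs perturb the loop timing */
                if (!timer_pending(&stack.timer))
                        mod_timer(&stack.timer, jiffies + 1);
                mix_pool_bytes(&input_pool, &stack.now, sizeof(stack.now));
                schedule();
                stack.now = random_get_entropy();
        }

        del_timer_sync(&stack.timer);
        destroy_timer_on_stack(&stack.timer);
        mix_pool_bytes(&input_pool, &stack.now, sizeof(stack.now));
}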
As part of this, we also revert the revert of the ext4 IO pattern
improvement that ended up triggering the reported lack of external
entropy.
* getrandom() active entropy waiting:
Revert "Revert "ext4: make __ext4_get_inode_loc plug""
random: try to actively add entropy rather than passively wait for it
This reverts commit 72dbcf7215.
Instead of waiting forever for entropy that may just not happen, we now
try to actively generate entropy when required, and are thus hopefully
avoiding the problem that caused the nice ext4 IO pattern fix to be
reverted.
So revert the revert.
Cc: Ahmed S. Darwish <darwish.07@gmail.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Alexander E. Patrakov <patrakov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler fixes from Ingo Molnar:
- Apply a number of membarrier related fixes and cleanups, which fixes
a use-after-free race in the membarrier code
- Introduce proper RCU protection for tasks on the runqueue - to get
rid of the subtle task_rcu_dereference() interface that was easy to
get wrong
- Misc fixes, but also an EAS speedup
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Avoid redundant EAS calculation
sched/core: Remove double update_max_interval() call on CPU startup
sched/core: Fix preempt_schedule() interrupt return comment
sched/fair: Fix -Wunused-but-set-variable warnings
sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
sched/membarrier: Skip IPIs when mm->mm_users == 1
selftests, sched/membarrier: Add multi-threaded test
sched/membarrier: Fix p->mm->membarrier_state racy load
sched/membarrier: Call sync_core only before usermode for same mm
sched/membarrier: Remove redundant check
sched/membarrier: Fix private expedited registration check
tasks, sched/core: RCUify the assignment of rq->curr
tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
tasks: Add a count of task RCU users
sched/core: Convert vcpu_is_preempted() from macro to an inline function
sched/fair: Remove unused cfs_rq_clock_task() function
Pull kernel lockdown mode from James Morris:
"This is the latest iteration of the kernel lockdown patchset, from
Matthew Garrett, David Howells and others.
From the original description:
This patchset introduces an optional kernel lockdown feature,
intended to strengthen the boundary between UID 0 and the kernel.
When enabled, various pieces of kernel functionality are restricted.
Applications that rely on low-level access to either hardware or the
kernel may cease working as a result - therefore this should not be
enabled without appropriate evaluation beforehand.
The majority of mainstream distributions have been carrying variants
of this patchset for many years now, so there's value in providing a
common implementation upstream. It doesn't meet every distribution
requirement, but gets us much closer to not requiring external patches.
There are two major changes since this was last proposed for mainline:
- Separating lockdown from EFI secure boot. Background discussion is
covered here: https://lwn.net/Articles/751061/
- Implementation as an LSM, with a default stackable lockdown LSM
module. This allows the lockdown feature to be policy-driven,
rather than encoding an implicit policy within the mechanism.
The new locked_down LSM hook is provided to allow LSMs to make a
policy decision around whether kernel functionality that would allow
tampering with or examining the runtime state of the kernel should be
permitted.
The included lockdown LSM provides an implementation with a simple
policy intended for general purpose use. This policy provides a coarse
level of granularity, controllable via the kernel command line:
lockdown={integrity|confidentiality}
Enable the kernel lockdown feature. If set to integrity, kernel features
that allow userland to modify the running kernel are disabled. If set to
confidentiality, kernel features that allow userland to extract
confidential information from the kernel are also disabled.
This may also be controlled via /sys/kernel/security/lockdown and
overridden by kernel configuration.
New or existing LSMs may implement finer-grained controls of the
lockdown features. Refer to the lockdown_reason documentation in
include/linux/security.h for details.
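As a rough sketch of how a subsystem consults the new hook (the
surrounding function is hypothetical; LOCKDOWN_DEBUGFS is one of the
enum lockdown_reason values):

#include <linux/security.h>

static int example_debugfs_write(void)
{
        int ret;

        /* 0 when allowed, -EPERM when the lockdown policy denies it */
        ret = security_locked_down(LOCKDOWN_DEBUGFS);
        if (ret)
                return ret;

        /* ... perform the operation that could tamper with the kernel ... */
        return 0;
}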
The lockdown feature has had significant design feedback and review
across many subsystems. This code has been in linux-next for some
weeks, with a few fixes applied along the way.
Stephen Rothwell noted that commit 9d1f8be5cf ("bpf: Restrict bpf
when kernel lockdown is in confidentiality mode") is missing a
Signed-off-by from its author. Matthew responded that he is providing
this under category (c) of the DCO"
* 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
kexec: Fix file verification on S390
security: constify some arrays in lockdown LSM
lockdown: Print current->comm in restriction messages
efi: Restrict efivar_ssdt_load when the kernel is locked down
tracefs: Restrict tracefs when the kernel is locked down
debugfs: Restrict debugfs when the kernel is locked down
kexec: Allow kexec_file() with appropriate IMA policy when locked down
lockdown: Lock down perf when in confidentiality mode
bpf: Restrict bpf when kernel lockdown is in confidentiality mode
lockdown: Lock down tracing and perf kprobes when in confidentiality mode
lockdown: Lock down /proc/kcore
x86/mmiotrace: Lock down the testmmiotrace module
lockdown: Lock down module params that specify hardware parameters (eg. ioport)
lockdown: Lock down TIOCSSERIAL
lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
acpi: Disable ACPI table override if the kernel is locked down
acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
ACPI: Limit access to custom_method when the kernel is locked down
x86/msr: Restrict MSR access when the kernel is locked down
x86: Lock down IO port access when the kernel is locked down
...
Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Add a new knfsd file cache, so that we don't have to open and close
on each (NFSv2/v3) READ or WRITE. This can speed up read and write
in some cases. It also replaces our readahead cache.
- Prevent silent data loss on write errors, by treating write errors
like server reboots for the purposes of write caching, thus forcing
clients to resend their writes.
- Tweak the code that allocates sessions to be more forgiving, so
that NFSv4.1 mounts are less likely to hang when a server already
has a lot of clients.
- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be
limited only by the backend filesystem and the maximum RPC size.
- Allow the server to enforce use of the correct kerberos credentials
when a client reclaims state after a reboot.
And some miscellaneous smaller bugfixes and cleanup"
* tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits)
sunrpc: clean up indentation issue
nfsd: fix nfs read eof detection
nfsd: Make nfsd_reset_boot_verifier_locked static
nfsd: degraded slot-count more gracefully as allocation nears exhaustion.
nfsd: handle drc over-allocation gracefully.
nfsd: add support for upcall version 2
nfsd: add a "GetVersion" upcall for nfsdcld
nfsd: Reset the boot verifier on all write I/O errors
nfsd: Don't garbage collect files that might contain write errors
nfsd: Support the server resetting the boot verifier
nfsd: nfsd_file cache entries should be per net namespace
nfsd: eliminate an unnecessary acl size limit
Deprecate nfsd fault injection
nfsd: remove duplicated include from filecache.c
nfsd: Fix the documentation for svcxdr_tmpalloc()
nfsd: Fix up some unused variable warnings
nfsd: close cached files prior to a REMOVE or RENAME that would replace target
nfsd: rip out the raparms cache
nfsd: have nfsd_test_lock use the nfsd_file cache
nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
...
Merge tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse virtio-fs support from Miklos Szeredi:
"Virtio-fs allows exporting directory trees on the host and mounting
them in guest(s).
This isn't actually a new filesystem, but a glue layer between the
fuse filesystem and a virtio based back-end.
It's similar in functionality to the existing virtio-9p solution, but
significantly faster in benchmarks and has better POSIX compliance.
Further performance improvements can be achieved by sharing the page
cache between host and guest, allowing for faster I/O and reduced
memory use.
Kata Containers have been including the out-of-tree virtio-fs (with
the shared page cache patches as well) since version 1.7 as an
experimental feature. They have been active in development and plan to
switch from virtio-9p to virtio-fs as their default solution. There
has been interest from other sources as well.
The userspace infrastructure is slated to be merged into qemu once the
kernel part hits mainline.
This was developed by Vivek Goyal, Dave Gilbert and Stefan Hajnoczi"
* tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
virtio-fs: add virtiofs filesystem
virtio-fs: add Documentation/filesystems/virtiofs.rst
fuse: reserve values for mapping protocol
Merge tag '9p-for-5.4' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Some of the usual small fixes and cleanup.
Small fixes all around:
- avoid overlayfs copy-up for PRIVATE mmaps
- KUMSAN uninitialized warning for transport error
- one syzbot memory leak fix in 9p cache
- internal API cleanup for v9fs_fill_super"
* tag '9p-for-5.4' of git://github.com/martinetd/linux:
9p/vfs_super.c: Remove unused parameter data in v9fs_fill_super
9p/cache.c: Fix memory leak in v9fs_cache_session_get_cookie
9p: Transport error uninitialized
9p: avoid attaching writeback_fid on mmap with type PRIVATE
Merge tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block
Pull more io_uring updates from Jens Axboe:
"Just two things in here:
- Improvement to the io_uring CQ ring wakeup for batched IO (me)
- Fix wrong comparison in poll handling (yangerkun)
I realize the first one is a little late in the game, but it felt
pointless to hold it off until the next release. Went through various
testing and reviews with Pavel and peterz"
* tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block:
io_uring: make CQ ring wakeups be more efficient
io_uring: compare cached_cq_tail with cq.head in_io_uring_poll
[BUG]
The following script can cause btrfs qgroup data space leak:
mkfs.btrfs -f $dev
mount $dev -o nospace_cache $mnt
btrfs subv create $mnt/subv
btrfs quota en $mnt
btrfs quota rescan -w $mnt
btrfs qgroup limit 128m $mnt/subv
for (( i = 0; i < 3; i++)); do
        # Create 3 64M holes for the later fallocate to fail
        truncate -s 192m $mnt/subv/file
        xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
        xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
        sync
        # it's supposed to fail, and each failure will leak at least 64M
        # data space
        xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
        rm $mnt/subv/file
        sync
done
# Shouldn't fail after we removed the file
xfs_io -f -c "falloc 0 64m" $mnt/subv/file
[CAUSE]
Btrfs qgroup data reserve code allow multiple reservations to happen on
a single extent_changeset:
E.g:
btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
Btrfs qgroup code has its internal tracking to make sure we don't
double-reserve in above example.
The only pattern utilizing this feature is in the main while loop of
the btrfs_fallocate() function.
However btrfs_qgroup_reserve_data()'s error handling has a bug: on
error it clears all ranges in the io_tree with the
EXTENT_QGROUP_RESERVED flag but doesn't free previously reserved bytes.
This bug has a twofold effect:
- Clearing EXTENT_QGROUP_RESERVED ranges
  This is the correct behavior, but it prevents
  btrfs_qgroup_check_reserved_leak() from catching the leakage, as the
  detector is purely EXTENT_QGROUP_RESERVED flag based.
- Leaking the previously reserved data bytes
  The bug manifests when N calls to btrfs_qgroup_reserve_data() are made
  and the last one fails, leaking the space reserved in the previous ones.
[FIX]
Also free previously reserved data bytes when btrfs_qgroup_reserve_data
fails.
Fixes: 5247255370 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Under the following case with qgroup enabled, if some error happened
after we have reserved delalloc space, then in error handling path, we
could cause qgroup data space leakage:
From btrfs_truncate_block() in inode.c:
        ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
                                           block_start, blocksize);
        if (ret)
                goto out;
again:
        page = find_or_create_page(mapping, index, mask);
        if (!page) {
                btrfs_delalloc_release_space(inode, data_reserved,
                                             block_start, blocksize, true);
                btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
                ret = -ENOMEM;
                goto out;
        }
[CAUSE]
In the above case, btrfs_delalloc_reserve_space() will call
btrfs_qgroup_reserve_data() and mark the io_tree range with
EXTENT_QGROUP_RESERVED flag.
In the error handling path, we have the following call stack:
btrfs_delalloc_release_space()
|- btrfs_free_reserved_data_space()
   |- btrfs_qgroup_free_data()
      |- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
         |- qgroup_free_reserved_data(reserved=@reserved)
            |- clear_record_extent_bits();
            |- freed += changeset.bytes_changed;
However, due to a completion bug, qgroup_free_reserved_data() will clear
the EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree rather
than in the correct BTRFS_I(inode)->io_tree.
Since io_failure_tree is never marked with that flag,
btrfs_qgroup_free_data() will not free any data reserved space at all,
causing a leakage.
This type of error handling can only be triggered by errors outside of
qgroup code, so an EDQUOT error from qgroup itself can't trigger it.
[FIX]
Fix the wrong target io_tree.
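A sketch of the corrected call in qgroup_free_reserved_data(), with the
range variables taken from context (illustrative; only the target tree
changes):

        ret = clear_record_extent_bits(&BTRFS_I(inode)->io_tree, /* was io_failure_tree */
                                       free_start, free_start + free_len - 1,
                                       EXTENT_QGROUP_RESERVED, &changeset);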
Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: bc42bda223 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There may be situations when a server negotiates SMB 2.1
protocol version or higher but responds to a CREATE request
with an oplock rather than a lease.
Currently the client doesn't handle such a case correctly:
when another CREATE comes in the server sends an oplock
break to the initial CREATE and the client doesn't send
an ack back due to a wrong caching level being set (READ
instead of RWH). Missing an oplock break ack makes the
server wait until the break times out which dramatically
increases the latency of the second CREATE.
Fix this by properly detecting oplocks when using SMB 2.1
protocol version and higher.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Various SMB3 ACL related flags (for security descriptor and
ACEs for example) were missing and some fields are different
in SMB3 and CIFS. Update cifsacl.h definitions based on
current MS-DTYP specification.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Merge tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Stable bugfixes:
- Dequeue the request from the receive queue while we're re-encoding
# v4.20+
- Fix buffer handling of GSS MIC without slack # 5.1
Features:
- Increase xprtrdma maximum transport header and slot table sizes
- Add support for nfs4_call_sync() calls using a custom
rpc_task_struct
- Optimize the default readahead size
- Enable pNFS filelayout LAYOUTGET on OPEN
Other bugfixes and cleanups:
- Fix possible null-pointer dereferences and memory leaks
- Various NFS over RDMA cleanups
- Various NFS over RDMA comment updates
- Don't receive TCP data into a reset request buffer
- Don't try to parse incomplete RPC messages
- Fix congestion window race with disconnect
- Clean up pNFS return-on-close error handling
- Fixes for NFS4ERR_OLD_STATEID handling"
* tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (53 commits)
pNFS/filelayout: enable LAYOUTGET on OPEN
NFS: Optimise the default readahead size
NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
NFSv4: Fix OPEN_DOWNGRADE error handling
pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state seqid
NFSv4: Add a helper to increment stateid seqids
NFSv4: Handle RPC level errors in LAYOUTRETURN
NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
NFSv4: Clean up pNFS return-on-close error handling
pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors
NFS: remove unused check for negative dentry
NFSv3: use nfs_add_or_obtain() to create and reference inodes
NFS: Refactor nfs_instantiate() for dentry referencing callers
SUNRPC: Fix congestion window race with disconnect
SUNRPC: Don't try to parse incomplete RPC messages
SUNRPC: Rename xdr_buf_read_netobj to xdr_buf_read_mic
SUNRPC: Fix buffer handling of GSS MIC without slack
SUNRPC: RPC level errors should always set task->tk_rpc_status
SUNRPC: Don't receive TCP data into a request buffer that has been reset
...
When brk was moved for binaries without an interpreter, it should have
been limited to ET_DYN only. In other words, the special case was an
ET_DYN that lacks an INTERP, not just an executable that lacks INTERP.
The bug manifested for giant static executables, where the brk would end
up in the middle of the text area on 32-bit architectures.
Reported-and-tested-by: Richard Kojedzinszky <richard@kojedz.in>
Fixes: bbdc6076d2 ("binfmt_elf: move brk out of mmap when doing direct loader exec")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'xfs-5.4-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"There are a couple of bug fixes and some small code cleanups that came
in recently:
- Minor code cleanups
- Fix a superblock logging error
- Ensure that collapse range converts the data fork to extents format
when necessary
- Revert the ALLOC_USERDATA cleanup because it caused subtle behavior
regressions"
* tag 'xfs-5.4-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: avoid unused to_mp() function warning
xfs: log proper length of superblock
xfs: revert 1baa2800e6 ("xfs: remove the unused XFS_ALLOC_USERDATA flag")
xfs: removed unneeded variable
xfs: convert inode to extent format after extent merge due to shift
Pull jffs2 fix from Al Viro:
"braino fix for mount API conversion for jffs2"
* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
jffs2: Fix mounting under new mount API
Merge more updates from Andrew Morton:
- almost all of the rest of -mm
- various other subsystems
Subsystems affected by this patch series:
memcg, misc, core-kernel, lib, checkpatch, reiserfs, fat, fork,
cpumask, kexec, uaccess, kconfig, kgdb, bug, ipc, lzo, kasan, madvise,
cleanups, pagemap
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (77 commits)
arch/sparc/include/asm/pgtable_64.h: fix build
mm: treewide: clarify pgtable_page_{ctor,dtor}() naming
ntfs: remove (un)?likely() from IS_ERR() conditions
IB/hfi1: remove unlikely() from IS_ERR*() condition
xfs: remove unlikely() from WARN_ON() condition
wimax/i2400m: remove unlikely() from WARN*() condition
fs: remove unlikely() from WARN_ON() condition
xen/events: remove unlikely() from WARN() condition
checkpatch: check for nested (un)?likely() calls
hexagon: drop empty and unused free_initrd_mem
mm: factor out common parts between MADV_COLD and MADV_PAGEOUT
mm: introduce MADV_PAGEOUT
mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
mm: introduce MADV_COLD
mm: untag user pointers in mmap/munmap/mremap/brk
vfio/type1: untag user pointers in vaddr_get_pfn
tee/shm: untag user pointers in tee_shm_register
media/v4l2-core: untag user pointers in videobuf_dma_contig_user_get
drm/radeon: untag user pointers in radeon_gem_userptr_ioctl
drm/amdgpu: untag user pointers
...
The mounting of jffs2 is broken due to the changes from the new mount API
because it specifies a "source" parameter, but then doesn't actually
process it. And because it is specified, the parser doesn't return
-ENOPARAM, so the caller doesn't process it either and the source gets lost.
Fix this by simply removing the source parameter from jffs2 and letting the
VFS deal with it in the default manner.
To test it, enable CONFIG_MTD_MTDRAM and allow the default size and erase
block size parameters, then try to mount the /dev/mtdblock<N> file that
this creates as jffs2. There is no need to initialise it.
Fixes: ec10a24f10 ("vfs: Convert jffs2 to use the new mount API")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Woodhouse <dwmw2@infradead.org>
cc: Richard Weinberger <richard@nod.at>
cc: linux-mtd@lists.infradead.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For batched IO, it's not uncommon for waiters to ask for more than one
IO to complete before being woken up. This is a problem with
wait_event() since tasks will get woken for every IO that completes,
re-check the condition, then go back to sleep. For batch counts on the
order of what you'd use for high IOPS, that can result in tens of extra
wakeups for the waiting task.
Add a private wake function that checks for the wake up count criteria
being met before calling autoremove_wake_function(). Pavel reports that
one test case he has runs 40% faster with proper batching of wakeups.
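A hedged sketch of the pattern (the struct and helper below are
illustrative, not the exact io_uring code): the private wake function
only falls through to autoremove_wake_function() once enough completions
have arrived.

#include <linux/wait.h>
#include <linux/sched.h>

struct io_wait_queue {
        struct wait_queue_entry wq;
        unsigned int to_wait;   /* completions the task asked to wait for */
        unsigned int nr_done;   /* completions seen so far, maintained by
                                   the completion path in this sketch */
};

static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
                            int wake_flags, void *key)
{
        struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, wq);

        /* Not enough completions yet: stay asleep, skip the spurious wakeup */
        if (iowq->nr_done < iowq->to_wait)
                return -1;

        return autoremove_wake_function(curr, mode, wake_flags, key);
}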
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to populate an ACL (security descriptor open context)
on file and directory create correctly. This patch passes in the
mode. A follow-on patch will build the open context and the
security descriptor (from the mode) that goes in the open
context.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
This patch is a part of a series that extends the kernel ABI to allow
passing tagged user pointers (with the top byte set to something other
than 0x00) as syscall arguments.
The userfaultfd code uses the provided user pointers for vma lookups,
which can only be done with untagged pointers.
Untag user pointers in validate_range().
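An abbreviated, hypothetical sketch of validate_range() with the
untagging added up front (the real function performs additional checks):

static __always_inline int validate_range(struct mm_struct *mm,
                                          __u64 start, __u64 len)
{
        start = untagged_addr(start);   /* vma lookups need untagged addresses */

        if (start & ~PAGE_MASK)
                return -EINVAL;
        if (len & ~PAGE_MASK)
                return -EINVAL;
        if (!len)
                return -EINVAL;
        if (start >= mm->task_size)
                return -EINVAL;
        if (len > mm->task_size - start)
                return -EINVAL;
        return 0;
}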
Link: http://lkml.kernel.org/r/cdc59ddd7011012ca2e689bc88c3b65b1ea7e413.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a part of a series that extends the kernel ABI to allow
passing tagged user pointers (with the top byte set to something other
than 0x00) as syscall arguments.
In copy_mount_options() a user address is subtracted from TASK_SIZE.
If the address is lower than TASK_SIZE, the size is calculated so that
the exact_copy_from_user() call cannot cross the TASK_SIZE boundary.
However, if the address is tagged, then the size will be calculated
incorrectly.
Untag the address before subtracting.
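A minimal sketch of the relevant lines in copy_mount_options(), assuming
the surrounding code is unchanged:

        /* copy_from_user cannot cross TASK_SIZE! */
        size = TASK_SIZE - (unsigned long)untagged_addr(data);
        if (size > PAGE_SIZE)
                size = PAGE_SIZE;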
Link: http://lkml.kernel.org/r/1de225e4a54204bfd7f25dac2635e31aa4aa1d90.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
brelse() tests whether its argument is NULL and then returns immediately.
Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
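The pattern being removed looks like this (illustrative):

        /* before: redundant NULL check */
        if (bh)
                brelse(bh);

        /* after: brelse() already ignores a NULL buffer_head */
        brelse(bh);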
Link: http://lkml.kernel.org/r/cfff3b81-fb5d-af26-7b5e-724266509045@web.de
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following gcc warning:
fs/reiserfs/do_balan.c: In function balance_leaf_insert_right:
fs/reiserfs/do_balan.c:629:6: warning: variable ret set but not used
[-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/20190827032932.46622-2-yanaijie@huawei.com
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Cc: zhengbin <zhengbin13@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following gcc warning:
fs/reiserfs/journal.c: In function flush_used_journal_lists:
fs/reiserfs/journal.c:1791:6: warning: variable ret set but not used
[-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/20190827032932.46622-1-yanaijie@huawei.com
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Cc: zhengbin <zhengbin13@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/reiserfs/do_balan.c: In function balance_leaf_when_delete:
fs/reiserfs/do_balan.c:245:20: warning: variable ih set but not used [-Wunused-but-set-variable]
fs/reiserfs/do_balan.c: In function balance_leaf_insert_left:
fs/reiserfs/do_balan.c:301:7: warning: variable version set but not used [-Wunused-but-set-variable]
fs/reiserfs/do_balan.c: In function balance_leaf_insert_right:
fs/reiserfs/do_balan.c:649:7: warning: variable version set but not used [-Wunused-but-set-variable]
fs/reiserfs/do_balan.c: In function balance_leaf_new_nodes_insert:
fs/reiserfs/do_balan.c:953:7: warning: variable version set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/1566379929-118398-8-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/reiserfs/fix_node.c: In function get_num_ver:
fs/reiserfs/fix_node.c:379:6: warning: variable cur_free set but not used [-Wunused-but-set-variable]
fs/reiserfs/fix_node.c: In function dc_check_balance_internal:
fs/reiserfs/fix_node.c:1737:6: warning: variable maxsize set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/1566379929-118398-7-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/reiserfs/prints.c: In function check_internal_block_head:
fs/reiserfs/prints.c:749:21: warning: variable blkh set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/1566379929-118398-6-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/reiserfs/objectid.c: In function reiserfs_convert_objectid_map_v1:
fs/reiserfs/objectid.c:186:25: warning: variable new_objectid_map set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/1566379929-118398-5-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/reiserfs/lbalance.c: In function leaf_paste_entries:
fs/reiserfs/lbalance.c:1325:9: warning: variable old_entry_num set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/1566379929-118398-4-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>