Граф коммитов

71563 Коммитов

Автор SHA1 Сообщение Дата
David Sterba 1aeb6b563a btrfs: clear log tree recovering status if starting transaction fails
When a log recovery is in progress, lots of operations have to take that
into account, so we keep this status per tree during the operation. Long
time ago error handling revamp patch 79787eaab4 ("btrfs: replace many
BUG_ONs with proper error handling") removed clearing of the status in
an error branch. Add it back as was intended in e02119d5a7 ("Btrfs:
Add a write ahead tree log to optimize synchronous operations").

There are probably no visible effects, log replay is done only during
mount and if it fails all structures are cleared so the stale status
won't be kept.

Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:06 +02:00
David Sterba 6819703f5a btrfs: clear defrag status of a root if starting transaction fails
The defrag loop processes leaves in batches and starting transaction for
each. The whole defragmentation on a given root is protected by a bit
but in case the transaction fails, the bit is not cleared

In case the transaction fails the bit would prevent starting
defragmentation again, so make sure it's cleared.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:06 +02:00
David Sterba 8c5ec99561 btrfs: sysfs: fix format string for some discard stats
The type of discard_bitmap_bytes and discard_extent_bytes is u64 so the
format should be %llu, though the actual values would hardly ever
overflow to negative values.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:06 +02:00
Josef Bacik 5963ffcaf3 btrfs: always abort the transaction if we abort a trans handle
While stress testing our error handling I noticed that sometimes we
would still commit the transaction even though we had aborted the
transaction.

Currently we track if a trans handle has dirtied any metadata, and if it
hasn't we mark the filesystem as having an error (so no new transactions
can be started), but we will allow the current transaction to complete
as we do not mark the transaction itself as having been aborted.

This sounds good in theory, but we were not properly tracking IO errors
in btrfs_finish_ordered_io, and thus committing the transaction with
bogus free space data.  This isn't necessarily a problem per-se with the
free space cache, as the other guards in place would have kept us from
accepting the free space cache as valid, but highlights a real world
case where we had a bug and could have corrupted the filesystem because
of it.

This "skip abort on empty trans handle" is nice in theory, but assumes
we have perfect error handling everywhere, which we clearly do not.
Also we do not allow further transactions to be started, so all this
does is save the last transaction that was happening, which doesn't
necessarily gain us anything other than the potential for real
corruption.

Remove this particular bit of code, if we decide we need to abort the
transaction then abort the current one and keep us from doing real harm
to the file system, regardless of whether this specific trans handle
dirtied anything or not.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:06 +02:00
Filipe Manana 0d7d316597 btrfs: don't set the full sync flag when truncation does not touch extents
At btrfs_truncate() where we truncate the inode either to the same size
or to a smaller size, we always set the full sync flag on the inode.

This is needed in case the truncation drops or trims any file extent items
that start beyond or cross the new inode size, so that the next fsync
drops all inode items from the log and scans again the fs/subvolume tree
to find all items that must be logged.

However if the truncation does not drop or trims any file extent items, we
do not need to set the full sync flag and force the next fsync to use the
slow code path. So do not set the full sync flag in such cases.

One use case where it is frequent to do truncations that do not change
the inode size and do not drop any extents (no prealloc extents beyond
i_size) is when running Microsoft's SQL Server inside a Docker container.
One example workload is the one Philipp Fent reported recently, in the
thread with a link below. In this workload a large number of fsyncs are
preceded by such truncate operations.

After this change I constantly get the runtime for that workload from
Philipp to be reduced by about -12%, for example from 184 seconds down
to 162 seconds.

Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Filipe Manana 4f7e67378e btrfs: fix misleading and incomplete comment of btrfs_truncate()
The comment at the top of btrfs_truncate() mentions that csum items are
dropped or truncated to the new i_size, but this is wrong and non sense,
as they are unrelated to the i_size and are located in the csums tree and
not on a tree with inode items (fs/subvolume tree or a log tree). Instead
that claim applies to file extent items, so fix the comment to refer to
them instead.

While at it make the whole comment for the function more descriptive and
follow the kernel doc style.

Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Josef Bacik 04587ad9be btrfs: abort transaction if we fail to update the delayed inode
If we fail to update the delayed inode we need to abort the transaction,
because we could leave an inode with the improper counts or some other
such corruption behind.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Josef Bacik bb385bedde btrfs: fix error handling in __btrfs_update_delayed_inode
If we get an error while looking up the inode item we'll simply bail
without cleaning up the delayed node.  This results in this style of
warning happening on commit:

  WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90
  CPU: 0 PID: 76403 Comm: fsstress Tainted: G        W         5.13.0-rc1+ #373
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90
  RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286
  RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000
  RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060
  RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000
  R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00
  R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90
  FS:  00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0
  Call Trace:
   btrfs_commit_transaction+0x43c/0xb00
   ? finish_wait+0x80/0x80
   ? vfs_fsync_range+0x90/0x90
   iterate_supers+0x8c/0x100
   ksys_sync+0x50/0x90
   __do_sys_sync+0xa/0x10
   do_syscall_64+0x3d/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Because the iref isn't dropped and this leaves an elevated node->count,
so any release just re-queues it onto the delayed inodes list.  Fix this
by going to the out label to handle the proper cleanup of the delayed
node.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Josef Bacik a4cb90dc01 btrfs: make btrfs_release_delayed_iref handle the !iref case
Right now we only cleanup the delayed iref if we have
BTRFS_DELAYED_NODE_DEL_IREF set on the node.  However we have some error
conditions that need to cleanup the iref if it still exists, so to make
this code cleaner move the test_bit into btrfs_release_delayed_iref
itself and unconditionally call it in each of the cases instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
David Sterba eb3b505366 btrfs: scrub: per-device bandwidth control
Add sysfs interface to limit io during scrub. We relied on the ionice
interface to do that, eg. the idle class let the system usable while
scrub was running. This has changed when mq-deadline got widespread and
did not implement the scheduling classes. That was a CFQ thing that got
deleted. We've got numerous complaints from users about degraded
performance.

Currently only BFQ supports that but it's not a common scheduler and we
can't ask everybody to switch to it.

Alternatively the cgroup io limiting can be used but that also a
non-trivial setup (v2 required, the controller must be enabled on the
system). This can still be used if desired.

Other ideas that have been explored: piggy-back on ionice (that is set
per-process and is accessible) and interpret the class and classdata as
bandwidth limits, but this does not have enough flexibility as there are
only 8 allowed and we'd have to map fixed limits to each value. Also
adjusting the value would need to lookup the process that currently runs
scrub on the given device, and the value is not sticky so would have to
be adjusted each time scrub runs.

Running out of options, sysfs does not look that bad:

- it's accessible from scripts, or udev rules
- the name is similar to what MD-RAID has
  (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
- the value is sticky at least for filesystem mount time
- adjusting the value has immediate effect
- sysfs is available in constrained environments (eg. system rescue)
- the limit also applies to device replace

Sysfs:

- raw value is in bytes
- values written to the file accept suffixes like K, M
- file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
- 0 means use default priority of IO

The scheduler is a simple deadline one and the accuracy is up to nearest
128K.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Johannes Thumshirn e7ff9e6b8e btrfs: zoned: factor out zoned device lookup
To be able to construct a zone append bio we need to look up the
btrfs_device. The code doing the chunk map lookup to get the device is
present in btrfs_submit_compressed_write and submit_extent_page.

Factor out the lookup calls into a helper and use it in the submission
paths.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Tian Tao 50535db8fb btrfs: return EAGAIN if defrag is canceled
When inode defrag is canceled, the error is set to EAGAIN but then
overwritten by number of defragmented bytes. As this would hide the
error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it
will print the error and continue to next file (as it does in for any
other error).

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Qu Wenruo 1245835d24 btrfs: remove io_failure_record::in_validation
The io_failure_record::in_validation was introduced to handle failed bio
which cross several sectors.  In such case, we still need to verify
which sectors are corrupted.

But since we've changed the way how we handle corrupted sectors, by only
submitting repair for each corrupted sector, there is no need for extra
validation any more.

This patch will cleanup all io_failure_record::in_validation related
code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Qu Wenruo 150e4b0597 btrfs: submit read time repair only for each corrupted sector
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair.  But we can avoid all
these extra mechanisms if we submit the repair for each sector.

By this, each read repair can be easily handled without the need to
verify which sector is corrupted.

This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.

So this patch will:

- Introduce repair_one_sector()
  The main code submitting repair, which is more or less the same as old
  btrfs_submit_read_repair().
  But this time, it only repairs one sector.

- Make btrfs_submit_read_repair() to handle sectors differently
  There are 3 different cases:

  * Good sector
    We need to release the page and extent, set the range uptodate.

  * Bad sector and failed to submit repair bio
    We need to release the page and extent, but not set the range
    uptodate.

  * Bad sector but repair bio submitted
    The page and extent release will be handled by the submitted repair
    bio. Nothing needs to be done.

  Since btrfs_submit_read_repair() will handle the page and extent
  release now, we need to skip to next bvec even we hit some error.

- Change the lifespan of @uptodate in end_bio_extent_readpage()
  Since now btrfs_submit_read_repair() will handle the full bvec
  which contains any corruption, we don't need to bother updating
  @uptodate bit anymore.
  Just let @uptodate to be local variable inside the main loop,
  so that any error from one bvec won't affect later bvec.

- Only export btrfs_repair_one_sector(), unexport
  btrfs_submit_read_repair()
  The only outside caller for read repair is DIO, which already submits
  its repair for just one sector.
  Only export btrfs_repair_one_sector() for DIO.

This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Qu Wenruo 08508fea07 btrfs: make btrfs_verify_data_csum() to return a bitmap
This will provide the basis for later per-sector repair for subpage,
while still keeping the existing code happy.

As if all csums match, the return value will be 0, same as now.
Only when csum mismatches, the return value is different.

The new return value will be a bitmap, for 4K sectorsize and 4K page
size, it will be either 1, instead of the -EIO (which is not used
directly by the callers, no effective change).

But for 4K sectorsize and 64K page size, aka subpage case, since the
bvec can contain multiple sectors, knowing which sectors are corrupted
will allow us to submit repair only for corrupted sectors.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Johannes Thumshirn f4dcfb3045 btrfs: rename check_async_write and let it return bool
The 'check_async_write' function is a helper used in
'btrfs_submit_metadata_bio' and it checks if asynchronous writing can be
used for metadata.

Make the function return bool and get rid of the local variable async in
btrfs_submit_metadata_bio storing the result of check_async_write's
tests.

As this is touching all function call sites, also rename it to
should_async_write as this is more in line with the naming we use.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Johannes Thumshirn 06e1e7f422 btrfs: zoned: bail out if we can't read a reliable write pointer
If we can't read a reliable write pointer from a sequential zone fail
creating the block group with an I/O error.

Also if the read write pointer is beyond the end of the respective zone,
fail the creation of the block group on this zone with an I/O error.

While this could also happen in real world scenarios with misbehaving
drives, this issue addresses a problem uncovered by fstests' test case
generic/475.

CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Naohiro Aota 47cdfb5e1d btrfs: zoned: print message when zone sanity check type fails
This extends patch 784daf2b96 ("btrfs: zoned: sanity check zone
type"), the message was supposed to be there but was lost during merge.
We want to make the error noticeable so add it.

Fixes: 784daf2b96 ("btrfs: zoned: sanity check zone type")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik 385f421f18 btrfs: handle preemptive delalloc flushing slightly differently
If we decide to flush delalloc from the preemptive flusher, we really do
not want to wait on ordered extents, as it gains us nothing.  However
there was logic to go ahead and wait on ordered extents if there was
more ordered bytes than delalloc bytes.  We do not want this behavior,
so pass through whether this flushing is for preemption, and do not wait
for ordered extents if that's the case.  Also break out of the shrink
loop after the first flushing, as we just want to one shot shrink
delalloc.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik 3e10156997 btrfs: only ignore delalloc if delalloc is much smaller than ordered
While testing heavy delalloc workloads I noticed that sometimes we'd
just stop preemptively flushing when we had loads of delalloc available
to flush.  This is because we skip preemptive flushing if delalloc <=
ordered.  However if we start with say 4gib of delalloc, and we flush
2gib of that, we'll stop flushing there, when we still have 2gib of
delalloc to flush.

Instead adjust the ordered bytes down by half, this way if 2/3 of our
outstanding delalloc reservations are tied up by ordered extents we
don't bother preemptive flushing, as we're getting close to the state
where we need to wait on ordered extents.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik 30acce4eb0 btrfs: don't include the global rsv size in the preemptive used amount
When deciding if we should preemptively flush space, we will add in the
amount of space used by all block rsvs.  However this also includes the
global block rsv, which isn't flushable so shouldn't be accounted for in
this calculation.  If we decide to use ->bytes_may_use in our used
calculation we need to subtract the global rsv size from this amount so
it most closely matches the flushable space.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik 1239e2da16 btrfs: use the global rsv size in the preemptive thresh calculation
We calculate the amount of "free" space available for normal
reservations by taking the total space and subtracting out the hard used
space, which is readonly, used, and reserved space.

However we weren't taking into account the global block rsv, which is
essentially hard used space.  Handle this by subtracting it from the
available free space, so that our threshold more closely mirrors
reality.

We need to do the check because it's possible that the global_rsv_size +
used is > total_bytes, sometimes the global reserve can end up being
calculated as larger than the available size (think small filesystems
where we only have the original 8MiB chunk of metadata).  It doesn't
usually happen, but that can get us into trouble so this is safer.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik 610a6ef44e btrfs: take into account global rsv in need_preemptive_reclaim
Global rsv can't be used for normal allocations, and for very full file
systems we can decide to try and async flush constantly even though
there's really not a lot of space to reclaim.  Deal with this by
including the global block rsv size in the "total used" calculation.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik 0aae4ca9e9 btrfs: only clamp the first time we have to start flushing
We were clamping the threshold for preemptive reclaim any time we added
a ticket to wait on, which if we have a lot of threads means we'd
essentially max out the clamp the first time we start to flush.

Instead of doing this, simply do it every time we have to start
flushing, this will make us ramp up gradually instead of going to max
clamping as soon as we start needing to do flushing.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik ed738ba7f9 btrfs: check worker before need_preemptive_reclaim
need_preemptive_reclaim() does some calculations, which aren't heavy,
but if we're already running preemptive reclaim there's no reason to do
them at all, so re-order the checks so that we don't do the calculation
if we're already doing reclaim.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Su Yue 94358c35d8 btrfs: remove stale comment for argument seed of btrfs_find_device
Commit b2598edf8b ("btrfs: remove unused argument seed from
btrfs_find_device") removed the argument seed from btrfs_find_device
but forgot the comment, so remove it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Goldwyn Rodrigues dc56219fe2 btrfs: correct try_lock_extent() usage in read_extent_buffer_subpage()
try_lock_extent() returns 1 on success or 0 for failure and not an error
code. If try_lock_extent() fails, read_extent_buffer_subpage() returns
zero indicating subpage extent read success.

Return EAGAIN/EWOULDBLOCK if try_lock_extent() fails in locking the
extent.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Steve French e0ae8a9aae smb311: remove dead code for non compounded posix query info
Although we may need this in some cases in the future, remove the
currently unused, non-compounded version of POSIX query info,
SMB11_posix_query_info (instead smb311_posix_query_path_info is now
called e.g. when revalidating dentries or retrieving info for getattr)

Addresses-Coverity: 1495708 ("Resource leaks")
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Steve French e39df24169 cifs: fix SMB1 error path in cifs_get_file_info_unix
We were trying to fill in uninitialized file attributes in the error case.

Addresses-Coverity: 139689 ("Uninitialized variables")
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Steve French ff93b71a3e smb3: fix uninitialized value for port in witness protocol move
Although in practice this can not occur (since IPv4 and IPv6 are the
only two cases currently supported), it is cleaner to avoid uninitialized
variable warnings.

Addresses smatch warning:
  fs/cifs/cifs_swn.c:468 cifs_swn_store_swn_addr() error: uninitialized symbol 'port'.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
CC: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Steve French 3559134ecc cifs: fix unneeded null check
tcon can not be null in SMB2_tcon function so the check
is not relevant and removing it makes Coverity happy.

Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Addresses-Coverity: 13250131 ("Dereference before null check")
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Steve French 929be906fa cifs: use SPDX-Licence-Identifier
Add SPDX license identifier and replace license boilerplate.
Corrects various checkpatch errors with the older format for
noting the LGPL license.

Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Baokun Li a506ccb47c cifs: convert list_for_each to entry variant in cifs_debug.c
convert list_for_each() to list_for_each_entry() where
applicable.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Baokun Li 647f592734 cifs: convert list_for_each to entry variant in smb2misc.c
convert list_for_each() to list_for_each_entry() where
applicable.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Ronnie Sahlberg ca38fabc31 cifs: avoid extra calls in posix_info_parse
In posix_info_parse() we call posix_info_sid_size twice for each of the owner and the group
sid. The first time to check that it is valid, i.e. >= 0 and the second time
to just pass it in as a length to memcpy().
As this is a pure function we know that it can not be negative the second time and this
is technically a false warning in coverity.
However, as it is a pure function we are just wasting cycles by calling it a second time.
Record the length from the first time we call it and save some cycles as well as make
Coverity happy.

Addresses-Coverity-ID: 1491379 ("Argument can not be negative")

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Thiago Rafael Becker 6efa994e35 cifs: retry lookup and readdir when EAGAIN is returned.
According to the investigation performed by Jacob Shivers at Red Hat,
cifs_lookup and cifs_readdir leak EAGAIN when the user session is
deleted on the server. Fix this issue by implementing a retry with
limits, as is implemented in cifs_revalidate_dentry_attr.

Reproducer based on the work by Jacob Shivers:

  ~~~
  $ cat readdir-cifs-test.sh
  #!/bin/bash

  # Install and configure powershell and sshd on the windows
  #  server as descibed in
  # https://docs.microsoft.com/en-us/windows-server/administration/openssh/openssh_overview
  # This script uses expect(1)

  USER=dude
  SERVER=192.168.0.2
  RPATH=root
  PASS='password'

  function debug_funcs {
  	for line in $@ ; do
  		echo "func $line +p" > /sys/kernel/debug/dynamic_debug/control
  	done
  }

  function setup {
  	echo 1 > /proc/fs/cifs/cifsFYI
  	debug_funcs wait_for_compound_request \
                smb2_query_dir_first cifs_readdir \
                compound_send_recv cifs_reconnect_tcon \
                generic_ip_connect cifs_reconnect \
                smb2_reconnect_server smb2_reconnect \
                cifs_readv_from_socket cifs_readv_receive
  	tcpdump -i eth0 -w cifs.pcap host 192.168.2.182 & sleep 5
  	dmesg -C
  }

  function test_call {
  	if [[ $1 == 1 ]] ; then
  		tracer="strace -tt -f -s 4096 -o trace-$(date -Iseconds).txt"
  	fi
        # Change the command here to anything appropriate
  	$tracer ls $2 > /dev/null
  	res=$?
  	if [[ $1 == 1 ]] ; then
  		if [[ $res == 0 ]] ; then
  			1>&2 echo success
  		else
  			1>&2 echo "failure ($res)"
  		fi
  	fi
  }

  mountpoint /mnt > /dev/null || mount -t cifs -o username=$USER,pass=$PASS //$SERVER/$RPATH /mnt

  test_call 0 /mnt/

  /usr/bin/expect << EOF
  	set timeout 60

  	spawn ssh $USER@$SERVER

  	expect "yes/no" {
  		send "yes\r"
  		expect "*?assword" { send "$PASS\r" }
  	} "*?assword" { send "$PASS\r" }

  	expect ">" { send "powershell close-smbsession -force\r" }
  	expect ">" { send "exit\r" }
  	expect eof
  EOF

  sysctl -w vm.drop_caches=2 > /dev/null
  sysctl -w vm.drop_caches=2 > /dev/null

  setup

  test_call 1 /mnt/
  ~~~

Signed-off-by: Thiago Rafael Becker <trbecker@gmail.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Paulo Alcantara 889c2a7007 cifs: fix check of dfs interlinks
Interlink is a special type of DFS link that resolves to a different
DFS domain-based namespace.  To determine whether it is an interlink
or not, check if ReferralServers and StorageServers bits are set to 1
and 0 respectively in ReferralHeaderFlags, as specified in MS-DFSC
3.1.5.4.5 Determining Whether a Referral Response is an Interlink.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Hyunchul Lee 0475c3655e cifs: decoding negTokenInit with generic ASN1 decoder
Decode negTokenInit with lib/asn1_decoder. For that,
add OIDs in linux/oid_registry.h and a negTokenInit
ASN1 file, "spnego_negtokeninit.asn1".
And define decoder's callback functions, which
are the gssapi_this_mech for checking SPENGO oid and
the neg_token_init_mech_type for getting authentication
mechanisms supported by a server.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Paulo Alcantara 1023e90b73 cifs: avoid starvation when refreshing dfs cache
When refreshing the DFS cache, keep SMB2 IOCTL calls as much outside
critical sections as possible and avoid read/write starvation when
getting new DFS referrals by using broken or slow connections.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Steve French 0d52df81e0 cifs: enable extended stats by default
CONFIG_CIFS_STATS2 can be very useful since it shows
latencies by command, and allows enabling the slow response
dynamic tracepoint which can be useful to identify
performance problems.

For example:

Total time spent processing by command. Time units are jiffies (1000 per second)
  SMB3 CMD	Number	Total Time	Fastest	Slowest
  --------	------	----------	-------	-------
  0		1	2		2	2
  1		2	6		2	4
  2		0	0		0	0
  3		4	11		2	4
  4		2	16		5	11
  5		4546	34104		2	487
  6		4421	32901		2	487
  7		0	0		0	0
  8		695	2781		2	39
  9		391	1708		2	27
  10		0	0		0	0
  11		4	6		1	2
  12		0	0		0	0
  13		0	0		0	0
  14		3887	17696		0	128
  15		0	0		0	0
  16		1471	9950		1	487
  17		169	2695		9	116
  18		80	381		2	10
  1		2	6		2	4
  2		0	0		0	0
  3		4	11		2	4
  4		2	16		5	11
  5		4546	34104		2	487
  6		4421	32901		2	487
  7		0	0		0	0
  8		695	2781		2	39
  9		391	1708		2	27
  10		0	0		0	0
  11		4	6		1	2
  12		0	0		0	0
  13		0	0		0	0
  14		3887	17696		0	128
  15		0	0		0	0
  16		1471	9950		1	487
  17		169	2695		9	116
  18		80	381		2	10

Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Shyam Prasad N e695a9ad03 cifs: missed ref-counting smb session in find
When we lookup an smb session based on session id,
we did not up the ref-count for the session. This can
potentially cause issues if the session is freed from under us.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Paulo Alcantara f3c852b0b0 cifs: do not share tcp servers with dfs mounts
It isn't enough to have unshared tcons because multiple DFS mounts can
connect to same target server and failover to different servers, so we
can't use a single tcp server for such cases.

For the simplest solution, use nosharesock option to achieve that.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Paulo Alcantara c950fc7af9 cifs: set a minimum of 2 minutes for refreshing dfs cache
We don't want to refresh the dfs cache in very short intervals, so
setting a minimum interval of 2 minutes is OK.

If it needs to be refreshed immediately, one could have the cache
cleared with

	$ echo 0 > /proc/fs/cifs/dfscache

and then remounting the dfs share.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Paulo Alcantara 42caeba713 cifs: fix path comparison and hash calc
Fix cache lookup and hash calculations when handling paths with
different cases.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:17 -05:00
Paulo Alcantara c870a8e70e cifs: handle different charsets in dfs cache
Convert all dfs paths to dfs cache's local codepage (@cache_cp) and
avoid mixing them with different charsets.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
Paulo Alcantara c9f7110399 cifs: keep referral server sessions alive
At every mount, keep all sessions alive that were used for chasing the
DFS referrals as long as the dfs mounts are active.

Use those sessions in DFS cache to refresh all active tcons as well as
cached entries.  They will be managed by a list of mount_group
structures that will be indexed by a randomly generated uuid at mount
time, so we can put all the sessions related to specific dfs mounts
and avoid leaking them.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
Paulo Alcantara 2b133b7e21 cifs: get rid of @noreq param in __dfs_cache_find()
@noreq param isn't used anywhere, so just remove it.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
Paulo Alcantara f3191fc800 cifs: do not send tree disconnect to ipc shares
On session close, the IPC is closed and the server must release all
tcons of the session.  It doesn't matter if we send a ipc close or
not.

Besides, it will make the server to not close durable and resilient
files on session close, as specified in MS-SMB2 3.3.5.6 Receiving an
SMB2 LOGOFF Request.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
Ronnie Sahlberg 966a3cb7c7 cifs: improve fallocate emulation
RHBZ: 1866684

We don't have a real fallocate in the SMB2 protocol so we used to emulate fallocate
by simply switching the file to become non-sparse. But as that could potantially consume
a lot more data than we intended to fallocate (large sparse file and fallocating a thin
slice in the middle) we would only do this IFF the fallocate request was for virtually
the entire file.

This patch improves this and starts allowing us to fallocate smaller chunks of a file by
overwriting the region with 0, for the parts that are unallocated.

The method used is to first query the server for FSCTL_QUERY_ALLOCATED_RANGES to find what
is unallocated in the fallocate range and then to only overwrite-with-zero the unallocated
ranges to fill in the holes.

As overwriting-with-zero is different from just allocating blocks, and potentially much
more expensive, we limit this to only allow fallocate ranges up to 1Mb in size.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Aurelien Aptel <aaptel@suse.com>
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
Baokun Li aaf36df3ed cifs: fix doc warnings in cifs_dfs_ref.c
Add description for `cifs_compose_mount_options` to fix the W=1 warnings:

 fs/cifs/cifs_dfs_ref.c:139: warning: Function parameter or
  member 'devname' not described in 'cifs_compose_mount_options'

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
Colin Ian King 032e091d3e cifs: remove redundant initialization of variable rc
The variable rc is being initialized with a value that is never read, the
assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
Rikard Falkeborn 57c8ce7ab3 cifs: Constify static struct genl_ops
The only usage of cifs_genl_ops[] is to assign its address to the ops
field in the genl_family struct, which is a pointer to const. Make it
const to allow the compiler to put it in read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
YueHaibing a23a71abca cifs: Remove unused inline function is_sysvol_or_netlogon()
is_sysvol_or_netlogon() is never used, so can remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
Steve French f2756527d3 cifs: remove duplicated prototype
smb2_find_smb_ses was defined twice in smb2proto.h

Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
Aurelien Aptel 5e538959f0 cifs: fix ipv6 formating in cifs_ses_add_channel
Use %pI6 for IPv6 addresses

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-06-20 21:28:16 -05:00
Linus Torvalds 6fab154a33 for-5.13-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDNEFMACgkQxWXV+ddt
 WDuZQg/7BpGG3IDhxydM7fUrNT0xmW2/0VG8blXAgNTiaUO1zOrlrlDKm38+dtW6
 yEv3ruf68tggrPNRCkyh51n45+ExqNwc7WwrxaKIRKmvYhYDsxnt8JLiNkv64isi
 R/CQVETX1cKsMuRhBuqmUq3Sy6VJZoi6coUHIC7ddBcLqnz8c9p7oGqzxBT8J9u3
 1CkDSeLM4HKlISlVKhmT4lRG28cQTuy3mSABUt7N5ljJvrrpQAvEN1HCOE9XUQFQ
 wHH2DjNnBMvfB7mrGCBL7XGf8DF6ucgcyfofuOj6CQLFJ8bOnVKsk8dk/8XUQod+
 rQoNIrVwW91LjmEO/I767JmjrRMtHbXvl3DEw3BvaD/O4efw78SN2VN+DRi4j7Xx
 aMtAWWfakfIyyJNZu9IEDa736iCdp+yl4bnq+hZpqmOYRqTq8n/zWuCMWZ5ohNay
 JyjxCm+xgo3vH9VEgzje6GDUki3I4Bwe7VlsaMr9F6F5GKzFp/4fb9lCewBrH6le
 +Y4gWxRT09plThsC2N3qmBQ9uVIJUyzmvcsYiMJ95tb24srdcPUTCG0C9zBvuMCC
 nm+1n5d3ENSEBaRNKQsC3MYcjKIh8VDEaKnntJrHAzHP41hrD+makrw3LVs6wLzu
 amGYz40XNq8zK2Xxv/N8O/U/PwQWKGj4bxq/2c1Wi9p9HACWfgk=
 =JbJO
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One more fix, for a space accounting bug in zoned mode. It happens
  when a block group is switched back rw->ro and unusable bytes (due to
  zoned constraints) are subtracted twice.

  It has user visible effects so I consider it important enough for late
  -rc inclusion and backport to stable"

* tag 'for-5.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix negative space_info->bytes_readonly
2021-06-18 16:39:03 -07:00
Matthew Wilcox (Oracle) 9620ad86d0 afs: Re-enable freezing once a page fault is interrupted
If a task is killed during a page fault, it does not currently call
sb_end_pagefault(), which means that the filesystem cannot be frozen
at any time thereafter.  This may be reported by lockdep like this:

====================================
WARNING: fsstress/10757 still has locks held!
5.13.0-rc4-build4+ #91 Not tainted
------------------------------------
1 lock held by fsstress/10757:
 #0: ffff888104eac530
 (
sb_pagefaults

as filesystem freezing is modelled as a lock.

Fix this by removing all the direct returns from within the function,
and using 'ret' to indicate whether we were interrupted or successful.

Fixes: 1cf7a1518a ("afs: Implement shared-writeable mmap")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210616154900.1958373-1-willy@infradead.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-18 13:49:07 -07:00
Zhihao Cheng 819f9ab430 ubifs: Remove ui_mutex in ubifs_xattr_get and change_xattr
Since ubifs_xattr_get and ubifs_xattr_set cannot being executed
parallelly after importing @host_ui->xattr_sem, now we can remove
ui_mutex imported by commit ab92a20bce ("ubifs: make
ubifs_[get|set]xattr atomic").

@xattr_size, @xattr_names and @xattr_cnt can't be out of protection
by @host_ui->mutex yet, they are sill accesed in other places, such as
pack_inode() called by ubifs_write_inode() triggered by page-writeback.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-06-18 22:04:47 +02:00
Zhihao Cheng f4e3634a3b ubifs: Fix races between xattr_{set|get} and listxattr operations
UBIFS may occur some problems with concurrent xattr_{set|get} and
listxattr operations, such as assertion failure, memory corruption,
stale xattr value[1].

Fix it by importing a new rw-lock in @ubifs_inode to serilize write
operations on xattr, concurrent read operations are still effective,
just like ext4.

[1] https://lore.kernel.org/linux-mtd/20200630130438.141649-1-houtao1@huawei.com

Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Cc: stable@vger.kernel.org  # v2.6+
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-06-18 22:04:47 +02:00
Dan Carpenter be076fdf83 ubifs: fix snprintf() checking
The snprintf() function returns the number of characters (not
counting the NUL terminator) that it would have printed if we
had space.

This buffer has UBIFS_DFS_DIR_LEN characters plus one extra for
the terminator.  Printing UBIFS_DFS_DIR_LEN is okay but anything
higher will result in truncation.  Thus the comparison needs to be
change from == to >.

These strings are compile time constants so this patch doesn't
affect runtime.

Fixes: ae380ce047 ("UBIFS: lessen the size of debugging info data structure")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alexander Dahl <ada@thorsis.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-06-18 22:04:47 +02:00
Zhen Lei a2c2a622d4 ubifs: journal: Fix error return code in ubifs_jnl_write_inode()
Fix to return a negative error code from the error handling case instead
of 0, as done elsewhere in this function.

Fixes: 9ca2d73264 ("ubifs: Limit number of xattrs per inode")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-06-18 22:04:47 +02:00
Miklos Szeredi b89ecd60d3 fuse: ignore PG_workingset after stealing
Fix the "fuse: trying to steal weird page" warning.

Description from Johannes Weiner:

  "Think of it as similar to PG_active. It's just another usage/heat
   indicator of file and anon pages on the reclaim LRU that, unlike
   PG_active, persists across deactivation and even reclaim (we store it in
   the page cache / swapper cache tree until the page refaults).

   So if fuse accepts pages that can legally have PG_active set,
   PG_workingset is fine too."

Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Fixes: 1899ad18c6 ("mm: workingset: tell cache transitions from workingset thrashing")
Cc: <stable@vger.kernel.org> # v4.20
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-18 21:16:42 +02:00
Dave Chinner a79b28c284 xfs: separate CIL commit record IO
To allow for iclog IO device cache flush behaviour to be optimised,
we first need to separate out the commit record iclog IO from the
rest of the checkpoint so we can wait for the checkpoint IO to
complete before we issue the commit record.

This separation is only necessary if the commit record is being
written into a different iclog to the start of the checkpoint as the
upcoming cache flushing changes requires completion ordering against
the other iclogs submitted by the checkpoint.

If the entire checkpoint and commit is in the one iclog, then they
are both covered by the one set of cache flush primitives on the
iclog and hence there is no need to separate them for ordering.

Otherwise, we need to wait for all the previous iclogs to complete
so they are ordered correctly and made stable by the REQ_PREFLUSH
that the commit record iclog IO issues. This guarantees that if a
reader sees the commit record in the journal, they will also see the
entire checkpoint that commit record closes off.

This also provides the guarantee that when the commit record IO
completes, we can safely unpin all the log items in the checkpoint
so they can be written back because the entire checkpoint is stable
in the journal.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-18 08:24:23 -07:00
Geert Uytterhoeven 18842e0a4f xfs: Fix 64-bit division on 32-bit in xlog_state_switch_iclogs()
On 32-bit (e.g. m68k):

    ERROR: modpost: "__udivdi3" [fs/xfs/xfs.ko] undefined!

Fix this by using a uint32_t intermediate, like before.

Reported-by: noreply@ellerman.id.au
Fixes: 7660a5b48fbef958 ("xfs: log stripe roundoff is a property of the log")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-18 08:24:19 -07:00
Pavel Begunkov 7a778f9dc3 io_uring: improve in tctx_task_work() resubmission
If task_state is cleared, io_req_task_work_add() will go the slow path
adding a task_work, setting the task_state, waking up the task and so
on. Not to mention it's expensive. tctx_task_work() first clears the
state and then executes all the work items queued, so if any of them
resubmits or adds new task_work items, it would unnecessarily go through
the slow path of io_req_task_work_add().

Let's clear the ->task_state at the end. We still have to check
->task_list for emptiness afterward to synchronise with
io_req_task_work_add(), do that, and set the state back if we're going
to retry, because clearing not-ours task_state on the next iteration
would be buggy.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ef72cdac7022adf0cd7ce4bfe3bb5c82a62eb93.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov 16f7207038 io_uring: don't resched with empty task_list
Entering tctx_task_work() with empty task_list is a strange scenario,
that can happen only on rare occasion during task exit, so let's not
check for task_list emptiness in advance and do it do-while style. The
code still correct for the empty case, just would do extra work about
which we don't care.

Do extra step and do the check before cond_resched(), so we don't
resched if have nothing to execute.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c4173e288e69793d03c7d7ce826f9d28afba718a.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov c6538be9e4 io_uring: refactor tctx task_work list splicing
We don't need a full copy of tctx->task_list in tctx_task_work(), but
only a first one, so just assign node directly.

Taking into account that task_works are run in a context of a task,
it's very unlikely to first see non-empty tctx->task_list and then
splice it empty, can only happen with task_work cancellations that is
not-normal slow path anyway. Hence, get rid of the check in the end,
it's there not for validity but "performance" purposes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d076c83fedb8253baf43acb23b8fafd7c5da1714.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov ebd0df2e63 io_uring: optimise task_work submit flushing
tctx_task_work() tries to fetch a next batch of requests, but before it
would flush completions from the previous batch that may be sub-optimal.
E.g. io_req_task_queue() executes a head of the link where all the
linked may be enqueued through the same io_req_task_queue(). And there
are more cases for that.

Do the flushing at the end, so it can cache completions of several waves
of a single tctx_task_work(), and do the flush at the very end.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3cac83934e4fbce520ff8025c3524398b3ae0270.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov 3f18407dc6 io_uring: inline __tctx_task_work()
Inline __tctx_task_work() into tctx_task_work() in preparation for
further optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f9c05c4bc9763af7bd8e25ebc3c5f7b6f69148f8.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov a3dbdf54da io_uring: refactor io_get_sequence()
Clean up io_get_sequence() and add a comment describing the magic around
sequence correction.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f55dc409936b8afa4698d24b8677a34d31077ccb.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov c854357bc1 io_uring: clean all flags in io_clean_op() at once
Clean all flags in io_clean_op() in the end in one operation, will save
us a couple of operation and binary size.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b8efe1f022a037f74e7fe497c69fb554d59bfeaf.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov 1dacb4df4e io_uring: simplify iovec freeing in io_clean_op()
We don't get REQ_F_NEED_CLEANUP for rw unless there is ->free_iovec set,
so remove the optimisation of NULL checking it inline, kfree() will take
care if that would ever be the case.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a233dc655d3d45bd4f69b73d55a61de46d914415.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov b8e64b5300 io_uring: track request creds with a flag
Currently, if req->creds is not NULL, then there are creds assigned.
Track the invariant with a new flag in req->flags. No need to clear the
field at init, and also cleanup can be efficiently moved into
io_clean_op().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5f8baeb8d3b909487f555542350e2eac97005556.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov c10d1f986b io_uring: move creds from io-wq work to io_kiocb
io-wq now doesn't have anything to do with creds now, so move ->creds
from struct io_wq_work into request (aka struct io_kiocb).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8520c72ab8b8f4b96db12a228a2ab4c094ae64e1.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov 2a2758f26d io_uring: refactor io_submit_flush_completions()
struct io_comp_state is always contained in struct io_ring_ctx, don't
pass them into io_submit_flush_completions() separately, it makes the
interface cleaner and simplifies it for the compiler.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/44d6ca57003a82484338e95197024dbd65a1b376.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov e6ab8991c5 io_uring: fix false WARN_ONCE
WARNING: CPU: 1 PID: 11749 at fs/io-wq.c:244 io_wqe_wake_worker fs/io-wq.c:244 [inline]
WARNING: CPU: 1 PID: 11749 at fs/io-wq.c:244 io_wqe_enqueue+0x7f6/0x910 fs/io-wq.c:751

A WARN_ON_ONCE() in io_wqe_wake_worker() can be triggered by a valid
userspace setup. Replace it with pr_warn.

Reported-by: syzbot+ea2f1484cffe5109dc10@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7ede342c3342c4c26668f5168e2993e38bbd99c.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Dave Chinner a6a65fef5e xfs: log stripe roundoff is a property of the log
We don't need to look at the xfs_mount and superblock every time we
need to do an iclog roundoff calculation. The property is fixed for
the life of the log, so store the roundoff in the log at mount time
and use that everywhere.

On a debug build:

$ size fs/xfs/xfs_log.o.*
   text	   data	    bss	    dec	    hex	filename
  27360	    560	      8	  27928	   6d18	fs/xfs/xfs_log.o.orig
  27219	    560	      8	  27787	   6c8b	fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-06-18 08:21:48 -07:00
Shaokun Zhang 9bb38aa080 xfs: remove redundant initialization of variable error
'error' will be initialized, so clean up the redundant initialization.

Cc: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-18 08:14:31 -07:00
Dave Chinner 90e2c1c20a xfs: perag may be null in xfs_imap()
Dan Carpenter's static checker reported:

The patch 7b13c5155182: "xfs: use perag for ialloc btree cursors"
from Jun 2, 2021, leads to the following Smatch complaint:

    fs/xfs/libxfs/xfs_ialloc.c:2403 xfs_imap()
    error: we previously assumed 'pag' could be null (see line 2294)

And it's right. Fix it.

Fixes: 7b13c51551 ("xfs: use perag for ialloc btree cursors")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-06-18 08:14:20 -07:00
Darrick J. Wong d1015e2ebd Merge tag 'xfs-delay-ready-attrs-v20.1' of https://github.com/allisonhenderson/xfs_work into xfs-5.14-merge4
xfs: Delay Ready Attributes

Hi all,

This set is a subset of a larger series for Dealyed Attributes. Which is a
subset of a yet larger series for parent pointers. Delayed attributes allow
attribute operations (set and remove) to be logged and committed in the same
way that other delayed operations do. This allows more complex operations (like
parent pointers) to be broken up into multiple smaller transactions. To do
this, the existing attr operations must be modified to operate as a delayed
operation.  This means that they cannot roll, commit, or finish transactions.
Instead, they return -EAGAIN to allow the calling function to handle the
transaction.  In this series, we focus on only the delayed attribute portion.
We will introduce parent pointers in a later set.

The set as a whole is a bit much to digest at once, so I usually send out the
smaller sub series to reduce reviewer burn out.  But the entire extended series
is visible through the included github links.

Updates since v19: Added Darricks fix for the remote block accounting as well as
some minor nits about the default assert in xfs_attr_set_iter.  Spent quite
a bit of time testing this cycle to weed out any more unexpected bugs.  No new
test failures were observed with the addition of this set.

xfs: Fix default ASSERT in xfs_attr_set_iter
  Replaced the assert with ASSERT(0);

xfs: Add delay ready attr remove routines
  Added Darricks fix for remote block accounting

This series can be viewed on github here:
https://github.com/allisonhenderson/xfs_work/tree/delay_ready_attrs_v20

As well as the extended delayed attribute and parent pointer series:
https://github.com/allisonhenderson/xfs_work/tree/delay_ready_attrs_v20_extended

And the test cases:
https://github.com/allisonhenderson/xfs_work/tree/pptr_xfstestsv3
In order to run the test cases, you will need have the corresponding xfsprogs

changes as well.  Which can be found here:
https://github.com/allisonhenderson/xfs_work/tree/delay_ready_attrs_xfsprogs_v20
https://github.com/allisonhenderson/xfs_work/tree/delay_ready_attrs_xfsprogs_v20_extended

To run the xfs attributes tests run:
check -g attr

To run as delayed attributes run:
export MOUNT_OPTIONS="-o delattr"
check -g attr

To run parent pointer tests:
check -g parent

I've also made the corresponding updates to the user space side as well, and ported anything
they need to seat correctly.

Questions, comment and feedback appreciated!

Thanks all!
Allison

* tag 'xfs-delay-ready-attrs-v20.1' of https://github.com/allisonhenderson/xfs_work:
  xfs: Make attr name schemes consistent
  xfs: Fix default ASSERT in xfs_attr_set_iter
  xfs: Clean up xfs_attr_node_addname_clear_incomplete
  xfs: Remove xfs_attr_rmtval_set
  xfs: Add delay ready attr set routines
  xfs: Add delay ready attr remove routines
  xfs: Hoist node transaction handling
  xfs: Hoist xfs_attr_leaf_addname
  xfs: Hoist xfs_attr_node_addname
  xfs: Add helper xfs_attr_node_addname_find_attr
  xfs: Separate xfs_attr_node_addname and xfs_attr_node_addname_clear_incomplete
  xfs: Refactor xfs_attr_set_shortform
  xfs: Add xfs_attr_node_remove_name
  xfs: Reverse apply 72b97ea40d
2021-06-18 08:13:22 -07:00
Peter Zijlstra 2f064a59a1 sched: Change task_struct::state
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18 11:43:09 +02:00
Ingo Molnar b2c0931a07 Merge branch 'sched/urgent' into sched/core, to resolve conflicts
This commit in sched/urgent moved the cfs_rq_is_decayed() function:

  a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")

and this fresh commit in sched/core modified it in the old location:

  9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are")

Merge the two variants.

Conflicts:
	kernel/sched/fair.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-18 11:31:25 +02:00
Linus Torvalds 39519f6a56 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmDLL74ACgkQnJ2qBz9k
 QNleSAf/XikH+tsM6K9yDEeU93GGSqKUB71n9clSQBIiGZ7/UliG0wotrUjec9Rg
 vBTZlh3JEdfboeBei+mG3hmOdAoYK4HMsJJikqRGPyWOTujh1eOZlT1LOXaY5zNM
 631A9pWe8edlpr4Mq7Wb4nO4FToEZ91iXDLliFF371aV8kP/yuv5ZjHwIn5Pt5gI
 DPnWwaJ+meW9KZ4gVKAfvZLVkKFat2xJ9r2LDpqbIkH9SBcfjBmeHOy0gFyCKx6l
 yma5iANgtWLhesP6ZwSeaRb1+T9altSLCCFZrYdKH9PXTMFUqzrbiZ8tfVmllePZ
 GaUOWcHYiLmvqvXnaAREiHnMFT6prg==
 =kevs
 -----END PGP SIGNATURE-----

Merge tag 'fixes_for_v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull quota and fanotify fixes from Jan Kara:
 "A fixup finishing disabling of quotactl_path() syscall (I've missed
  archs using different way to declare syscalls) and a fix of an fd leak
  in error handling path of fanotify"

* tag 'fixes_for_v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: finish disable quotactl_path syscall
  fanotify: fix copy_event_to_user() fid error clean up
2021-06-17 09:49:48 -07:00
Jens Axboe fe76421d1d io_uring: allow user configurable IO thread CPU affinity
io-wq defaults to per-node masks for IO workers. This works fine by
default, but isn't particularly handy for workloads that prefer more
specific affinities, for either performance or isolation reasons.

This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU
mask that is then applied to IO thread workers, and an
IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the
default of per-node.

Note that no care is given to existing IO threads, they will need to go
through a reschedule before the affinity is correct if they are already
running or sleeping.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-17 10:25:50 -06:00
Jens Axboe 0e03496d19 io-wq: use private CPU mask
In preparation for allowing user specific CPU masks for IO thread
creation, switch to using a mask embedded in the per-node wqe
structure.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-17 10:08:11 -06:00
Colin Ian King e8d46b3841 isofs: remove redundant continue statement
The continue statement in the while-loop has no effect,
remove it.

Addresses-Coverity: ("Continue has no effect")
Link: https://lore.kernel.org/r/20210617120837.11994-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-06-17 17:11:42 +02:00
Yang Yingliang 8f6840c4fd ext4: return error code when ext4_fill_flex_info() fails
After commit c89128a008 ("ext4: handle errors on
ext4_commit_super"), 'ret' may be set to 0 before calling
ext4_fill_flex_info(), if ext4_fill_flex_info() fails ext4_mount()
doesn't return error code, it makes 'root' is null which causes crash
in legacy_get_tree().

Fixes: c89128a008 ("ext4: handle errors on ext4_commit_super")
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: <stable@vger.kernel.org> # v4.18+
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20210510111051.55650-1-yangyingliang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-17 10:53:20 -04:00
Zhang Yi b9a037b7f3 ext4: cleanup in-core orphan list if ext4_truncate() failed to get a transaction handle
In ext4_orphan_cleanup(), if ext4_truncate() failed to get a transaction
handle, it didn't remove the inode from the in-core orphan list, which
may probably trigger below error dump in ext4_destroy_inode() during the
final iput() and could lead to memory corruption on the later orphan
list changes.

 EXT4-fs (sda): Inode 6291467 (00000000b8247c67): orphan list check failed!
 00000000b8247c67: 0001f30a 00000004 00000000 00000023  ............#...
 00000000e24cde71: 00000006 014082a3 00000000 00000000  ......@.........
 0000000072c6a5ee: 00000000 00000000 00000000 00000000  ................
 ...

This patch fix this by cleanup in-core orphan list manually if
ext4_truncate() return error.

Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210507071904.160808-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-17 10:53:19 -04:00
Anirudh Rayabharam ce3aba4359 ext4: fix kernel infoleak via ext4_extent_header
Initialize eh_generation of struct ext4_extent_header to prevent leaking
info to userspace. Fixes KMSAN kernel-infoleak bug reported by syzbot at:
http://syzkaller.appspot.com/bug?id=78e9ad0e6952a3ca16e8234724b2fa92d041b9b8

Cc: stable@kernel.org
Reported-by: syzbot+2dcfeaf8cb49b05e8f1a@syzkaller.appspotmail.com
Fixes: a86c618126 ("[PATCH] ext3: add extent map support")
Signed-off-by: Anirudh Rayabharam <mail@anirudhrb.com>
Link: https://lore.kernel.org/r/20210506185655.7118-1-mail@anirudhrb.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-17 10:53:19 -04:00
Pavel Skripkin 618f003199 ext4: fix memory leak in ext4_fill_super
static int kthread(void *_create) will return -ENOMEM
or -EINTR in case of internal failure or
kthread_stop() call happens before threadfn call.

To prevent fancy error checking and make code
more straightforward we moved all cleanup code out
of kmmpd threadfn.

Also, dropped struct mmpd_data at all. Now struct super_block
is a threadfn data and struct buffer_head embedded into
struct ext4_sb_info.

Reported-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Link: https://lore.kernel.org/r/20210430185046.15742-1-paskripkin@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-17 10:53:19 -04:00
Jiapeng Chong 1fc57ca5a2 ext4: remove redundant assignment to error
Variable error is set to zero but this value is never read as it's not
used later on, hence it is a redundant assignment and can be removed.

Cleans up the following clang-analyzer warning:

fs/ext4/ioctl.c:657:3: warning: Value stored to 'error' is never read
[clang-analyzer-deadcode.DeadStores].

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1619691409-83160-1-git-send-email-jiapeng.chong@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-17 10:53:19 -04:00
Joseph Qi 5c680150d7 ext4: remove redundant check buffer_uptodate()
Now set_buffer_uptodate() will test first and then set, so we don't have
to check buffer_uptodate() first, remove it to simplify code.

Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://lore.kernel.org/r/1619418587-5580-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-17 10:53:19 -04:00
Jan Kara d0b040f5f2 ext4: fix overflow in ext4_iomap_alloc()
A code in iomap alloc may overflow block number when converting it to
byte offset. Luckily this is mostly harmless as we will just use more
expensive method of writing using unwritten extents even though we are
writing beyond i_size.

Cc: stable@kernel.org
Fixes: 378f32bab3 ("ext4: introduce direct I/O write using iomap infrastructure")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210412102333.2676-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-17 10:53:19 -04:00
Naohiro Aota f9f28e5bd0 btrfs: zoned: fix negative space_info->bytes_readonly
Consider we have a using block group on zoned btrfs.

|<- ZU ->|<- used ->|<---free--->|
                     `- Alloc offset
ZU: Zone unusable

Marking the block group read-only will migrate the zone unusable bytes
to the read-only bytes. So, we will have this.

|<- RO ->|<- used ->|<--- RO --->|

RO: Read only

When marking it back to read-write, btrfs_dec_block_group_ro()
subtracts the above "RO" bytes from the
space_info->bytes_readonly. And, it moves the zone unusable bytes back
and again subtracts those bytes from the space_info->bytes_readonly,
leading to negative bytes_readonly.

This can be observed in the output as eg.:

  Data, single: total=512.00MiB, used=165.21MiB, zone_unusable=16.00EiB
  Data, single: total=536870912, used=173256704, zone_unusable=18446744073603186688

This commit fixes the issue by reordering the operations.

Link: https://github.com/naota/linux/issues/37
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-17 11:12:14 +02:00
Kees Cook 1d1f6cc581 pstore/blk: Include zone in pstore_device_info
Information was redundant between struct pstore_zone_info and struct
pstore_device_info. Use struct pstore_zone_info, with member name "zone".

Additionally untangle the logic for the "best effort" block device
instance.

Signed-off-by: Kees Cook <keescook@chromium.org>
Fixed-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/lkml/20210617005424.182305-1-pulehui@huawei.com
2021-06-16 21:09:31 -07:00
Kees Cook c811659bb9 pstore/blk: Fix kerndoc and redundancy on blkdev param
Remove redundant details of blkdev and fix up resulting kerndoc.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
2021-06-16 09:27:32 -07:00
Kees Cook 7bb9557b48 pstore/blk: Use the normal block device I/O path
Stop poking into block layer internals and just open the block device
file an use kernel_read and kernel_write on it. Note that this means
the transformation from name_to_dev_t can't be used anymore when
pstore_blk is loaded as a module: a full filesystem device path name
must be used instead. Additionally removes ":internal:" kerndoc link,
since no such documentation remains.

Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
2021-06-16 09:26:56 -07:00
Mike Kravetz 846be08578 mm/hugetlb: expand restore_reserve_on_error functionality
The routine restore_reserve_on_error is called to restore reservation
information when an error occurs after page allocation.  The routine
alloc_huge_page modifies the mapping reserve map and potentially the
reserve count during allocation.  If code calling alloc_huge_page
encounters an error after allocation and needs to free the page, the
reservation information needs to be adjusted.

Currently, restore_reserve_on_error only takes action on pages for which
the reserve count was adjusted(HPageRestoreReserve flag).  There is
nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
modifies the reserve map during allocation even if the reserve count is
not adjusted.  This can cause issues as observed during development of
this patch [1].

One specific series of operations causing an issue is:

 - Create a shared hugetlb mapping
   Reservations for all pages created by default

 - Fault in a page in the mapping
   Reservation exists so reservation count is decremented

 - Punch a hole in the file/mapping at index previously faulted
   Reservation and any associated pages will be removed

 - Allocate a page to fill the hole
   No reservation entry, so reserve count unmodified
   Reservation entry added to map by alloc_huge_page

 - Error after allocation and before instantiating the page
   Reservation entry remains in map

 - Allocate a page to fill the hole
   Reservation entry exists, so decrement reservation count

This will cause a reservation count underflow as the reservation count
was decremented twice for the same index.

A user would observe a very large number for HugePages_Rsvd in
/proc/meminfo.  This would also likely cause subsequent allocations of
hugetlb pages to fail as it would 'appear' that all pages are reserved.

This sequence of operations is unlikely to happen, however they were
easily reproduced and observed using hacked up code as described in [1].

Address the issue by having the routine restore_reserve_on_error take
action on pages where HPageRestoreReserve is not set.  In this case, we
need to remove any reserve map entry created by alloc_huge_page.  A new
helper routine vma_del_reservation assists with this operation.

There are three callers of alloc_huge_page which do not currently call
restore_reserve_on error before freeing a page on error paths.  Add
those missing calls.

[1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/

Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
Fixes: 96b96a96dd ("mm/hugetlb: fix huge page reservation leak in private mapping error paths"
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Kees Cook 2a03ddbde1 pstore/blk: Move verify_size() macro out of function
There's no good reason for the verify_size macro to live inside the
function. Move it up with the check_size() macro and fix indenting.

Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2021-06-16 08:19:40 -07:00
Kees Cook 6eed261f48 pstore/blk: Improve failure reporting
There was no feedback on bad registration attempts. Add details on the
failure cause.

Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2021-06-16 08:19:37 -07:00
Olivier Langlois ec16d35b6c io-wq: remove header files not needed anymore
mm related header files are not needed for io-wq module.
remove them for a small clean-up.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16 06:41:49 -06:00
Olivier Langlois 236daeae36 io_uring: Add to traces the req pointer when available
The req pointer uniquely identify a specific request.
Having it in traces can provide valuable insights that is not possible
to have if the calling process is reusing the same user_data value.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16 06:41:46 -06:00
Pavel Begunkov 2335f6f5dd io_uring: optimise io_commit_cqring()
In most cases io_commit_cqring() is just an smp_store_release(), and
it's hot enough, especially for IRQ rw, to want it to save on a function
call. Mark it inline and extract a non-inlined slow path doing drain
and timeout flushing. The inlined part is pretty slim to not cause
binary bloating.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7350f8b6b92caa50a48a80be39909f0d83eddd93.1623772051.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:44:34 -06:00
Pavel Begunkov 3c19966d37 io_uring: shove more drain bits out of hot path
Place all drain_next logic into io_drain_req(), so it's never executed
if there was no drained requests before. The only thing we need is to
set ->drain_active if we see a request with IOSQE_IO_DRAIN, do that in
io_init_req() where flags are definitely in registers.

Also, all drain-related code is encapsulated in io_drain_req(), makes it
cleaner.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/68bf4f7395ddaafbf1a26bd97b57d57d45a9f900.1623772051.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:44:34 -06:00
Pavel Begunkov 10c669040e io_uring: switch !DRAIN fast path when possible
->drain_used is one way, which is not optimal if users use DRAIN but
very rarely. However, we can just clear it in io_drain_req() when all
drained before requests are gone. Also rename the flag to reflect the
change and be more clear about it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7f37a240857546a94df6348507edddacab150460.1623772051.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:44:33 -06:00
Pavel Begunkov 27f6b318de io_uring: fix min types mismatch in table alloc
fs/io_uring.c: In function 'io_alloc_page_table':
include/linux/minmax.h:20:28: warning: comparison of distinct pointer
	types lacks a cast

Cast everything to size_t using min_t.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 9123c8ffce ("io_uring: add helpers for 2 level table alloc")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/50f420a956bca070a43810d4a805293ed54f39d8.1623759527.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:40:17 -06:00
Fam Zheng dd9ae8a0b2 io_uring: Fix comment of io_get_sqe
The sqe_ptr argument has been gone since 709b302fad (io_uring:
simplify io_get_sqring, 2020-04-08), made the return value of the
function. Update the comment accordingly.

Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210604164256.12242-1-fam.zheng@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:39:16 -06:00
Pavel Begunkov 441b8a7803 io_uring: optimise non-drain path
Replace drain checks with one-way flag set upon seeing the first
IOSQE_IO_DRAIN request. There are several places where it cuts cycles
well:

1) It's much faster than the fast check with two
conditions in io_drain_req() including pretty complex
list_empty_careful().

2) We can mark io_queue_sqe() inline now, that's a huge win.

3) It replaces timeout and drain checks in io_commit_cqring() with a
single flags test. Also great not touching ->defer_list there without a
reason so limiting cache bouncing.

It adds a small amount of overhead to drain path, but it's negligible.
The main nuisance is that once it meets any DRAIN request in io_uring
instance lifetime it will _always_ go through a slower path, so
drain-less and offset-mode timeout less applications are preferable.
The overhead in that case would be not big, but it's worth to bear in
mind.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/98d2fff8c4da5144bb0d08499f591d4768128ea3.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov 76cc33d791 io_uring: refactor io_req_defer()
Rename io_req_defer() into io_drain_req() and refactor it uncoupling it
from io_queue_sqe() error handling and preparing for coming
optimisations. Also, prioritise non IOSQE_ASYNC path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4f17dd56e7fbe52d1866f8acd8efe3284d2bebcb.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov 0499e582aa io_uring: move uring_lock location
->uring_lock is prevalently used for submission, even though it protects
many other things like iopoll, registeration, selected bufs, and more.
And it's placed together with ->cq_wait poked on completion and CQ
waiting sides. Move them apart, ->uring_lock goes to the submission
data, and cq_wait to completion related chunk. The last one requires
some reshuffling so everything needed by io_cqring_ev_posted*() is in
one cacheline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dea5e845caee4c98aa0922b46d713154d81f7bd8.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov 311997b3fc io_uring: wait heads renaming
We use several wait_queue_head's for different purposes, but namings are
confusing. First rename ctx->cq_wait into ctx->poll_wait, because this
one is used for polling an io_uring instance. Then rename ctx->wait into
ctx->cq_wait, which is responsible for CQE waiting.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/47b97a097780c86c67b20b6ccc4e077523dce682.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov 5ed7a37d21 io_uring: clean up check_overflow flag
There are no users of ->sq_check_overflow, only ->cq_check_overflow is
used. Combine it and move out of completion related part of struct
io_ring_ctx.

A not so obvious benefit of it is fitting all completion side fields
into a single cacheline. It was taking 2 lines before with 56B padding,
and io_cqring_ev_posted*() were still touching both of them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/25927394964df31d113e3c729416af573afff5f5.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov 5e159204d7 io_uring: small io_submit_sqe() optimisation
submit_state.link is used only to assemble a link and not used for
actual submission, so clear it before io_queue_sqe() in io_submit_sqe(),
awhile it's hot and in caches and queueing doesn't spoil it. May also
potentially help compiler with spilling or to do other optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1579939426f3ad6b55af3005b1389bbbed7d780d.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov f18ee4cf0a io_uring: optimise completion timeout flushing
io_commit_cqring() might be very hot and we definitely don't want to
touch ->timeout_list there, because 1) it's shared with the submission
side so might lead to cache bouncing and 2) may need to load an extra
cache line, especially for IRQ completions.

We're interested in it at the completion side only when there are
offset-mode timeouts, which are not so popular. Replace
list_empty(->timeout_list) hot path check with a new one-way flag, which
is set when we prepare the first offset-mode timeout.

note: the flag sits in the same line as briefly used after ->rings

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4892ec68b71a69f92ffbea4a1499be3ec0d463b.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov 15641e4270 io_uring: don't cache number of dropped SQEs
Kill ->cached_sq_dropped and wire DRAIN sequence number correction via
->cq_extra, which is there exactly for that purpose. User visible
dropped counter will be populated by incrementing it instead of keeping
a copy, similarly as it was done not so long ago with cq_overflow.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/088aceb2707a534d531e2770267c4498e0507cc1.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov 17d3aeb33c io_uring: refactor io_get_sqe()
The line of io_get_sqe() evaluating @head consists of too many
operations including READ_ONCE(), it's not convenient for probing.
Refactor it also improving readability.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/866ad6e4ef4851c7c61f6b0e08dbd0a8d1abce84.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:39 -06:00
Pavel Begunkov 7f1129d227 io_uring: shuffle more fields into SQ ctx section
Since moving locked_free_* out of struct io_submit_state
ctx->submit_state is accessed on submission side only, so move it into
the submission section. Same goes for rsrc table pointers/nodes/etc.,
they must be taken and checked during submission because sync'ed by
uring_lock, so move them there as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8a5899a50afc6ccca63249e716f580b246f3dec6.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:39 -06:00
Pavel Begunkov b52ecf8cb5 io_uring: move ctx->flags from SQ cacheline
ctx->flags are heavily used by both, completion and submission sides, so
move it out from the ctx fields related to submissions. Instead, place
it together with ctx->refs, because it's already cacheline-aligned and
so pads lots of space, and both almost never change. Also, in most
occasions they are accessed together as refs are taken at submission
time and put back during completion.

Do same with ctx->rings, where the pointer itself is never modified
apart from ring init/free.

Note: in percpu mode, struct percpu_ref doesn't modify the struct itself
but takes indirection with ref->percpu_count_ptr.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4c48c173e63d35591383ba2b87e8b8e8dfdbd23d.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:39 -06:00
Pavel Begunkov c7af47cf0f io_uring: keep SQ pointers in a single cacheline
sq_array and sq_sqes are always used together, however they are in
different cachelines, where the borderline is right before
cq_overflow_list is rather rarely touched. Move the fields together so
it loads only one cacheline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3ef2411a94874da06492506a8897eff679244f49.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:39 -06:00
Colin Ian King b1b2fc3574 io-wq: remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never read, the
assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210615143424.60449-1-colin.king@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:37:51 -06:00
Colin Ian King fdd1dc316e io_uring: Fix incorrect sizeof operator for copy_from_user call
Static analysis is warning that the sizeof being used is should be
of *data->tags[i] and not data->tags[i]. Although these are the same
size on 64 bit systems it is not a portable assumption to assume
this is true for all cases.  Fix this by using a temporary pointer
tag_slot to make the code a clearer.

Addresses-Coverity: ("Sizeof not portable")
Fixes: d878c81610 ("io_uring: hide rsrc tag copy into generic helpers")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210615130011.57387-1-colin.king@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:37:11 -06:00
Linus Torvalds 94f0b2d4a1 proc: only require mm_struct for writing
Commit 591a22c14d ("proc: Track /proc/$pid/attr/ opener mm_struct") we
started using __mem_open() to track the mm_struct at open-time, so that
we could then check it for writes.

But that also ended up making the permission checks at open time much
stricter - and not just for writes, but for reads too.  And that in turn
caused a regression for at least Fedora 29, where NIC interfaces fail to
start when using NetworkManager.

Since only the write side wanted the mm_struct test, ignore any failures
by __mem_open() at open time, leaving reads unaffected.  The write()
time verification of the mm_struct pointer will then catch the failure
case because a NULL pointer will not match a valid 'current->mm'.

Link: https://lore.kernel.org/netdev/YMjTlp2FSJYvoyFa@unreal/
Fixes: 591a22c14d ("proc: Track /proc/$pid/attr/ opener mm_struct")
Reported-and-tested-by: Leon Romanovsky <leon@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-15 10:47:51 -07:00
Ian Kent d826e03651 kernfs: move revalidate to be near lookup
While the dentry operation kernfs_dop_revalidate() is grouped with
dentry type functions it also has a strong affinity to the inode
operation ->lookup().

It makes sense to locate this function near to kernfs_iop_lookup()
because we will be adding VFS negative dentry caching to reduce path
lookup overhead for non-existent paths.

There's no functional change from this patch.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/162375275365.232295.8995526416263659926.stgit@web.messagingengine.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-15 17:04:32 +02:00
Dan Carpenter a33d62662d afs: Fix an IS_ERR() vs NULL check
The proc_symlink() function returns NULL on error, it doesn't return
error pointers.

Fixes: 5b86d4ff5d ("afs: Implement network namespacing")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YLjMRKX40pTrJvgf@mwanda/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-15 07:42:26 -07:00
Christoph Hellwig d07f3b081e mark pstore-blk as broken
pstore-blk just pokes directly into the pagecache for the block
device without going through the file operations for that by faking
up it's own file operations that do not match the block device ones.

As this breaks the control of the block layer of it's page cache,
and even now just works by accident only the best thing is to just
disable this driver.

Fixes: 17639f67c1 ("pstore/blk: Introduce backend for block devices")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210608161327.1537919-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:26:03 -06:00
Pavel Begunkov aeab9506ef io_uring: inline io_iter_do_read()
There are only two calls in source code of io_iter_do_read(), the
function is small and pretty hot though is failed to get inlined.
Makr it as inline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/25a26dae7660da73fbc2244b361b397ef43d3caf.1623634182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov 78cc687be9 io_uring: unify SQPOLL and user task cancellations
Merge io_uring_cancel_sqpoll() and __io_uring_cancel() as it's easier to
have a conditional ctx traverse inside than keeping them in sync.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/adfe24d6dad4a3883a40eee54352b8b65ac851bb.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov 09899b1915 io_uring: cache task struct refs
tctx in submission part is always synchronised because is executed from
the task's context, so we can batch allocate tctx/task references and
store them across syscall boundaries. It avoids enough of operations,
including an atomic for getting task ref and a percpu_counter_add()
function call, which still fallback to spinlock for large batching
cases (around >=32). Should be good for SQPOLL submitting in small
portions and coming at some moment bpf submissions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/14b327b973410a3eec1f702ecf650e100513aca9.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov 2d091d62b1 io_uring: don't vmalloc rsrc tags
We don't really need vmalloc for keeping tags, it's not a hot path and
is there out of convenience, so replace it with two level tables to not
litter kernel virtual memory mappings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/241a3422747113a8909e7e1030eb585d4a349e0d.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov 9123c8ffce io_uring: add helpers for 2 level table alloc
Some parts like fixed file table use 2 level tables, factor out helpers
for allocating/deallocating them as more users are to come.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1709212359cd82eb416d395f86fc78431ccfc0aa.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov 157d257f99 io_uring: remove rsrc put work irq save/restore
io_rsrc_put_work() is executed by workqueue in non-irq context, so no
need for irqsave/restore variants of spinlocking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a7f77220735f4ad404ac885b4d73bdf42d2f836.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov d878c81610 io_uring: hide rsrc tag copy into generic helpers
Make io_rsrc_data_alloc() taking care of rsrc tags loading on
registration, so we don't need to repeat it for each new rsrc type.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5609680697bd09735de10561b75edb95283459da.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov e587227b68 io-wq: simplify worker exiting
io_worker_handle_work() already takes care of the empty list case and
releases spinlock, so get rid of ugly conditional unlocking and
unconditionally call handle_work()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7521e485677f381036676943e876a0afecc23017.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov 769e683715 io-wq: don't repeat IO_WQ_BIT_EXIT check by worker
io_wqe_worker()'s main loop does check IO_WQ_BIT_EXIT flag, so no need
for a second test_bit at the end as it will immediately jump to the
first check afterwards.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6af4a51c86523a527fb5417c9fbc775c4b26497.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov eef51daa72 io_uring: rename function *task_file
What at some moment was references to struct file used to control
lifetimes of task/ctx is now just internal tctx structures/nodes,
so rename outdated *task_file() routines into something more sensible.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e2fbce42932154c2631ce58ffbffaa232afe18d5.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:12 -06:00
Pavel Begunkov cb3d8972c7 io_uring: refactor io_iopoll_req_issued
A simple refactoring of io_iopoll_req_issued(), move in_async inside so
we don't pass it around and save on double checking it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1513bfde4f0c835be25ac69a82737ab0668d7665.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:12 -06:00
Pavel Begunkov 382cb03046 io-wq: remove unused io-wq refcounting
iowq->refs is initialised to one and killed on exit, so it's not used
and we can kill it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/401007393528ea7c102360e69a29b64498e15db2.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:12 -06:00
Pavel Begunkov c7f405d6fa io-wq: embed wqe ptr array into struct io_wq
io-wq keeps an array of pointers to struct io_wqe, allocate this array
as a part of struct io-wq, it's easier to code and saves an extra
indirection for nearly each io-wq call.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1482c6a001923bbed662dc38a8a580fb08b1ed8c.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:12 -06:00
Pavel Begunkov 976517f162 io_uring: fix blocking inline submission
There is a complaint against sys_io_uring_enter() blocking if it submits
stdin reads. The problem is in __io_file_supports_async(), which
sees that it's a cdev and allows it to be processed inline.

Punt char devices using generic rules of io_file_supports_async(),
including checking for presence of *_iter() versions of rw callbacks.
Apparently, it will affect most of cdevs with some exceptions like
null and zero devices.

Cc: stable@vger.kernel.org
Reported-by: Birk Hirdman <lonjil@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d60270856b8a4560a639ef5f76e55eb563633599.1623236455.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov 40dad765c0 io_uring: enable shmem/memfd memory registration
Relax buffer registration restictions, which filters out file backed
memory, and allow shmem/memfd as they have normal anonymous pages
underneath.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov d0acdee296 io_uring: don't bounce submit_state cachelines
struct io_submit_state contains struct io_comp_state and so
locked_free_*, that renders cachelines around ->locked_free* being
invalidated on most non-inline completions, that may terrorise caches if
submissions and completions are done by different tasks.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/290cb5412b76892e8631978ee8ab9db0c6290dd5.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov d068b5068d io_uring: rename io_get_cqring
Rename io_get_cqring() into io_get_cqe() for consistency with SQ, and
just because the old name is not as clear.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a46a53e3f781de372f5632c184e61546b86515ce.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov 8f6ed49a44 io_uring: kill cached_cq_overflow
There are two copies of cq_overflow, shared with userspace and internal
cached one. It was needed for DRAIN accounting, but now we have yet
another knob to tune the accounting, i.e. cq_extra, and we can throw
away the internal counter and just increment the one in the shared ring.

If user modifies it as so never gets the right overflow value ever
again, it's its problem, even though before we would have restored it
back by next overflow.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8427965f5175dd051febc63804909861109ce859.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov ea5ab3b579 io_uring: deduce cq_mask from cq_entries
No need to cache cq_mask, it's exactly cq_entries - 1, so just deduce
it to not carry it around.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d439efad0503c8398451dae075e68a04362fbc8d.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov a566c5562d io_uring: remove dependency on ring->sq/cq_entries
We have numbers of {sq,cq} entries cached in ctx, don't look up them in
user-shared rings as 1) it may fetch additional cacheline 2) user may
change it and so it's always error prone.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/745d31bc2da41283ddd0489ef784af5c8d6310e9.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov b13a8918d3 io_uring: better locality for rsrc fields
ring has two types of resource-related fields: used for request
submission, and field needed for update/registration. Reshuffle them
into these two groups for better locality and readability. The second
group is not in the hot path, so it's natural to place them somewhere in
the end. Also update an outdated comment.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/05b34795bb4440f4ec4510f08abd5a31830f8ca0.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov b986af7e2d io_uring: shuffle rarely used ctx fields
There is a bunch of scattered around ctx fields that are almost never
used, e.g. only on ring exit, plunge them to the end, better locality,
better aesthetically.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/782ff94b00355923eae757d58b1a47821b5b46d4.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov 93d2bcd2cb io_uring: make fail flag not link specific
The main difference is in req_set_fail_links() renamed into
req_set_fail(), which now sets REQ_F_FAIL_LINK/REQ_F_FAIL flag
unconditional on whether it has been a link or not. It only matters in
io_disarm_next(), which already handles it well, and all calls to it
have a fast path checking REQ_F_LINK/HARDLINK.

It looks cleaner, and sheds binary size
   text    data     bss     dec     hex filename
  84235   12390       8   96633   17979 ./fs/io_uring.o
  84151   12414       8   96573   1793d ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e2224154dd6e53b665ac835d29436b177872fa10.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov 3dd0c97a9e io_uring: get rid of files in exit cancel
We don't match against files on cancellation anymore, so no need to drag
around files_struct anymore, just pass a flag telling whether only
inflight or all requests should be killed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7bfc5409a78f8e2d6b27dec3293ec2d248677348.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov acfb381d9d io_uring: simplify waking sqo_sq_wait
Going through submission in __io_sq_thread() and still having a full SQ
is rather unexpected, so remove a check for SQ fullness and just wake up
whoever wait on sqo_sq_wait. Also skip if it doesn't do submission in
the first place, likely may to happen for SQPOLL sharing and/or IOPOLL.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e2e91751e87b1a39f8d63ef884aaff578123f61e.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov 21f2fc080f io_uring: remove unused park_task_work
As sqpoll cancel via task_work is killed, remove everything related to
park_task_work as it's not used anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/310d8b76a2fbbf3e139373500e04ad9af7ee3dbb.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov aaa9f0f481 io_uring: improve sq_thread waiting check
If SQPOLL task finds a ring requesting it to continue running, no need
to set wake flag to rest of the rings as it will be cleared in a moment
anyway, so hide it in a single sqd->ctx_list loop.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ee5a696d9fd08645994c58ee147d149a8957d94.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov e4b6d902a9 io_uring: improve sqpoll event/state handling
As sqd->state changes rarely, don't check every event one by one but
look them all at once. Add a helper function. Also don't go into event
waiting sleeping with STOP flag set.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/645025f95c7eeec97f88ff497785f4f1d6f3966f.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Matthew Bobrowski f644bc449b fanotify: fix copy_event_to_user() fid error clean up
Ensure that clean up is performed on the allocated file descriptor and
struct file object in the event that an error is encountered while copying
fid info objects. Currently, we return directly to the caller when an error
is experienced in the fid info copying helper, which isn't ideal given that
the listener process could be left with a dangling file descriptor in their
fdtable.

Fixes: 5e469c830f ("fanotify: copy event fid info to user")
Fixes: 44d705b037 ("fanotify: report name info for FAN_DIR_MODIFY event")
Link: https://lore.kernel.org/linux-fsdevel/YMKv1U7tNPK955ho@google.com/T/#m15361cd6399dad4396aad650de25dbf6b312288e
Link: https://lore.kernel.org/r/1ef8ae9100101eb1a91763c516c2e9a3a3b112bd.1623376346.git.repnop@google.com
Signed-off-by: Matthew Bobrowski <repnop@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-06-14 12:16:37 +02:00
Greg Kroah-Hartman 68afbd8459 Linux 5.13-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmDGe+4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/IUH/iyHVulAtAhL9bnR
 qL4M1kWfcG1sKS2TzGRZzo6YiUABf89vFP90r4sKxG3AKrb8YkTwmJr8B/sWwcsv
 PpKkXXTobbDfpSrsXGEapBkQOE7h2w739XeXyBLRPkoCR4UrEFn68TV2rLjMLBPS
 /EIZkonXLWzzWalgKDP4wSJ7GaQxi3LMx3dGAvbFArEGZ1mPHNlgWy2VokFY/yBf
 qh1EZ5rugysc78JCpTqfTf3fUPK2idQW5gtHSMbyESrWwJ/3XXL9o1ET3JWURYf1
 b0FgVztzddwgULoIGWLxDH5WWts3l54sjBLj0yrLUlnGKA5FjrZb12g9PdhdywuY
 /8KfjeE=
 =JfJm
 -----END PGP SIGNATURE-----

Merge tag 'v5.13-rc6' into driver-core-next

We need the driver core fix in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-14 09:07:45 +02:00
Greg Kroah-Hartman db4e54aefd Linux 5.13-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmDGe+4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/IUH/iyHVulAtAhL9bnR
 qL4M1kWfcG1sKS2TzGRZzo6YiUABf89vFP90r4sKxG3AKrb8YkTwmJr8B/sWwcsv
 PpKkXXTobbDfpSrsXGEapBkQOE7h2w739XeXyBLRPkoCR4UrEFn68TV2rLjMLBPS
 /EIZkonXLWzzWalgKDP4wSJ7GaQxi3LMx3dGAvbFArEGZ1mPHNlgWy2VokFY/yBf
 qh1EZ5rugysc78JCpTqfTf3fUPK2idQW5gtHSMbyESrWwJ/3XXL9o1ET3JWURYf1
 b0FgVztzddwgULoIGWLxDH5WWts3l54sjBLj0yrLUlnGKA5FjrZb12g9PdhdywuY
 /8KfjeE=
 =JfJm
 -----END PGP SIGNATURE-----

Merge tag 'v5.13-rc6' into char-misc-next

We need the fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-14 08:59:06 +02:00
Trond Myklebust 3731d44bba NFSv4: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECT
Fix an Oopsable condition in pnfs_mark_request_commit() when we're
putting a set of writes on the commit list to reschedule them after a
failed pNFS attempt.

Fixes: 9c455a8c1e ("NFS/pNFS: Clean up pNFS commit operations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-13 19:36:49 -04:00
Trond Myklebust dd99e9f98f NFSv4: Initialise connection to the server in nfs4_alloc_client()
Set up the connection to the NFSv4 server in nfs4_alloc_client(), before
we've added the struct nfs_client to the net-namespace's nfs_client_list
so that a downed server won't cause other mounts to hang in the trunking
detection code.

Reported-by: Michael Wakabayashi <mwakabayashi@vmware.com>
Fixes: 5c6e5b60aa ("NFS: Fix an Oops in the pNFS files and flexfiles connection setup to the DS")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-13 19:36:49 -04:00
Trond Myklebust e93a5e9306 NFSv4: Add support for application leases underpinned by a delegation
If the NFSv4 client already holds a delegation for a file, then we can
support application leases (i.e. fcntl(fd, F_SETLEASE,...)) because the
underlying delegation guarantees that the file is not being modified on
the server by another client in a way that might conflict with the lease
guarantees.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-13 19:36:28 -04:00
Trond Myklebust 6b4befc0a0 NFSv4: Add lease breakpoints in case of a delegation recall or return
When we add support for application level leases and knfsd delegations
to the NFS client, we we want to have them safely underpinned by a
"real" delegation to provide the caching guarantees. If that real
delegation is recalled, then we need to ensure that the application
leases/delegations are recalled too.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-13 19:36:28 -04:00
Trond Myklebust be20037725 NFSv4: Fix delegation return in cases where we have to retry
If we're unable to immediately recover all locks because the server is
unable to immediately service our reclaim calls, then we want to retry
after we've finished servicing all the other asynchronous delegation
returns on our queue.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-13 19:36:27 -04:00
Linus Torvalds 960f0716d8 NFS client bugfixes for Linux 5.13
Highlights include
 
 Stable fixes:
 - Fix use-after-free in nfs4_init_client()
 
 Bugfixes:
 - Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
 - Fix second deadlock in nfs4_evict_inode()
 - nfs4_proc_set_acl should not change the value of NFS_CAP_UIDGID_NOMAP
 - Fix setting of the NFS_CAP_SECURITY_LABEL capability
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmDGJPEACgkQZwvnipYK
 APLH8xAAsdoKVCW35P+FtlzQvq0iWoTvk15i4Jv8+SyFtqAZe6y6pEj9+RT47CAV
 kt/uNa6CQ9KjxxgwBf2XoGTuf4MrOUU34kQBF/tRLy9zDdXUsZH263vapopmel6L
 BVHEEsID6hz8+BUt1LFsr+8sWxG+12UiimEu0CVo4BE8SgYushWpJOQ9iL/zxi1O
 gXmlAfA9g38I9aUApke4hOPSHVTGaQaAKl5LbSoycQlJblzgA1yIXdU9sVTHDJY6
 sco9O9M+NPY8gefS4d7iXSihZin5V9rNuSJ9SKiCPikTEjZYgZbw1umGj6VnF/5e
 QD47QGgOwXKeCOBv6Oe4VYxE2JISoUFZw8+pxjy4eDO+EcJv3IrHOM8UrsiddGAA
 DLHzbbrMUx6mGdgibw/ktkwx0Q/DvGrfrvKidk33cs16DPWgTZAG//n7spuqYTmT
 8fQbJF6DDjsYM7v+WdImf7VBA8dreXb/QcHwxCtH7uG+hGyRiYoDSOmH3mGBKpLX
 idkjz6Hvj7V7Y1z4qd+nvh4Ch1V0b9BX+J/+6dKHRykpmSJTIMIlQw7/wA6a8Lp6
 WJX4KbUzZHojvqM1BMzRL34+qidihUso0RIj0VjCB1JQyosRnIeTPorfHLQZTOM0
 IjP8h48BB7E7cJeJP1dmhvm7Hb8SpFVDxDHoWRtscbQflO3Wdkw=
 =PABi
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:

   - Fix use-after-free in nfs4_init_client()

  Bugfixes:

   - Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()

   - Fix second deadlock in nfs4_evict_inode()

   - nfs4_proc_set_acl should not change the value of NFS_CAP_UIDGID_NOMAP

   - Fix setting of the NFS_CAP_SECURITY_LABEL capability"

* tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix second deadlock in nfs4_evict_inode()
  NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
  NFS: FMODE_READ and friends are C macros, not enum types
  NFS: Fix a potential NULL dereference in nfs_get_client()
  NFS: Fix use-after-free in nfs4_init_client()
  NFS: Ensure the NFS_CAP_SECURITY_LABEL capability is set when appropriate
  NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
2021-06-13 12:32:59 -07:00
Linus Torvalds 87a7f7368b Driver core fix for 5.13-rc6
Here is a single debugfs fix for 5.13-rc6.
 
 It fixes a bug in debugfs_read_file_str() that showed up in 5.13-rc1.
 
 It has been in linux-next for a full week with no reported problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYMTWug8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yl5MQCeMMEMCGsoQdeXI1t2WMAMTmWRTZYAn1GqGliM
 b3RkczkNgKnEfDB2+M1r
 =wWW8
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fix from Greg KH:
 "A single debugfs fix for 5.13-rc6, fixing a bug in
  debugfs_read_file_str() that showed up in 5.13-rc1.

  It has been in linux-next for a full week with no
  reported problems"

* tag 'driver-core-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  debugfs: Fix debugfs_read_file_str()
2021-06-12 12:18:49 -07:00
Linus Torvalds b2568eeb96 io_uring-5.13-2021-06-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDEwEEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpu2uEACIZXc0e4Jz2tJmtlLzhm0T+YUXu88/n0Ki
 3HsCfjyk0k2tvGjAmzLgBruR+0dxuoTlC8ZyLWkCgYFvRxCQMrjxB4+Q53WAAPud
 ictv/5C992eWfmkk5lKWYh/SVUZU0nN/HlcITggFzH+/Ek4RgqBJK6rYPpN4YM6W
 OifSZ22xwjZy9i8svzCPzGUbS5d5qbNeRSaacfADWFmzTqqzllWz/KkN633UFefR
 tkqWy610P0O8fz3xe5HcECIOc3aNRZuk5zrNqCJPvxcOdYlqlL/HfsWMACEiC/g1
 N3ahNGrUzJqhB1QNAIKATKAlh8hzAws9t/alLJQzSHZWRu7vso0qctoVJT3i6xRp
 qD17EAQgrC0R0fQxdHmoMzRHEnKPCXQx36wb/mhZbG60/Q+scmSrFXp86XvbKZiI
 uzHTsUL/80bRXHuVrKXT+JWTRCzpv1yk9ufIVzSOheVCl/H6bxZ29cabBL2/XvvI
 d+OljDsy7oMH6rOBFi3XYmwZShEoUqeATeFoFf5isjkWfe7qdiMVu4apD8fBhIjX
 8rNLjp0nIKN+5IjHwFkAXRwp8P1SJQ8c7Tl4I6xY82FsMQxUUgMhjSqrn58i2g9d
 Lem9YHKaXIbw1yfWcaf8erA6d0S4rujG+j3miG0y248kOTb9FeMbfbRgjj8v99m1
 XB7F9SIQUw==
 =MbrN
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.13-2021-06-12' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Just an API change for the registration changes that went into this
  release. Better to get it sorted out now than before it's too late"

* tag 'io_uring-5.13-2021-06-12' of git://git.kernel.dk/linux-block:
  io_uring: add feature flag for rsrc tags
  io_uring: change registration/upd/rsrc tagging ABI
2021-06-12 11:53:20 -07:00
Alexander Aring 957adb68b3 fs: dlm: invalid buffer access in lookup error
This patch will evaluate the message length if a dlm opts header can fit
in before accessing it if a node lookup fails. The invalid sequence
error means that the version detection failed and an unexpected message
arrived. For debugging such situation the type of arrived message is
important to know.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-06-11 12:44:47 -05:00
Alexander Aring f5fe8d5107 fs: dlm: fix race in mhandle deletion
This patch fixes a race between mhandle deletion in case of receiving an
acknowledge and flush of all pending mhandle in cases of an timeout or
resetting node states.

Fixes: 489d8e559c ("fs: dlm: add reliable connection if reconnect")
Reported-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-06-11 12:44:47 -05:00
Pavel Begunkov 9690557e22 io_uring: add feature flag for rsrc tags
Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of
new IORING_REGISTER operations, in particular
IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc
tagging, and also indicating implemented dynamic fixed buffer updates.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10 16:33:51 -06:00
Pavel Begunkov 992da01aa9 io_uring: change registration/upd/rsrc tagging ABI
There are ABI moments about recently added rsrc registration/update and
tagging that might become a nuisance in the future. First,
IORING_REGISTER_RSRC[_UPD] hide different types of resources under it,
so breaks fine control over them by restrictions. It works for now, but
once those are wanted under restrictions it would require a rework.

It was also inconvenient trying to fit a new resource not supporting
all the features (e.g. dynamic update) into the interface, so better
to return to IORING_REGISTER_* top level dispatching.

Second, register/update were considered to accept a type of resource,
however that's not a good idea because there might be several ways of
registration of a single resource type, e.g. we may want to add
non-contig buffers or anything more exquisite as dma mapped memory.
So, remove IORING_RSRC_[FILE,BUFFER] out of the ABI, and place them
internally for now to limit changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10 16:33:51 -06:00
Eric W. Biederman 06af867944 coredump: Limit what can interrupt coredumps
Olivier Langlois has been struggling with coredumps being incompletely written in
processes using io_uring.

Olivier Langlois <olivier@trillion01.com> writes:
> io_uring is a big user of task_work and any event that io_uring made a
> task waiting for that occurs during the core dump generation will
> generate a TIF_NOTIFY_SIGNAL.
>
> Here are the detailed steps of the problem:
> 1. io_uring calls vfs_poll() to install a task to a file wait queue
>    with io_async_wake() as the wakeup function cb from io_arm_poll_handler()
> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
>    set_notify_signal()

The coredump code deliberately supports being interrupted by SIGKILL,
and depends upon prepare_signal to filter out all other signals.   Now
that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
in dump_emitted by the coredump code no longer works.

Make the coredump code more robust by explicitly testing for all of
the wakeup conditions the coredump code supports.  This prevents
new wakeup conditions from breaking the coredump code, as well
as fixing the current issue.

The filesystem code that the coredump code uses already limits
itself to only aborting on fatal_signal_pending.  So it should
not develop surprising wake-up reasons either.

v2: Don't remove the now unnecessary code in prepare_signal.

Cc: stable@vger.kernel.org
Fixes: 12db8b6900 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Reported-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-10 14:02:29 -07:00
Masami Hiramatsu ca24306d83 bootconfig: Change array value to use child node
It is not possible to put an array value with subkeys under
a key node, because both of subkeys and the array elements
are using "next" field of the xbc_node.

Thus this changes the array values to use "child" field in
the array case. The reason why split this change is to
test it easily.

Link: https://lkml.kernel.org/r/162262193838.264090.16044473274501498656.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-10 13:38:25 -04:00
Al Viro f0b65f39ac iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
Replacement is called copy_page_from_iter_atomic(); unlike the old primitive the
callers do *not* need to do iov_iter_advance() after it.  In case when they end
up consuming less than they'd been given they need to do iov_iter_revert() on
everything they had not consumed.  That, however, needs to be done only on slow
paths.

All in-tree callers converted.  And that kills the last user of iterate_all_kinds()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-06-10 11:45:14 -04:00
Linus Torvalds cc6cf827dd for-5.13-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDAtXUACgkQxWXV+ddt
 WDtbdA//ccQ8JL5yC/x/j0ZXLJ2INqXpxIUPjadwwEjtTgOllvx+f1nU0QazeYfM
 XvvzDDvpemWajC2Ii54s2HCQbG+dAzO1YBl1XCyve91T0GeNGhzytZwM0pVxZePQ
 A+aOyVH7IcfFcmBy9T0yctqiGgtD3lre208kU9kolidsIyomLHxBckBhMYDXvJCK
 BOdrjq3f6H5J0zqOqAnWdc/Wc5z5pw3CHxlIuoA3Tp0Gv9TIx366Z/IvmFfCyvCt
 kYv2qnUaw10OlFLiqhetlZyv49ibW4waj0RbyY/rZx+69sE/PM4961NYAjLoFJc2
 6OoZZO4OHWrNZpBJfbyyX9KVLspix075FID7qVhE/AVW4CYZGOFu5wJyXQiYlysH
 1qqkihK3gbKEsB2429UeLZktupmx79LBIgg346+DSQYiMXMTGR8iZY1onbBM2wlf
 bep65hsiHhxoC6Z/KhxrTGZM2jyYW2nICw3o0xikhWv7MZPWKfKHrH9NJQ9Lpuhy
 gxut0ef9HbPXWP9PgRmY0Z8PsUi8RT1bv0bHVw7EnhLbi62neJLyxY3Q++W+7vBG
 LYeaxKWLTTJu73wpBQHLI0pD0UifXLrTkiCI+4gN8zVfzxUl+90mGz2AdSRRFI+U
 kNdX/haEHi00WBqYxWt33ae/FuSHjPuYXjiPQA7Kiy/C3n9GAB0=
 =mGAq
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes that people hit during testing.

  Zoned mode fix:

   - fix 32bit value wrapping when calculating superblock offsets

  Error handling fixes:

   - properly check filesystema and device uuids

   - properly return errors when marking extents as written

   - do not write supers if we have an fs error"

* tag 'for-5.13-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: promote debugging asserts to full-fledged checks in validate_super
  btrfs: return value from btrfs_mark_extent_written() in case of error
  btrfs: zoned: fix zone number to sector/physical calculation
  btrfs: do not write supers if we have an fs error
2021-06-09 13:34:48 -07:00
Allison Henderson 816c8e39b7 xfs: Make attr name schemes consistent
This patch renames the following functions to make the nameing scheme more consistent:
xfs_attr_shortform_remove -> xfs_attr_sf_removename
xfs_attr_node_remove_name -> xfs_attr_node_removename
xfs_attr_set_fmt -> xfs_attr_sf_addname

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-09 09:34:05 -07:00
Allison Henderson 4a4957c16d xfs: Fix default ASSERT in xfs_attr_set_iter
This ASSERT checks for the state value of RM_SHRINK in the set path
which should never happen.  Change to ASSERT(0);

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-09 09:33:14 -07:00
Greg Kurz e4a9ccdd1c fuse: Fix infinite loop in sget_fc()
We don't set the SB_BORN flag on submounts. This is wrong as these
superblocks are then considered as partially constructed or dying
in the rest of the code and can break some assumptions.

One such case is when you have a virtiofs filesystem with submounts
and you try to mount it again : virtio_fs_get_tree() tries to obtain
a superblock with sget_fc(). The logic in sget_fc() is to loop until
it has either found an existing matching superblock with SB_BORN set
or to create a brand new one. It is assumed that a superblock without
SB_BORN is transient and the loop is restarted. Forgetting to set
SB_BORN on submounts hence causes sget_fc() to retry forever.

Setting SB_BORN requires special care, i.e. a write barrier for
super_cache_count() which can check SB_BORN without taking any lock.
We should call vfs_get_tree() to deal with that but this requires
to have a proper ->get_tree() implementation for submounts, which
is a bigger piece of work. Go for a simple bug fix in the meatime.

Fixes: bf109c6404 ("fuse: implement crossmounts")
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-09 15:33:40 +02:00
Greg Kurz e3a43f2a95 fuse: Fix crash if superblock of submount gets killed early
As soon as fuse_dentry_automount() does up_write(&sb->s_umount), the
superblock can theoretically be killed. If this happens before the
submount was added to the &fc->mounts list, fuse_mount_remove() later
crashes in list_del_init() because it assumes the submount to be
already there.

Add the submount before dropping sb->s_umount to fix the inconsistency.
It is okay to nest fc->killsb under sb->s_umount, we already do this
on the ->kill_sb() path.

Signed-off-by: Greg Kurz <groug@kaod.org>
Fixes: bf109c6404 ("fuse: implement crossmounts")
Cc: stable@vger.kernel.org # v5.10+
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-09 15:33:40 +02:00
Greg Kurz d92d88f056 fuse: Fix crash in fuse_dentry_automount() error path
If fuse_fill_super_submount() returns an error, the error path
triggers a crash:

[   26.206673] BUG: kernel NULL pointer dereference, address: 0000000000000000
[...]
[   26.226362] RIP: 0010:__list_del_entry_valid+0x25/0x90
[...]
[   26.247938] Call Trace:
[   26.248300]  fuse_mount_remove+0x2c/0x70 [fuse]
[   26.248892]  virtio_kill_sb+0x22/0x160 [virtiofs]
[   26.249487]  deactivate_locked_super+0x36/0xa0
[   26.250077]  fuse_dentry_automount+0x178/0x1a0 [fuse]

The crash happens because fuse_mount_remove() assumes that the FUSE
mount was already added to list under the FUSE connection, but this
only done after fuse_fill_super_submount() has returned success.

This means that until fuse_fill_super_submount() has returned success,
the FUSE mount isn't actually owned by the superblock. We should thus
reclaim ownership by clearing sb->s_fs_info, which will skip the call
to fuse_mount_remove(), and perform rollback, like virtio_fs_get_tree()
already does for the root sb.

Fixes: bf109c6404 ("fuse: implement crossmounts")
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-09 15:33:40 +02:00
Kees Cook 591a22c14d proc: Track /proc/$pid/attr/ opener mm_struct
Commit bfb819ea20 ("proc: Check /proc/$pid/attr/ writes against file opener")
tried to make sure that there could not be a confusion between the opener of
a /proc/$pid/attr/ file and the writer. It used struct cred to make sure
the privileges didn't change. However, there were existing cases where a more
privileged thread was passing the opened fd to a differently privileged thread
(during container setup). Instead, use mm_struct to track whether the opener
and writer are still the same process. (This is what several other proc files
already do, though for different reasons.)

Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Fixes: bfb819ea20 ("proc: Check /proc/$pid/attr/ writes against file opener")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-08 10:24:09 -07:00
Darrick J. Wong b26b2bf14f xfs: rename struct xfs_eofblocks to xfs_icwalk
The xfs_eofblocks structure is no longer well-named -- nowadays it
provides optional filtering criteria to any walk of the incore inode
cache.  Only one of the cache walk goals has anything to do with
clearing of speculative post-EOF preallocations, so change the name to
be more appropriate.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-08 09:30:20 -07:00
Darrick J. Wong 2d53f66baf xfs: change the prefix of XFS_EOF_FLAGS_* to XFS_ICWALK_FLAG_
In preparation for renaming struct xfs_eofblocks to struct xfs_icwalk,
change the prefix of the existing XFS_EOF_FLAGS_* flags to
XFS_ICWALK_FLAG_ and convert all the existing users.  This adds a degree
of interface separation between the ioctl definitions and the incore
parameters.  Since FLAGS_UNION is only used in xfs_icache.c, move it
there as a private flag.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-06-08 09:30:20 -07:00
Darrick J. Wong 9492750a8b xfs: selectively keep sick inodes in memory
It's important that the filesystem retain its memory of sick inodes for
a little while after problems are found so that reports can be collected
about what was wrong.  Don't let inode reclamation free sick inodes
unless we're unmounting or the fs already went down.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-06-08 09:30:20 -07:00
Darrick J. Wong 7975e465af xfs: drop IDONTCACHE on inodes when we mark them sick
When we decide to mark an inode sick, clear the DONTCACHE flag so that
the incore inode will be kept around until memory pressure forces it out
of memory.  This increases the chances that the sick status will be
caught by someone compiling a health report later on.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-06-08 09:30:20 -07:00
Darrick J. Wong 255794c7ed xfs: only reset incore inode health state flags when reclaiming an inode
While running some fuzz tests on inode metadata, I noticed that the
filesystem health report (as provided by xfs_spaceman) failed to report
the file corruption even when spaceman was run immediately after running
xfs_scrub to detect the corruption.  That isn't the intended behavior;
one ought to be able to run scrub to detect errors in the ondisk
metadata and be able to access to those reports for some time after the
scrub.

After running the same sequence through an instrumented kernel, I
discovered the reason why -- scrub igets the file, scans it, marks it
sick, and ireleases the inode.  When the VFS lets go of the incore
inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
inode before it moves to RECLAIM state, iget reinitializes the VFS
state, clears the sick and checked masks, and hands back the inode.  At
this point, the caller has the exact same incore inode, but with all the
health state erased.

In other words, we're erasing the incore inode's health state flags when
we've decided NOT to sever the link between the incore inode and the
ondisk inode.  This is wrong, so we need to remove the lines that zero
the fields from xfs_iget_cache_hit.

As a precaution, we add the same lines into xfs_reclaim_inode just after
we sever the link between incore and ondisk inode.  Strictly speaking
this isn't necessary because once an inode has gone through reclaim it
must go through xfs_inode_alloc (which also zeroes the state) and
xfs_iget is careful to check for mismatches between the inode it pulls
out of the radix tree and the one it wants.

Fixes: 6772c1f112 ("xfs: track metadata health status")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-06-08 09:30:20 -07:00
Darrick J. Wong ffc18582ed xfs: clean up incore inode walk functions
This ambitious series aims to cleans up redundant inode walk code in
 xfs_icache.c, hide implementation details of the quotaoff dquot release
 code, and eliminates indirect function calls from incore inode walks.
 
 The first thing it does is to move all the code that quotaoff calls to
 release dquots from all incore inodes into xfs_icache.c.  Next, it
 separates the goal of an inode walk from the actual radix tree tags that
 may or may not be involved and drops the kludgy XFS_ICI_NO_TAG thing.
 Finally, we split the speculative preallocation (blockgc) and quotaoff
 dquot release code paths into separate functions so that we can keep the
 implementations cohesive.
 
 Christoph suggested last cycle that we 'simply' change quotaoff not to
 allow deactivating quota entirely, but as these cleanups are to enable
 one major change in behavior (deferred inode inactivation) I do not want
 to add a second behavior change (quotaoff) as a dependency.
 
 To be blunt: Additional cleanups are not in scope for this series.
 
 Next, I made two observations about incore inode radix tree walks --
 since there's a 1:1 mapping between the walk goal and the per-inode
 processing function passed in, we can use the goal to make a direct call
 to the processing function.  Furthermore, the only caller to supply a
 nonzero iter_flags argument is quotaoff, and there's only one INEW flag.
 
 From that observation, I concluded that it's quite possible to remove
 two parameters from the xfs_inode_walk* function signatures -- the
 iter_flags, and the execute function pointer.  The middle of the series
 moves the INEW functionality into the one piece (quotaoff) that wants
 it, and removes the indirect calls.
 
 The final observation is that the inode reclaim walk loop is now almost
 the same as xfs_inode_walk, so it's silly to maintain two copies.  Merge
 the reclaim loop code into xfs_inode_walk.
 
 Lastly, refactor the per-ag radix tagging functions since there's
 duplicated code that can be consolidated.
 
 This series is a prerequisite for the next two patchsets, since deferred
 inode inactivation will add another inode radix tree tag and iterator
 function to xfs_inode_walk.
 
 v2: walk the vfs inode list when running quotaoff instead of the radix
     tree, then rework the (now completely internal) inode walk function
     to take the tag as the main parameter.
 v3: merge the reclaim loop into xfs_inode_walk, then consolidate the
     radix tree tagging functions
 v4: rebase to 5.13-rc4
 v5: combine with the quotaoff patchset, reorder functions to minimize
     forward declarations, split inode walk goals from radix tree tags
     to reduce conceptual confusion
 v6: start moving the inode cache code towards the xfs_icwalk prefix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmC5Yv0ACgkQ+H93GTRK
 tOv7Fg//Z7cKph0zSg6qsukMEMZxscuNcEBydCW1bu9gSx1NpszDpiGqAiO5ZB3X
 wP2XkCqjuatbNGGvkNLHS/M4sbLX3ELogvYmMRvUhDoaSFxT/KKgxvsyNffiCSS7
 xRB/rvWRp9MGRpBWPF0ZUxFU6VBzhCrYdMsNhvW95AEup8S/j+NplwoIif0gzaZZ
 Q6Fl4Ca9VEBvJQPV+/zkLih19iFItmARJhPHUs4BO1nZv+CzZBFQHg7Ijw7nW92j
 eSY68W4LH/IQ5cqm+HrD/+Z6ns0P7J2viewzVymkNEGnuX4a0xrQrzQ8ydRsAxTi
 9EDrpIe3MbSI5YjJfmRe8G3LX5p7vBpqc8TeyZdRDMGWkFjT33HPlQNb6WxKLQbA
 mjKdfr8AYZR/UQKW/7oZFrJnOoMpYRAQ4Sn/9BAYZQYm7tiLzuZsrEZ7JBwiUA56
 XHmlsDDeLzJeKvjmUu8M3H4oh4Nwf5/I2vJwHjueTfhl83uJP04igIXC4rnq56bM
 AAAjH9uV11Fo3q0ywAnRtN2HYj8PEJlCMK5CNskILrGeMITsBPGht0SbaA6hDI2h
 GYmltKInHzuPhHC9NfyPVrVr3BrmPR5cBsVFESiz5A4E9rbuKmmna6Yk8MFlMyl8
 FRIA3zVatJ2qQXtsAcdI8AZzMd7ciYhkAgCqFKxv8qK/qxITHh4=
 =Rxdn
 -----END PGP SIGNATURE-----

Merge tag 'inode-walk-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2

xfs: clean up incore inode walk functions

This ambitious series aims to cleans up redundant inode walk code in
xfs_icache.c, hide implementation details of the quotaoff dquot release
code, and eliminates indirect function calls from incore inode walks.

The first thing it does is to move all the code that quotaoff calls to
release dquots from all incore inodes into xfs_icache.c.  Next, it
separates the goal of an inode walk from the actual radix tree tags that
may or may not be involved and drops the kludgy XFS_ICI_NO_TAG thing.
Finally, we split the speculative preallocation (blockgc) and quotaoff
dquot release code paths into separate functions so that we can keep the
implementations cohesive.

Christoph suggested last cycle that we 'simply' change quotaoff not to
allow deactivating quota entirely, but as these cleanups are to enable
one major change in behavior (deferred inode inactivation) I do not want
to add a second behavior change (quotaoff) as a dependency.

To be blunt: Additional cleanups are not in scope for this series.

Next, I made two observations about incore inode radix tree walks --
since there's a 1:1 mapping between the walk goal and the per-inode
processing function passed in, we can use the goal to make a direct call
to the processing function.  Furthermore, the only caller to supply a
nonzero iter_flags argument is quotaoff, and there's only one INEW flag.

From that observation, I concluded that it's quite possible to remove
two parameters from the xfs_inode_walk* function signatures -- the
iter_flags, and the execute function pointer.  The middle of the series
moves the INEW functionality into the one piece (quotaoff) that wants
it, and removes the indirect calls.

The final observation is that the inode reclaim walk loop is now almost
the same as xfs_inode_walk, so it's silly to maintain two copies.  Merge
the reclaim loop code into xfs_inode_walk.

Lastly, refactor the per-ag radix tagging functions since there's
duplicated code that can be consolidated.

This series is a prerequisite for the next two patchsets, since deferred
inode inactivation will add another inode radix tree tag and iterator
function to xfs_inode_walk.

v2: walk the vfs inode list when running quotaoff instead of the radix
    tree, then rework the (now completely internal) inode walk function
    to take the tag as the main parameter.
v3: merge the reclaim loop into xfs_inode_walk, then consolidate the
    radix tree tagging functions
v4: rebase to 5.13-rc4
v5: combine with the quotaoff patchset, reorder functions to minimize
    forward declarations, split inode walk goals from radix tree tags
    to reduce conceptual confusion
v6: start moving the inode cache code towards the xfs_icwalk prefix

* tag 'inode-walk-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: refactor per-AG inode tagging functions
  xfs: merge xfs_reclaim_inodes_ag into xfs_inode_walk_ag
  xfs: pass struct xfs_eofblocks to the inode scan callback
  xfs: fix radix tree tag signs
  xfs: make the icwalk processing functions clean up the grab state
  xfs: clean up inode state flag tests in xfs_blockgc_igrab
  xfs: remove indirect calls from xfs_inode_walk{,_ag}
  xfs: remove iter_flags parameter from xfs_inode_walk_*
  xfs: move xfs_inew_wait call into xfs_dqrele_inode
  xfs: separate the dqrele_all inode grab logic from xfs_inode_walk_ag_grab
  xfs: pass the goal of the incore inode walk to xfs_inode_walk()
  xfs: rename xfs_inode_walk functions to xfs_icwalk
  xfs: move the inode walk functions further down
  xfs: detach inode dquots at the end of inactivation
  xfs: move the quotaoff dqrele inode walk into xfs_icache.c

[djwong: added variable names to function declarations while fixing
merge conflicts]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-08 09:26:44 -07:00
Darrick J. Wong 8b943d21d4 xfs: assorted fixes for 5.14, part 1
This branch contains the first round of various small fixes for 5.14.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmC5YvwACgkQ+H93GTRK
 tOuDTxAAmW+Mc4oPx/Fsa2xqvkDYCGaM1QtkIY9McrGsMdzF0JbB7bR5rrUET2+N
 /FsFTIQ31vZfMEJ69HM6NjQ4xXIKhdEj+PlGsjvg62BqhpPBRuFH7dLBWChiRcf2
 M5SN1P9kcAjIoMR2tKzh8FMDmJTcJlJxxBi9kxb3YK1+UesUHsPS8/jSW3OuIov3
 AJzAZ08SrA8ful5uCMy9Mf5uBgfRuAUjDzA5CM5kA1xqlHREQi+oHl62E81N33mu
 RR8tdHQzvPO5rHyX84GzV5cu2CsmDuOPF2nA5SUxRhIZFfMo5mEezA/nxqqACYti
 rnRGxZVwIG9YYBESVxXFQXIjn5lHoQ2jomk/CraszeVEJCteNRnpbVGzd1xczM3u
 0iuDuHy+aVB/3QhfA6/0vjfttCzkMEld9U9c3WEjIaw5iUCxe531yfrvVEyF9blx
 NBjnQyHGbt+y26BzBjD33NJEdDoZqS3UIQ/rmb4f2mitGN5d9faAcJ754uRJt3o4
 K9HXGjuR+iOH/tCZKDL1hBc4M/pFgNdeBWyFYdQhh8eSj9HSCTCG58zAyQ2WPOSr
 6D/f4BMqivKzzz0HicZPJAoazrtcKGrWTHTxidHIkI4le367NOwv6YJquJ0pFMBs
 8L7OdRqYT3yw6+qErCjEn03WkP9O7V8lHf8hFxt7+dNxDukIj40=
 =6eZB
 -----END PGP SIGNATURE-----

Merge tag 'assorted-fixes-5.14-1_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2

xfs: assorted fixes for 5.14, part 1

This branch contains the first round of various small fixes for 5.14.

* tag 'assorted-fixes-5.14-1_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: don't take a spinlock unconditionally in the DIO fastpath
  xfs: mark xfs_bmap_set_attrforkoff static
  xfs: Remove redundant assignment to busy
  xfs: sort variable alphabetically to avoid repeated declaration
2021-06-08 09:22:34 -07:00
Darrick J. Wong f52edf6c54 xfs: various unit conversions
Crafting the realtime file extent size hint fixes revealed various
 opportunities to clean up unit conversions, so now that gets its own
 series.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmC5YvYACgkQ+H93GTRK
 tOtRihAAh5wKVV9Fwafkk3l2zTCw+K/SfciWvZqF8BClAjqzWUDaw4DJ2HS0lSmj
 ezOG+/HhJzZB4udC9WDM4TSFA/elR1pJBu5lSBpAt/K78Mpw/i1aFs1SyAfw6Qsp
 1A8um80jaCyzvcSxU1Hx/35CN0MQwpjJ23LziEDUVBvj/qV7Qj8DmI61CfWWYzs9
 jEJFzIvbKoRqs8Wqws7n4a7fw3VOyqC2EQV290xzumkYKOSAmpWQGkbiU2/YtqKY
 7w/GvjS1zRE7o1cGeKBYRA/35zbw5mqoHD5vJmdlUIsWK22fnN8h5vqYo5dpFPiz
 nFyK+MHVrV6WxKBKECHtgm7Jya+jjH0TFu/9js4k31ehe4LJc/nhEqmSrFiH3eGS
 4AXrhhZqtDZ1skjcPym83dCayW1uE2cWbHqYJ+ztQW5PrLafoWQ9Ld/vtrbL+b/U
 VIgh3LQmF371jGcE0twERNQPIb5F96w9mS6F+vq0JuvrGftxoa4tKVjE8jrNWJjo
 750/KT0Nupti0AYKq9WMD47BiSf5BbdnhYTN15X0mc5TyHo/EyXIl/I/hFfEMFp9
 AH/qWaQlt3LjMXvr1xF6R4RNSI+xxai7PHfW69Zi7tBea+7k9wTZjubgc2FFaMOQ
 Jf8n5L07IXVYDy0mGUGaokenQb2KPvfXVK7yYtaMj1qDh7Sjqjw=
 =HGL1
 -----END PGP SIGNATURE-----

Merge tag 'unit-conversion-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2

xfs: various unit conversions

Crafting the realtime file extent size hint fixes revealed various
opportunities to clean up unit conversions, so now that gets its own
series.

* tag 'unit-conversion-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: remove unnecessary shifts
  xfs: clean up open-coded fs block unit conversions
2021-06-08 09:21:24 -07:00
Dave Chinner 9ba0889e22 xfs: drop the AGI being passed to xfs_check_agi_freecount
From: Dave Chinner <dchinner@redhat.com>

Stephen Rothwell reported this compiler warning from linux-next:

fs/xfs/libxfs/xfs_ialloc.c: In function 'xfs_difree_finobt':
fs/xfs/libxfs/xfs_ialloc.c:2032:20: warning: unused variable 'agi' [-Wunused-variable]
 2032 |  struct xfs_agi   *agi = agbp->b_addr;

Which is fallout from agno -> perag conversions that were done in
this function. xfs_check_agi_freecount() is the only user of "agi"
in xfs_difree_finobt() now, and it only uses the agi to get the
current free inode count. We hold that in the perag structure, so
there's not need to directly reference the raw AGI to get this
information.

The btree cursor being passed to xfs_check_agi_freecount() has a
reference to the perag being operated on, so use that directly in
xfs_check_agi_freecount() rather than passing an AGI.

Fixes: 7b13c51551 ("xfs: use perag for ialloc btree cursors")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-08 09:19:22 -07:00
Darrick J. Wong c3eabd3650 xfs: initial agnumber -> perag conversions for shrink
If we want to use active references to the perag to be able to gate
 shrink removing AGs and hence perags safely, we've got a fair bit of
 work to do actually use perags in all the places we need to.
 
 There's a lot of code that iterates ag numbers and then
 looks up perags from that, often multiple times for the same perag
 in the one operation. If we want to use reference counted perags for
 access control, then we need to convert all these uses to perag
 iterators, not agno iterators.
 
 [Patches 1-4]
 
 The first step of this is consolidating all the perag management -
 init, free, get, put, etc into a common location. THis is spread all
 over the place right now, so move it all into libxfs/xfs_ag.[ch].
 This does expose kernel only bits of the perag to libxfs and hence
 userspace, so the structures and code is rearranged to minimise the
 number of ifdefs that need to be added to the userspace codebase.
 The perag iterator in xfs_icache.c is promoted to a first class API
 and expanded to the needs of the code as required.
 
 [Patches 5-10]
 
 These are the first basic perag iterator conversions and changes to
 pass the perag down the stack from those iterators where
 appropriate. A lot of this is obvious, simple changes, though in
 some places we stop passing the perag down the stack because the
 code enters into an as yet unconverted subsystem that still uses raw
 AGs.
 
 [Patches 11-16]
 
 These replace the agno passed in the btree cursor for per-ag btree
 operations with a perag that is passed to the cursor init function.
 The cursor takes it's own reference to the perag, and the reference
 is dropped when the cursor is deleted. Hence we get reference
 coverage for the entire time the cursor is active, even if the code
 that initialised the cursor drops it's reference before the cursor
 or any of it's children (duplicates) have been deleted.
 
 The first patch adds the perag infrastructure for the cursor, the
 next four patches convert a btree cursor at a time, and the last
 removes the agno from the cursor once it is unused.
 
 [Patches 17-21]
 
 These patches are a demonstration of the simplifications and
 cleanups that come from plumbing the perag through interfaces that
 select and then operate on a specific AG. In this case the inode
 allocation algorithm does up to three walks across all AGs before it
 either allocates an inode or fails. Two of these walks are purely
 just to select the AG, and even then it doesn't guarantee inode
 allocation success so there's a third walk if the selected AG
 allocation fails.
 
 These patches collapse the selection and allocation into a single
 loop, simplifies the error handling because xfs_dir_ialloc() always
 returns ENOSPC if no AG was selected for inode allocation or we fail
 to allocate an inode in any AG, gets rid of xfs_dir_ialloc()
 wrapper, converts inode allocation to run entirely from a single
 perag instance, and then factors xfs_dialloc() into a much, much
 simpler loop which is easy to understand.
 
 Hence we end up with the same inode allocation logic, but it only
 needs two complete iterations at worst, makes AG selection and
 allocation atomic w.r.t. shrink and chops out out over 100 lines of
 code from this hot code path.
 
 [Patch 22]
 
 Converts the unlink path to pass perags through it.
 
 There's more conversion work to be done, but this patchset gets
 through a large chunk of it in one hit. Most of the iterators are
 converted, so once this is solidified we can move on to converting
 these to active references for being able to free perags while the
 fs is still active.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmC3HUgUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h2yaw/+P0JzpI+6n06Ei00mjgE/Du/WhMLi
 0JQ93Grlj+miuGGT9DgGCiRpoZnefhEk+BH6JqoEw1DQ3T5ilmAzrHLUUHSQC3+S
 dv85sJduheQ6yHuoO+4MzkaSq6JWKe7E9gZwAsVyBul5aSjdmaJaQdPwYMTXSXo0
 5Uqq8ECFkMcaHVNjcBfasgR/fdyWy2Qe4PFTHTHdQpd+DNZ9UXgFKHW2og+1iry/
 zDIvdIppJULA09TvVcZuFjd/1NzHQ/fLj5PAzz8GwagB4nz2x3s78Zevmo5yW/jK
 3/+50vXa8ldhiHDYGTS3QXvS0xJRyqUyD47eyWOOiojZw735jEvAlCgjX6+0X1HC
 k3gCkQLv8l96fRkvUpgnLf/fjrUnlCuNBkm9d1Eq2Tied8dvLDtiEzoC6L05Nqob
 yd/nIUb1zwJFa9tsoheHhn0bblTGX1+zP0lbRJBje0LotpNO9DjGX5JoIK4GR7F8
 y1VojcdgRI14HlxUnbF3p8wmQByN+M2tnp6GSdv9BA65bjqi05Rj/steFdZHBV6x
 wiRs8Yh6BTvMwKgufHhRQHfRahjNHQ/T/vOE+zNbWqemS9wtEUDop+KvPhC36R/k
 o/cmr23cF8ESX2eChk7XM4On3VEYpcvp2zSFgrFqZYl6RWOwEis3Htvce3KuSTPp
 8Xq70te0gr2DVUU=
 =YNzW
 -----END PGP SIGNATURE-----

Merge tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.14-merge2

xfs: initial agnumber -> perag conversions for shrink

If we want to use active references to the perag to be able to gate
shrink removing AGs and hence perags safely, we've got a fair bit of
work to do actually use perags in all the places we need to.

There's a lot of code that iterates ag numbers and then
looks up perags from that, often multiple times for the same perag
in the one operation. If we want to use reference counted perags for
access control, then we need to convert all these uses to perag
iterators, not agno iterators.

[Patches 1-4]

The first step of this is consolidating all the perag management -
init, free, get, put, etc into a common location. THis is spread all
over the place right now, so move it all into libxfs/xfs_ag.[ch].
This does expose kernel only bits of the perag to libxfs and hence
userspace, so the structures and code is rearranged to minimise the
number of ifdefs that need to be added to the userspace codebase.
The perag iterator in xfs_icache.c is promoted to a first class API
and expanded to the needs of the code as required.

[Patches 5-10]

These are the first basic perag iterator conversions and changes to
pass the perag down the stack from those iterators where
appropriate. A lot of this is obvious, simple changes, though in
some places we stop passing the perag down the stack because the
code enters into an as yet unconverted subsystem that still uses raw
AGs.

[Patches 11-16]

These replace the agno passed in the btree cursor for per-ag btree
operations with a perag that is passed to the cursor init function.
The cursor takes it's own reference to the perag, and the reference
is dropped when the cursor is deleted. Hence we get reference
coverage for the entire time the cursor is active, even if the code
that initialised the cursor drops it's reference before the cursor
or any of it's children (duplicates) have been deleted.

The first patch adds the perag infrastructure for the cursor, the
next four patches convert a btree cursor at a time, and the last
removes the agno from the cursor once it is unused.

[Patches 17-21]

These patches are a demonstration of the simplifications and
cleanups that come from plumbing the perag through interfaces that
select and then operate on a specific AG. In this case the inode
allocation algorithm does up to three walks across all AGs before it
either allocates an inode or fails. Two of these walks are purely
just to select the AG, and even then it doesn't guarantee inode
allocation success so there's a third walk if the selected AG
allocation fails.

These patches collapse the selection and allocation into a single
loop, simplifies the error handling because xfs_dir_ialloc() always
returns ENOSPC if no AG was selected for inode allocation or we fail
to allocate an inode in any AG, gets rid of xfs_dir_ialloc()
wrapper, converts inode allocation to run entirely from a single
perag instance, and then factors xfs_dialloc() into a much, much
simpler loop which is easy to understand.

Hence we end up with the same inode allocation logic, but it only
needs two complete iterations at worst, makes AG selection and
allocation atomic w.r.t. shrink and chops out out over 100 lines of
code from this hot code path.

[Patch 22]

Converts the unlink path to pass perags through it.

There's more conversion work to be done, but this patchset gets
through a large chunk of it in one hit. Most of the iterators are
converted, so once this is solidified we can move on to converting
these to active references for being able to free perags while the
fs is still active.

* tag 'xfs-perag-conv-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (23 commits)
  xfs: remove xfs_perag_t
  xfs: use perag through unlink processing
  xfs: clean up and simplify xfs_dialloc()
  xfs: inode allocation can use a single perag instance
  xfs: get rid of xfs_dir_ialloc()
  xfs: collapse AG selection for inode allocation
  xfs: simplify xfs_dialloc_select_ag() return values
  xfs: remove agno from btree cursor
  xfs: use perag for ialloc btree cursors
  xfs: convert allocbt cursors to use perags
  xfs: convert refcount btree cursor to use perags
  xfs: convert rmap btree cursor to using a perag
  xfs: add a perag to the btree cursor
  xfs: pass perags around in fsmap data dev functions
  xfs: push perags through the ag reservation callouts
  xfs: pass perags through to the busy extent code
  xfs: convert secondary superblock walk to use perags
  xfs: convert xfs_iwalk to use perag references
  xfs: convert raw ag walks to use for_each_perag
  xfs: make for_each_perag... a first class citizen
  ...
2021-06-08 09:13:13 -07:00
Darrick J. Wong ebf2e33723 xfs: buffer cache bulk page allocation
This patchset makes use of the new bulk page allocation interface to
 reduce the overhead of allocating large numbers of pages in a
 loop.
 
 The first two patches are refactoring buffer memory allocation and
 converting the uncached buffer path to use the same page allocation
 path, followed by converting the page allocation path to use bulk
 allocation.
 
 The rest of the patches are then consolidation of the page
 allocation and freeing code to simplify the code and remove a chunk
 of unnecessary abstraction. This is largely based on a series of
 changes made by Christoph Hellwig.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmC+6OwUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h21QQ/8C0f7wq1OKwNI2oRubf6J8jtttiRS
 SD2TA03AP2OIOKx4y2G0h0dJeX9tgnbLerIlpfT80nHDoBgKHbZCYSEHQT0DscYo
 fuwQTVR8RCklKspjUlCGR+Gbm6vI8HakK1lAppw168e4c6t8wX1KiSibwaVTQdaZ
 NaXUqTUzGiNq+iiLS6fW3mJ3PKWFJYyrDOSR2jIPbUGIJdejRCGe0xVnu+hIsz+y
 c2gSGCB+j3cYaazhlJTDYPGja3Wq3eR+Ya9i1GcA1tJiJLsu0ZjaVQ69Bl4dud2F
 c3OyhFK0El1VMSEVb3hY8gTpAO02jNWSnB2Zlidt0h4ZJVAxKus0xe2w3eS4uST2
 hcMI3lwjdzRQuoBwOgXQ+CpYVv2wI8HPNLTSR+NYcC2IZaCNieFRWdTYwXrAJBB3
 H09m04GT/7TkkrYHFD1zRtIedP4DZ6MZn/33bufNxEt1NRCFw5AFAEUFfjDA317A
 4nByCmU6XjmmpI/XLixwu0BYCfKVB4UsrgOyzXBy7ZU0+pIser+ynP1V4d9Bb43Y
 xVQ8S0QirT7gqXjx75mD4B4qkXZ5nrz5Z7fSn6YU4TwqsYtZYlsBauLlWmmHp9MT
 CP4PA4j+CQORhfZzWXw2ViXYGoIssc1cw5i4JB6a4u/OaDi19dYkE6SO8P3b9GSm
 khHqWgcTC4VGpmc=
 =JsrV
 -----END PGP SIGNATURE-----

Merge tag 'xfs-buf-bulk-alloc-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.14-merge2

xfs: buffer cache bulk page allocation

This patchset makes use of the new bulk page allocation interface to
reduce the overhead of allocating large numbers of pages in a
loop.

The first two patches are refactoring buffer memory allocation and
converting the uncached buffer path to use the same page allocation
path, followed by converting the page allocation path to use bulk
allocation.

The rest of the patches are then consolidation of the page
allocation and freeing code to simplify the code and remove a chunk
of unnecessary abstraction. This is largely based on a series of
changes made by Christoph Hellwig.

* tag 'xfs-buf-bulk-alloc-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: merge xfs_buf_allocate_memory
  xfs: cleanup error handling in xfs_buf_get_map
  xfs: get rid of xb_to_gfp()
  xfs: simplify the b_page_count calculation
  xfs: remove ->b_offset handling for page backed buffers
  xfs: move page freeing into _xfs_buf_free_pages()
  xfs: merge _xfs_buf_get_pages()
  xfs: use alloc_pages_bulk_array() for buffers
  xfs: use xfs_buf_alloc_pages for uncached buffers
  xfs: split up xfs_buf_allocate_memory
2021-06-08 09:10:01 -07:00
Marc Dionne dc2557308e afs: Fix partial writeback of large files on fsync and close
In commit e87b03f583 ("afs: Prepare for use of THPs"), the return
value for afs_write_back_from_locked_page was changed from a number
of pages to a length in bytes.  The loop in afs_writepages_region uses
the return value to compute the index that will be used to find dirty
pages in the next iteration, but treats it as a number of pages and
wrongly multiplies it by PAGE_SIZE.  This gives a very large index value,
potentially skipping any dirty data that was not covered in the first
pass, which is limited to 256M.

This causes fsync(), and indirectly close(), to only do a partial
writeback of a large file's dirty data.  The rest is eventually written
back by background threads after dirty_expire_centisecs.

Fixes: e87b03f583 ("afs: Prepare for use of THPs")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210604175504.4055-1-marc.c.dionne@gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-07 12:56:05 -07:00
Gao Xiang c5fcb51111 erofs: clean up file headers & footers
- Remove my outdated misleading email address;

 - Get rid of all unnecessary trailing newline by accident.

Link: https://lore.kernel.org/r/20210602160634.10757-1-xiang@kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-06-08 00:41:24 +08:00
Yue Hu 7dea3de7d3 erofs: remove the occupied parameter from z_erofs_pagevec_enqueue()
No any behavior to variable occupied in z_erofs_attach_page() which
is only caller to z_erofs_pagevec_enqueue().

Link: https://lore.kernel.org/r/20210419102623.2015-1-zbestahu@gmail.com
Signed-off-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Gao Xiang <xiang@kernel.org>
2021-06-08 00:40:18 +08:00
Wei Yongjun 0508c1ad0f erofs: fix error return code in erofs_read_superblock()
'ret' will be overwritten to 0 if erofs_sb_has_sb_chksum() return true,
thus 0 will return in some error handling cases. Fix to return negative
error code -EINVAL instead of 0.

Link: https://lore.kernel.org/r/20210519141657.3062715-1-weiyongjun1@huawei.com
Fixes: b858a4844c ("erofs: support superblock checksum")
Cc: stable <stable@vger.kernel.org> # 5.5+
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Gao Xiang <xiang@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <xiang@kernel.org>
2021-06-08 00:40:18 +08:00
Jan Kara 64c2c2c62f quota: Change quotactl_path() systcall to an fd-based one
Some users have pointed out that path-based syscalls are problematic in
some environments and at least directory fd argument and possibly also
resolve flags are desirable for such syscalls. Rather than
reimplementing all details of pathname lookup and following where it may
eventually evolve, let's go for full file descriptor based syscall
similar to how ioctl(2) works since the beginning. Managing of quotas
isn't performance sensitive so the extra overhead of open does not
matter and we are able to consume O_PATH descriptors as well which makes
open cheap anyway. Also for frequent operations (such as retrieving
usage information for all users) we can reuse single fd and in fact get
even better performance as well as avoiding races with possible remounts
etc.

Tested-by: Sascha Hauer <s.hauer@pengutronix.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-06-07 12:11:24 +02:00
Dave Chinner 8bcac7448a xfs: merge xfs_buf_allocate_memory
It only has one caller and is now a simple function, so merge it
into the caller.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-07 11:50:48 +10:00
Christoph Hellwig 170041f715 xfs: cleanup error handling in xfs_buf_get_map
Use a single goto label for freeing the buffer and returning an
error.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2021-06-07 11:50:47 +10:00
Dave Chinner 289ae7b48c xfs: get rid of xb_to_gfp()
Only used in one place, so just open code the logic in the macro.
Based on a patch from Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-07 11:50:17 +10:00
Christoph Hellwig 934d1076bb xfs: simplify the b_page_count calculation
Ever since we stopped using the Linux page cache to back XFS buffers
there is no need to take the start sector into account for
calculating the number of pages in a buffer, as the data always
start from the beginning of the buffer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[dgc: modified to suit this series]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-07 11:50:00 +10:00
Christoph Hellwig 54cd3aa6f8 xfs: remove ->b_offset handling for page backed buffers
->b_offset can only be non-zero for _XBF_KMEM backed buffers, so
remove all code dealing with it for page backed buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[dgc: modified to fit this patchset]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-07 11:49:50 +10:00
Linus Torvalds 20e41d9bc8 Miscellaneous ext4 bug fixes for v5.13
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmC82AQACgkQ8vlZVpUN
 gaOkAgf+KH57P/P0sB6aVBHpAzqa9jTKJWMA5kpCqYUDkYlfF7n2hwsjMzWpJ5MY
 ZvFpKAflmRnve/ULUZQX6+zrcbieNs3e+6VFZrZ0PmxN0dupyISLY7jnvCRDleA7
 BFO34AcH+QEst9zXJmgta9eoy3LA8sawhQ/d7ujVY+IRFk40m26fuAMiaGznlQJ5
 dmrx7pHZWKFIDFIg2TdFlP+Voqbxs2VTT16gmWpGBdTyWYHKjbSOLKJFc9DwYeE9
 aANf6iIzwXz7y9pZiOnTrGuKDEJcIZNESkbIqw62YgqsoObLbsbCZNmNcqxyHpYQ
 Mh3L59KtmjANW3iOxQfyxkNTugxchw==
 =BSnf
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Miscellaneous ext4 bug fixes"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Only advertise encrypted_casefold when encryption and unicode are enabled
  ext4: fix no-key deletion for encrypt+casefold
  ext4: fix memory leak in ext4_fill_super
  ext4: fix fast commit alignment issues
  ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed
  ext4: fix accessing uninit percpu counter variable with fast_commit
  ext4: fix memory leak in ext4_mb_init_backend on error path.
2021-06-06 14:24:13 -07:00
Daniel Rosenberg e71f99f2df ext4: Only advertise encrypted_casefold when encryption and unicode are enabled
Encrypted casefolding is only supported when both encryption and
casefolding are both enabled in the config.

Fixes: 471fbbea7f ("ext4: handle casefolding with encryption")
Cc: stable@vger.kernel.org # 5.13+
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20210603094849.314342-1-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-06 10:10:23 -04:00
Daniel Rosenberg 63e7f12893 ext4: fix no-key deletion for encrypt+casefold
commit 471fbbea7f ("ext4: handle casefolding with encryption") is
missing a few checks for the encryption key which are needed to
support deleting enrypted casefolded files when the key is not
present.

This bug made it impossible to delete encrypted+casefolded directories
without the encryption key, due to errors like:

    W         : EXT4-fs warning (device vdc): __ext4fs_dirhash:270: inode #49202: comm Binder:378_4: Siphash requires key

Repro steps in kvm-xfstests test appliance:
      mkfs.ext4 -F -E encoding=utf8 -O encrypt /dev/vdc
      mount /vdc
      mkdir /vdc/dir
      chattr +F /vdc/dir
      keyid=$(head -c 64 /dev/zero | xfs_io -c add_enckey /vdc | awk '{print $NF}')
      xfs_io -c "set_encpolicy $keyid" /vdc/dir
      for i in `seq 1 100`; do
          mkdir /vdc/dir/$i
      done
      xfs_io -c "rm_enckey $keyid" /vdc
      rm -rf /vdc/dir # fails with the bug

Fixes: 471fbbea7f ("ext4: handle casefolding with encryption")
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20210522004132.2142563-1-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-06 10:10:23 -04:00
Alexey Makhalov afd09b617d ext4: fix memory leak in ext4_fill_super
Buffer head references must be released before calling kill_bdev();
otherwise the buffer head (and its page referenced by b_data) will not
be freed by kill_bdev, and subsequently that bh will be leaked.

If blocksizes differ, sb_set_blocksize() will kill current buffers and
page cache by using kill_bdev(). And then super block will be reread
again but using correct blocksize this time. sb_set_blocksize() didn't
fully free superblock page and buffer head, and being busy, they were
not freed and instead leaked.

This can easily be reproduced by calling an infinite loop of:

  systemctl start <ext4_on_lvm>.mount, and
  systemctl stop <ext4_on_lvm>.mount

... since systemd creates a cgroup for each slice which it mounts, and
the bh leak get amplified by a dying memory cgroup that also never
gets freed, and memory consumption is much more easily noticed.

Fixes: ce40733ce9 ("ext4: Check for return value from sb_set_blocksize")
Fixes: ac27a0ec11 ("ext4: initial copy of files from ext3")
Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2021-06-06 10:10:23 -04:00
Harshad Shirwadkar a7ba36bc94 ext4: fix fast commit alignment issues
Fast commit recovery data on disk may not be aligned. So, when the
recovery code reads it, this patch makes sure that fast commit info
found on-disk is first memcpy-ed into an aligned variable before
accessing it. As a consequence of it, we also remove some macros that
could resulted in unaligned accesses.

Cc: stable@kernel.org
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210519215920.2037527-1-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-06 10:10:23 -04:00
Ye Bin 082cd4ec24 ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed
We got follow bug_on when run fsstress with injecting IO fault:
[130747.323114] kernel BUG at fs/ext4/extents_status.c:762!
[130747.323117] Internal error: Oops - BUG: 0 [#1] SMP
......
[130747.334329] Call trace:
[130747.334553]  ext4_es_cache_extent+0x150/0x168 [ext4]
[130747.334975]  ext4_cache_extents+0x64/0xe8 [ext4]
[130747.335368]  ext4_find_extent+0x300/0x330 [ext4]
[130747.335759]  ext4_ext_map_blocks+0x74/0x1178 [ext4]
[130747.336179]  ext4_map_blocks+0x2f4/0x5f0 [ext4]
[130747.336567]  ext4_mpage_readpages+0x4a8/0x7a8 [ext4]
[130747.336995]  ext4_readpage+0x54/0x100 [ext4]
[130747.337359]  generic_file_buffered_read+0x410/0xae8
[130747.337767]  generic_file_read_iter+0x114/0x190
[130747.338152]  ext4_file_read_iter+0x5c/0x140 [ext4]
[130747.338556]  __vfs_read+0x11c/0x188
[130747.338851]  vfs_read+0x94/0x150
[130747.339110]  ksys_read+0x74/0xf0

This patch's modification is according to Jan Kara's suggestion in:
https://patchwork.ozlabs.org/project/linux-ext4/patch/20210428085158.3728201-1-yebin10@huawei.com/
"I see. Now I understand your patch. Honestly, seeing how fragile is trying
to fix extent tree after split has failed in the middle, I would probably
go even further and make sure we fix the tree properly in case of ENOSPC
and EDQUOT (those are easily user triggerable).  Anything else indicates a
HW problem or fs corruption so I'd rather leave the extent tree as is and
don't try to fix it (which also means we will not create overlapping
extents)."

Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210506141042.3298679-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-06 10:09:55 -04:00
Junxiao Bi 6bba4471f0 ocfs2: fix data corruption by fallocate
When fallocate punches holes out of inode size, if original isize is in
the middle of last cluster, then the part from isize to the end of the
cluster will be zeroed with buffer write, at that time isize is not yet
updated to match the new size, if writeback is kicked in, it will invoke
ocfs2_writepage()->block_write_full_page() where the pages out of inode
size will be dropped.  That will cause file corruption.  Fix this by
zero out eof blocks when extending the inode size.

Running the following command with qemu-image 4.2.1 can get a corrupted
coverted image file easily.

    qemu-img convert -p -t none -T none -f qcow2 $qcow_image \
             -O qcow2 -o compat=1.1 $qcow_image.conv

The usage of fallocate in qemu is like this, it first punches holes out
of inode size, then extend the inode size.

    fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2276196352, 65536) = 0
    fallocate(11, 0, 2276196352, 65536) = 0

v1: https://www.spinics.net/lists/linux-fsdevel/msg193999.html
v2: https://lore.kernel.org/linux-fsdevel/20210525093034.GB4112@quack2.suse.cz/T/

Link: https://lkml.kernel.org/r/20210528210648.9124-1-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-05 08:58:12 -07:00
Eric Biggers 2fc2b430f5 fscrypt: fix derivation of SipHash keys on big endian CPUs
Typically, the cryptographic APIs that fscrypt uses take keys as byte
arrays, which avoids endianness issues.  However, siphash_key_t is an
exception.  It is defined as 'u64 key[2];', i.e. the 128-bit key is
expected to be given directly as two 64-bit words in CPU endianness.

fscrypt_derive_dirhash_key() and fscrypt_setup_iv_ino_lblk_32_key()
forgot to take this into account.  Therefore, the SipHash keys used to
index encrypted+casefolded directories differ on big endian vs. little
endian platforms, as do the SipHash keys used to hash inode numbers for
IV_INO_LBLK_32-encrypted directories.  This makes such directories
non-portable between these platforms.

Fix this by always using the little endian order.  This is a breaking
change for big endian platforms, but this should be fine in practice
since these features (encrypt+casefold support, and the IV_INO_LBLK_32
flag) aren't known to actually be used on any big endian platforms yet.

Fixes: aa408f835d ("fscrypt: derive dirhash key for casefolded directories")
Fixes: e3b1078bed ("fscrypt: add support for IV_INO_LBLK_32 policies")
Cc: <stable@vger.kernel.org> # v5.6+
Link: https://lore.kernel.org/r/20210605075033.54424-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-06-05 00:52:52 -07:00
Eric Biggers 77f30bfcfc fscrypt: don't ignore minor_hash when hash is 0
When initializing a no-key name, fscrypt_fname_disk_to_usr() sets the
minor_hash to 0 if the (major) hash is 0.

This doesn't make sense because 0 is a valid hash code, so we shouldn't
ignore the filesystem-provided minor_hash in that case.  Fix this by
removing the special case for 'hash == 0'.

This is an old bug that appears to have originated when the encryption
code in ext4 and f2fs was moved into fs/crypto/.  The original ext4 and
f2fs code passed the hash by pointer instead of by value.  So
'if (hash)' actually made sense then, as it was checking whether a
pointer was NULL.  But now the hashes are passed by value, and
filesystems just pass 0 for any hashes they don't have.  There is no
need to handle this any differently from the hashes actually being 0.

It is difficult to reproduce this bug, as it only made a difference in
the case where a filename's 32-bit major hash happened to be 0.
However, it probably had the largest chance of causing problems on
ubifs, since ubifs uses minor_hash to do lookups of no-key names, in
addition to using it as a readdir cookie.  ext4 only uses minor_hash as
a readdir cookie, and f2fs doesn't use minor_hash at all.

Fixes: 0b81d07790 ("fs crypto: move per-file encryption from f2fs tree to fs/crypto")
Cc: <stable@vger.kernel.org> # v4.6+
Link: https://lore.kernel.org/r/20210527235236.2376556-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-06-05 00:22:53 -07:00
Christoph Hellwig 603e4922f1 remove the raw driver
The raw driver used to provide direct unbuffered access to block devices
before O_DIRECT was invented.  It has been obsolete for more than a
decade.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/lkml/Pine.LNX.4.64.0703180754060.6605@CPE00045a9c397f-CM001225dbafb6/
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210531072526.97052-1-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-04 15:35:03 +02:00
Dietmar Eggemann f501b6a231 debugfs: Fix debugfs_read_file_str()
Read the entire size of the buffer, including the trailing new line
character.
Discovered while reading the sched domain names of CPU0:

before:

cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
SMTMCDIE

after:

cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
SMT
MC
DIE

Fixes: 9af0440ec8 ("debugfs: Implement debugfs_create_str()")
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210527091105.258457-1-dietmar.eggemann@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-04 15:01:08 +02:00
Nikolay Borisov aefd7f7065 btrfs: promote debugging asserts to full-fledged checks in validate_super
Syzbot managed to trigger this assert while performing its fuzzing.
Turns out it's better to have those asserts turned into full-fledged
checks so that in case buggy btrfs images are mounted the users gets
an error and mounting is stopped. Alternatively with CONFIG_BTRFS_ASSERT
disabled such image would have been erroneously allowed to be mounted.

Reported-by: syzbot+a6bf271c02e4fe66b4e4@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add uuids to the messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-04 13:12:06 +02:00
Ritesh Harjani e7b2ec3d3d btrfs: return value from btrfs_mark_extent_written() in case of error
We always return 0 even in case of an error in btrfs_mark_extent_written().
Fix it to return proper error value in case of a failure. All callers
handle it.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-04 13:11:58 +02:00
Naohiro Aota 5b434df877 btrfs: zoned: fix zone number to sector/physical calculation
In btrfs_get_dev_zone_info(), we have "u32 sb_zone" and calculate "sector_t
sector" by shifting it. But, this "sector" is calculated in 32bit, leading
it to be 0 for the 2nd superblock copy.

Since zone number is u32, shifting it to sector (sector_t) or physical
address (u64) can easily trigger a missing cast bug like this.

This commit introduces helpers to convert zone number to sector/LBA, so we
won't fall into the same pitfall again.

Reported-by: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
Fixes: 12659251ca ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.11+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-04 13:11:50 +02:00
Josef Bacik 165ea85f14 btrfs: do not write supers if we have an fs error
Error injection testing uncovered a pretty severe problem where we could
end up committing a super that pointed to the wrong tree roots,
resulting in transid mismatch errors.

The way we commit the transaction is we update the super copy with the
current generations and bytenrs of the important roots, and then copy
that into our super_for_commit.  Then we allow transactions to continue
again, we write out the dirty pages for the transaction, and then we
write the super.  If the write out fails we'll bail and skip writing the
supers.

However since we've allowed a new transaction to start, we can have a
log attempting to sync at this point, which would be blocked on
fs_info->tree_log_mutex.  Once the commit fails we're allowed to do the
log tree commit, which uses super_for_commit, which now points at fs
tree's that were not written out.

Fix this by checking BTRFS_FS_STATE_ERROR once we acquire the
tree_log_mutex.  This way if the transaction commit fails we're sure to
see this bit set and we can skip writing the super out.  This patch
fixes this specific transid mismatch error I was seeing with this
particular error path.

CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-04 13:11:38 +02:00
Darrick J. Wong c076ae7a93 xfs: refactor per-AG inode tagging functions
In preparation for adding another incore inode tree tag, refactor the
code that sets and clears tags from the per-AG inode tree and the tree
of per-AG structures, and remove the open-coded versions used by the
blockgc code.

Note: For reclaim, we now rely on the radix tree tags instead of the
reclaimable inode count more heavily than we used to.  The conversion
should be fine, but the logic isn't 100% identical.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:04 -07:00
Darrick J. Wong f1bc5c5630 xfs: merge xfs_reclaim_inodes_ag into xfs_inode_walk_ag
Merge these two inode walk loops together, since they're pretty similar
now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:04 -07:00
Darrick J. Wong 9d5ee83759 xfs: pass struct xfs_eofblocks to the inode scan callback
Pass a pointer to the actual eofb structure around the inode scanner
functions instead of a void pointer, now that none of the functions is
used as a callback.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:04 -07:00
Darrick J. Wong 919a4ddb68 xfs: fix radix tree tag signs
Radix tree tags are supposed to be unsigned ints, so fix the callers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:04 -07:00
Darrick J. Wong 594ab00b76 xfs: make the icwalk processing functions clean up the grab state
Soon we're going to be adding two new callers to the incore inode walk
code: reclaim of incore inodes, and (later) inactivation of inodes.
Both states operate on inodes that no longer have any VFS state, so we
need to move the xfs_irele calls into the processing functions.

In other words, icwalk processing functions are responsible for cleaning
up whatever state changes are made by the corresponding icwalk igrab
function that picked the inode for processing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:03 -07:00
Darrick J. Wong d20d5edcf9 xfs: clean up inode state flag tests in xfs_blockgc_igrab
Clean up the definition of which inode states are not eligible for
speculative preallocation garbage collecting by creating a private
#define.  The deferred inactivation patchset will add two new entries to
the set of flags-to-ignore, so we want the definition not to end up a
cluttered mess.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:03 -07:00
Darrick J. Wong f427cf5c62 xfs: remove indirect calls from xfs_inode_walk{,_ag}
It turns out that there is a 1:1 mapping between the execute and goal
parameters that are passed to xfs_inode_walk_ag:

	xfs_blockgc_scan_inode <=> XFS_ICWALK_BLOCKGC
	xfs_dqrele_inode <=> XFS_ICWALK_DQRELE

Because of this exact correspondence, we don't need the execute function
pointer and can replace it with a direct call.

For the price of a forward static declaration, we can eliminate the
indirect function call.  This likely has a negligible impact on
performance (since the execute function runs transactions), but it also
simplifies the function signature.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:03 -07:00
Darrick J. Wong 7fdff52623 xfs: remove iter_flags parameter from xfs_inode_walk_*
The sole iter_flags is XFS_INODE_WALK_INEW_WAIT, and there are no users.
Remove the flag, and the parameter, and all the code that used it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:03 -07:00
Darrick J. Wong 9d2793ceec xfs: move xfs_inew_wait call into xfs_dqrele_inode
Move the INEW wait into xfs_dqrele_inode so that we can drop the
iter_flags parameter in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:03 -07:00
Darrick J. Wong b9baaef42f xfs: separate the dqrele_all inode grab logic from xfs_inode_walk_ag_grab
Disentangle the dqrele_all inode grab code from the "generic" inode walk
grabbing code, and and use the opportunity to document why the dqrele
grab function does what it does.  Since xfs_inode_walk_ag_grab is now
only used for blockgc, rename it to reflect that.

Ultimately, there will be four reasons to perform a walk of incore
inodes: quotaoff dquote releasing (dqrele), garbage collection of
speculative preallocations (blockgc), reclamation of incore inodes
(reclaim), and deferred inactivation (inodegc).  Each of these four have
their own slightly different criteria for deciding if they want to
handle an inode, so it makes more sense to have four cohesive igrab
functions than one confusing parameteric grab function like we do now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:03 -07:00
Darrick J. Wong c809d7e948 xfs: pass the goal of the incore inode walk to xfs_inode_walk()
As part of removing the indirect calls and radix tag implementation
details from the incore inode walk loop, create an enum to represent the
goal of the inode iteration.  More immediately, this separate removes
the need for the "ICI_NOTAG" define which makes little sense.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:02 -07:00
Darrick J. Wong c1115c0cba xfs: rename xfs_inode_walk functions to xfs_icwalk
Shorten the prefix so that all the incore inode cache walk code has
"xfs_icwalk" in the name somewhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:02 -07:00
Darrick J. Wong df60019739 xfs: move the inode walk functions further down
Move the inode walk functions further down in the file to limit the
forward declarations to the two walk functions as we add new code that
uses the inode walks.  We'll clean them out later (i.e. after the
deferred inode inactivation series).

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:02 -07:00
Darrick J. Wong 3ea06d73e3 xfs: detach inode dquots at the end of inactivation
Once we're done with inactivating an inode, we're finished updating
metadata for that inode.  This means that we can detach the dquots at
the end and not have to wait for reclaim to do it for us.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:02 -07:00
Darrick J. Wong 1ad2cfe0a5 xfs: move the quotaoff dqrele inode walk into xfs_icache.c
The only external caller of xfs_inode_walk* happens in quotaoff, when we
want to walk all the incore inodes to detach the dquots.  Move this code
to xfs_icache.c so that we can hide xfs_inode_walk as the starting step
in more cleanups of inode walks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03 15:56:02 -07:00
Linus Torvalds ec95502396 io_uring-5.13-2021-06-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmC5BrwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq3tD/9FGANoxDDpLbQg/FCiK1pNoSf0EyoEWSdg
 ysTF5KPAPC3msQOmuPYwRZfRFCkvtOHmrexPAZAaorCxEYPjiVAZ9b/a0hBC4Zc1
 vVW8RcTp6hSonAp1kk6VgLEHulJMcLANjAx3Me3NDRB/g0KGW5gevqkUXIJ+nXiR
 nqZcxaK7MD90v74IomO7y4P1GgwCbRhKYUL0JGQ4tXndYLxBYJnXBSnIKS2WdLZD
 PCBf+TDDFAZeioueZ/GrXRhWBmy97j8sEKUJLRqjI5YG8VVZSofgPlwNBi1e42C8
 l3ZEmXldyk18O8KDsZCI2E8axt62gLjuD7Tu6+gv0GBJTcdyXP/FaZYbkBMWjMBH
 yq4Dk4QyJWfMFHJ886ukbGpwj1HJT1cJqzg4UUdkV3BlMNKtmZD8XrKTBw4HcPww
 EmB+yywRiuH+XqamxPglFXUEOa4bJH/EAsQ0R5NNxAT/X/9iIOLUDBDAGvtWtBr0
 7cz+7jTQchqmV11gN+JcgN2LvG14m6Xq4Xtv5oHhIy/FHbRCNPPrC7KJ22TOBSaD
 d9mS5VM12+O9r9plYW7Cqdhdhnho/7/VfB+puiHg/lVcsXrMlrr0sc/WyrUixZeL
 AUlhDtmoROcyFpdcA49LCBEFvacu13ivEstkxIonx997Ct4MW7joYds2YfHCfuoO
 YlPVGdqeag==
 =3m3m
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.13-2021-06-03' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
 "Just a single one-liner fix for an accounting regression in this
  release"

* tag 'io_uring-5.13-2021-06-03' of git://git.kernel.dk/linux-block:
  io_uring: fix misaccounting fix buf pinned pages
2021-06-03 11:41:00 -07:00
Linus Torvalds fd2ff2774e for-5.13-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmC435cACgkQxWXV+ddt
 WDuh5w/+IGfsUFfKikJZpZUP7q/2gC0t0dzZemxeZMutJbT/KCZCDd4CjLf6YH6r
 oV9uYIgOWGd3aem9fe0R60ErJ4htgszIgeydCw3s2EuTms6WvAVA6Wp+wK/3UNx3
 vQgYsqYkhMzIYKm/D4q8G+bqA2nPbBTDRNsXDIDrZYONxwSb+dNbQCGVknBRzRPa
 hiCqYhUSyXA7E6UZdlma7MvpDOquZN+iW3RRVx1AULLqVs01PCnG/CEN+0oQm2JE
 r9IyRxOZUvSeW6opT80yzZFCoboNSduMjPENTfzLY6Q1xzS/EtP4kM86fB/7AoJv
 UI0c3Sr84SC9vOsBsbGJaBHpxP3OpzxohKU///jVQgEDpGv4STPlkVfxk23BHcux
 Fdfg7wodkXeLU1Ff4dlJhvCqNYqc5V8lT5Kl52ai9Scct6D4yZBAq4KJp2LmYFC0
 cHv6xFxBUv5zFZP1j6NMOmiLlCdDEkOruku2mMweQOBWYW/lHYNU469V5RCvfbLl
 HlbDrtZdnQ3m2IhpQrXiTnT47Ib4DPYWkhRVfWbyVJHA+CbcOV62RQfl+r95Bc7j
 FB1gM5vwUTJV7wgzErrq7+BD8quxG6/NuLDFjHYRcIj1kSIMK4/I1fOWruzuK+CL
 6n7LLvBOojYfFo+ruQMSp2imDn3JJucBuh0/ssOlUWl2zsy6lDA=
 =8066
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Error handling improvements, caught by error injection:

   - handle errors during checksum deletion

   - set error on mapping when ordered extent io cannot be finished

   - inode link count fixup in tree-log

   - missing return value checks for inode updates in tree-log

   - abort transaction in rename exchange if adding second reference
     fails

  Fixes:

   - fix fsync failure after writes to prealloc extents

   - fix deadlock when cloning inline extents and low on available space

   - fix compressed writes that cross stripe boundary"

* tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  MAINTAINERS: add btrfs IRC link
  btrfs: fix deadlock when cloning inline extents and low on available space
  btrfs: fix fsync failure and transaction abort after writes to prealloc extents
  btrfs: abort in rename_exchange if we fail to insert the second ref
  btrfs: check error value from btrfs_update_inode in tree log
  btrfs: fixup error handling in fixup_inode_link_counts
  btrfs: mark ordered extent and inode with error if we fail to finish
  btrfs: return errors from btrfs_del_csums in cleanup_ref_head
  btrfs: fix error handling in btrfs_del_csums
  btrfs: fix compressed writes that cross stripe boundary
2021-06-03 11:37:14 -07:00
Ingo Molnar a9e906b71f Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-03 19:00:49 +02:00
Al Viro 8959a23924 fuse_fill_write_pages(): don't bother with iov_iter_single_seg_count()
another rudiment of fault-in originally having been limited to the
first segment, same as in generic_perform_write() and friends.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-06-03 10:34:55 -04:00
Trond Myklebust c3aba897c6 NFSv4: Fix second deadlock in nfs4_evict_inode()
If the inode is being evicted but has to return a layout first, then
that too can cause a deadlock in the corner case where the server
reboots.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-03 10:14:42 -04:00
Trond Myklebust dfe1fe75e0 NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
If the inode is being evicted, but has to return a delegation first,
then it can cause a deadlock in the corner case where the server reboots
before the delegreturn completes, but while the call to iget5_locked() in
nfs4_opendata_get_inode() is waiting for the inode free to complete.
Since the open call still holds a session slot, the reboot recovery
cannot proceed.

In order to break the logjam, we can turn the delegation return into a
privileged operation for the case where we're evicting the inode. We
know that in that case, there can be no other state recovery operation
that conflicts.

Reported-by: zhangxiaoxu (A) <zhangxiaoxu5@huawei.com>
Fixes: 5fcdfacc01 ("NFSv4: Return delegations synchronously in evict_inode")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-03 10:14:42 -04:00
Chuck Lever d1b5c230e9 NFS: FMODE_READ and friends are C macros, not enum types
Address a sparse warning:

  CHECK   fs/nfs/nfstrace.c
fs/nfs/nfstrace.c: note: in included file (through /home/cel/src/linux/rpc-over-tls/include/trace/trace_events.h, /home/cel/src/linux/rpc-over-tls/include/trace/define_trace.h, ...):
fs/nfs/./nfstrace.h:424:1: warning: incorrect type in initializer (different base types)
fs/nfs/./nfstrace.h:424:1:    expected unsigned long eval_value
fs/nfs/./nfstrace.h:424:1:    got restricted fmode_t [usertype]
fs/nfs/./nfstrace.h:425:1: warning: incorrect type in initializer (different base types)
fs/nfs/./nfstrace.h:425:1:    expected unsigned long eval_value
fs/nfs/./nfstrace.h:425:1:    got restricted fmode_t [usertype]
fs/nfs/./nfstrace.h:426:1: warning: incorrect type in initializer (different base types)
fs/nfs/./nfstrace.h:426:1:    expected unsigned long eval_value
fs/nfs/./nfstrace.h:426:1:    got restricted fmode_t [usertype]

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-03 10:14:42 -04:00
Dan Carpenter 09226e8303 NFS: Fix a potential NULL dereference in nfs_get_client()
None of the callers are expecting NULL returns from nfs_get_client() so
this code will lead to an Oops.  It's better to return an error
pointer.  I expect that this is dead code so hopefully no one is
affected.

Fixes: 31434f496a ("nfs: check hostname in nfs_get_client")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-03 10:14:42 -04:00
Anna Schumaker 476bdb04c5 NFS: Fix use-after-free in nfs4_init_client()
KASAN reports a use-after-free when attempting to mount two different
exports through two different NICs that belong to the same server.

Olga was able to hit this with kernels starting somewhere between 5.7
and 5.10, but I traced the patch that introduced the clear_bit() call to
4.13. So something must have changed in the refcounting of the clp
pointer to make this call to nfs_put_client() the very last one.

Fixes: 8dcbec6d20 ("NFSv41: Handle EXCHID4_FLAG_CONFIRMED_R during NFSv4.1 migration")
Cc: stable@vger.kernel.org # 4.13+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-03 10:14:42 -04:00
Scott Mayhew 0b4f132b15 NFS: Ensure the NFS_CAP_SECURITY_LABEL capability is set when appropriate
Commit ce62b114bb ("NFS: Split attribute support out from the server
capabilities") removed the logic from _nfs4_server_capabilities() that
sets the NFS_CAP_SECURITY_LABEL capability based on the presence of
FATTR4_WORD2_SECURITY_LABEL in the attr_bitmask of the server's response.
Now NFS_CAP_SECURITY_LABEL is never set, which breaks labelled NFS.

This was replaced with logic that clears the NFS_ATTR_FATTR_V4_SECURITY_LABEL
bit in the newly added fattr_valid field based on the absence of
FATTR4_WORD2_SECURITY_LABEL in the attr_bitmask of the server's response.
This essentially has no effect since there's nothing looks for that bit
in fattr_supported.

So revert that part of the commit, but adding the logic that sets
NFS_CAP_SECURITY_LABEL near where the other capabilities are set in
_nfs4_server_capabilities().

Fixes: ce62b114bb ("NFS: Split attribute support out from the server capabilities")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-03 10:14:42 -04:00
Ritesh Harjani b45f189a19 ext4: fix accessing uninit percpu counter variable with fast_commit
When running generic/527 with fast_commit configuration, the following
issue is seen on Power.  With fast_commit, during ext4_fc_replay()
(which can be called from ext4_fill_super()), if inode eviction
happens then it can access an uninitialized percpu counter variable.

This patch adds the check before accessing the counters in
ext4_free_inode() path.

[  321.165371] run fstests generic/527 at 2021-04-29 08:38:43
[  323.027786] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: block_validity. Quota mode: none.
[  323.618772] BUG: Unable to handle kernel data access on read at 0x1fbd80000
[  323.619767] Faulting instruction address: 0xc000000000bae78c
cpu 0x1: Vector: 300 (Data Access) at [c000000010706ef0]
    pc: c000000000bae78c: percpu_counter_add_batch+0x3c/0x100
    lr: c0000000006d0bb0: ext4_free_inode+0x780/0xb90
    pid   = 5593, comm = mount
	ext4_free_inode+0x780/0xb90
	ext4_evict_inode+0xa8c/0xc60
	evict+0xfc/0x1e0
	ext4_fc_replay+0xc50/0x20f0
	do_one_pass+0xfe0/0x1350
	jbd2_journal_recover+0x184/0x2e0
	jbd2_journal_load+0x1c0/0x4a0
	ext4_fill_super+0x2458/0x4200
	mount_bdev+0x1dc/0x290
	ext4_mount+0x28/0x40
	legacy_get_tree+0x4c/0xa0
	vfs_get_tree+0x4c/0x120
	path_mount+0xcf8/0xd70
	do_mount+0x80/0xd0
	sys_mount+0x3fc/0x490
	system_call_exception+0x384/0x3d0
	system_call_common+0xec/0x278

Cc: stable@kernel.org
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/6cceb9a75c54bef8fa9696c1b08c8df5ff6169e2.1619692410.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-02 21:40:42 -04:00
Dave Chinner 977ec4ddf0 xfs: don't take a spinlock unconditionally in the DIO fastpath
Because this happens at high thread counts on high IOPS devices
doing mixed read/write AIO-DIO to a single file at about a million
iops:

   64.09%     0.21%  [kernel]            [k] io_submit_one
   - 63.87% io_submit_one
      - 44.33% aio_write
         - 42.70% xfs_file_write_iter
            - 41.32% xfs_file_dio_write_aligned
               - 25.51% xfs_file_write_checks
                  - 21.60% _raw_spin_lock
                     - 21.59% do_raw_spin_lock
                        - 19.70% __pv_queued_spin_lock_slowpath

This also happens of the IO completion IO path:

   22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
   - 22.49% xfs_dio_write_end_io
      - 21.79% _raw_spin_lock
         - 20.97% do_raw_spin_lock
            - 20.10% __pv_queued_spin_lock_slowpath

IOWs, fio is burning ~14 whole CPUs on this spin lock.

So, do an unlocked check against inode size first, then if we are
at/beyond EOF, take the spinlock and recheck. This makes the
spinlock disappear from the overwrite fastpath.

I'd like to report that fixing this makes things go faster. It
doesn't - it just exposes the the XFS_ILOCK as the next severe
contention point doing extent mapping lookups, and that now burns
all the 14 CPUs this spinlock was burning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 15:00:38 -07:00
Christoph Hellwig 5a981e4ea8 xfs: mark xfs_bmap_set_attrforkoff static
xfs_bmap_set_attrforkoff is only used inside of xfs_bmap.c, so mark it
static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 14:58:59 -07:00
Jiapeng Chong 9673261c32 xfs: Remove redundant assignment to busy
Variable busy is set to false, but this value is never read as it is
overwritten or not used later on, hence it is a redundant assignment
and can be removed.

Clean up the following clang-analyzer warning:

fs/xfs/libxfs/xfs_alloc.c:1679:2: warning: Value stored to 'busy' is
never read [clang-analyzer-deadcode.DeadStores].

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 14:56:29 -07:00
Shaokun Zhang 5f7fd75086 xfs: sort variable alphabetically to avoid repeated declaration
Variable 'xfs_agf_buf_ops', 'xfs_agi_buf_ops', 'xfs_dquot_buf_ops' and
'xfs_symlink_buf_ops' are declared twice, so sort these variables
alphabetically and remove the repeated declaration.

Cc: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 14:54:09 -07:00
Al Viro bc1bb416bb generic_perform_write()/iomap_write_actor(): saner logics for short copy
if we run into a short copy and ->write_end() refuses to advance at all,
use the amount we'd managed to copy for the next iteration to handle.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-06-02 17:50:44 -04:00
Al Viro 9067931236 ntfs_copy_from_user_iter(): don't bother with copying iov_iter
Advance the original, let the caller revert if it needs to.
Don't mess with iov_iter_single_seg_count() in the caller -
if we got a (non-zero) short copy, use the amount actually
copied for the next pass, limit it to "up to the end
of page" if nothing got copied at all.

Originally fault-in only read the first iovec; back then it used
to make sense to limit to the just one iovec for the pass after
short copy.  These days it's no long true.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-06-02 17:50:38 -04:00
Alexander Aring d10a0b8875 fs: dlm: rename socket and app buffer defines
This patch renames DEFAULT_BUFFER_SIZE to DLM_MAX_SOCKET_BUFSIZE and
LOWCOMMS_MAX_TX_BUFFER_LEN to DLM_MAX_APP_BUFSIZE as they are proper
names to define what's behind those values. The DLM_MAX_SOCKET_BUFSIZE
defines the maximum size of buffer which can be handled on socket layer,
the DLM_MAX_APP_BUFSIZE defines the maximum size of buffer which can be
handled by the DLM application layer.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-06-02 11:53:04 -05:00
Alexander Aring ac7d5d036d fs: dlm: introduce proto values
Currently the dlm protocol values are that TCP is 0 and everything else
is SCTP. This makes it difficult to introduce possible other transport
layers. The only one user space tool dlm_controld, which I am aware of,
handles the protocol value 1 for SCTP. We change it now to handle SCTP
as 1, this will break user space API but it will fix it so we can add
possible other transport layers.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-06-02 11:53:04 -05:00
Alexander Aring 9a4139a794 fs: dlm: move dlm allow conn
This patch checks if possible allowing new connections is allowed before
queueing the listen socket to accept new connections.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-06-02 11:53:04 -05:00
Alexander Aring 6c6a1cc666 fs: dlm: use alloc_ordered_workqueue
The proper way to allocate ordered workqueues is to use
alloc_ordered_workqueue() function. The current way implies an ordered
workqueue which is also required by dlm.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-06-02 11:53:04 -05:00
Alexander Aring 700ab1c363 fs: dlm: fix memory leak when fenced
I got some kmemleak report when a node was fenced. The user space tool
dlm_controld will therefore run some rmdir() in dlm configfs which was
triggering some memleaks. This patch stores the sps and cms attributes
which stores some handling for subdirectories of the configfs cluster
entry and free them if they get released as the parent directory gets
freed.

unreferenced object 0xffff88810d9e3e00 (size 192):
  comm "dlm_controld", pid 342, jiffies 4294698126 (age 55438.801s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 73 70 61 63 65 73 00 00  ........spaces..
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000db8b640b>] make_cluster+0x5d/0x360
    [<000000006a571db4>] configfs_mkdir+0x274/0x730
    [<00000000b094501c>] vfs_mkdir+0x27e/0x340
    [<0000000058b0adaf>] do_mkdirat+0xff/0x1b0
    [<00000000d1ffd156>] do_syscall_64+0x40/0x80
    [<00000000ab1408c8>] entry_SYSCALL_64_after_hwframe+0x44/0xae
unreferenced object 0xffff88810d9e3a00 (size 192):
  comm "dlm_controld", pid 342, jiffies 4294698126 (age 55438.801s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 63 6f 6d 6d 73 00 00 00  ........comms...
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000a7ef6ad2>] make_cluster+0x82/0x360
    [<000000006a571db4>] configfs_mkdir+0x274/0x730
    [<00000000b094501c>] vfs_mkdir+0x27e/0x340
    [<0000000058b0adaf>] do_mkdirat+0xff/0x1b0
    [<00000000d1ffd156>] do_syscall_64+0x40/0x80
    [<00000000ab1408c8>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-06-02 11:53:04 -05:00
Alexander Aring fcef0e6c27 fs: dlm: fix lowcomms_start error case
This patch fixes the error path handling in lowcomms_start(). We need to
cleanup some static allocated data structure and cleanup possible
workqueue if these have started.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-06-02 11:53:04 -05:00
Dave Chinner 509201163f xfs: remove xfs_perag_t
Almost unused, gets rid of another typedef.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:51 +10:00
Dave Chinner f40aadb2bb xfs: use perag through unlink processing
Unlinked lists are held in the perag, and freeing of inodes needs to
be passed a perag, too, so look up the perag early in the unlink
processing and use it throughout.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-02 10:48:51 +10:00
Dave Chinner 8237fbf53d xfs: clean up and simplify xfs_dialloc()
Because it's a mess.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 309161f660 xfs: inode allocation can use a single perag instance
Now that we've internalised the two-phase inode allocation, we can
now easily make the AG selection and allocation atomic from the
perspective of a single perag context. This will ensure AGs going
offline/away cannot occur between the selection and allocation
steps.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner b652afd937 xfs: get rid of xfs_dir_ialloc()
This is just a simple wrapper around the per-ag inode allocation
that doesn't need to exist. The internal mechanism to select and
allocate within an AG does not need to be exposed outside
xfs_ialloc.c, and it being exposed simply makes it harder to follow
the code and simplify it.

This is simplified by internalising xf_dialloc_select_ag() and
xfs_dialloc_ag() into a single xfs_dialloc() function and then
xfs_dir_ialloc() can go away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 89b1f55a29 xfs: collapse AG selection for inode allocation
xfs_dialloc_select_ag() does a lot of repetitive work. It first
calls xfs_ialloc_ag_select() to select the AG to start allocation
attempts in, which can do up to two entire loops across the perags
that inodes can be allocated in. This is simply checking if there is
spce available to allocate inodes in an AG, and it returns when it
finds the first candidate AG.

xfs_dialloc_select_ag() then does it's own iterative walk across
all the perags locking the AGIs and trying to allocate inodes from
the locked AG. It also doesn't limit the search to mp->m_maxagi,
so it will walk all AGs whether they can allocate inodes or not.

Hence if we are really low on inodes, we could do almost 3 entire
walks across the whole perag range before we find an allocation
group we can allocate inodes in or report ENOSPC.

Because xfs_ialloc_ag_select() returns on the first candidate AG it
finds, we can simply do these checks directly in
xfs_dialloc_select_ag() before we lock and try to allocate inodes.
This reduces the inode allocation pass down to 2 perag sweeps at
most - one for aligned inode cluster allocation and if we can't
allocate full, aligned inode clusters anywhere we'll do another pass
trying to do sparse inode cluster allocation.

This also removes a big chunk of duplicate code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 4268547305 xfs: simplify xfs_dialloc_select_ag() return values
The only caller of xfs_dialloc_select_ag() will always return
-ENOSPC to it's caller if the agbp returned from
xfs_dialloc_select_ag() is NULL. IOWs, failure to find a candidate
AGI we can allocate inodes from is always an ENOSPC condition, so
move this logic up into xfs_dialloc_select_ag() so we can simplify
the return logic in this function.

xfs_dialloc_select_ag() now only ever returns 0 with a locked
agbp, or an error with no agbp.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 50f02fe333 xfs: remove agno from btree cursor
Now that everything passes a perag, the agno is not needed anymore.
Convert all the users to use pag->pag_agno instead and remove the
agno from the cursor. This was largely done as an automated search
and replace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 7b13c51551 xfs: use perag for ialloc btree cursors
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 289d38d22c xfs: convert allocbt cursors to use perags
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner a81a06211f xfs: convert refcount btree cursor to use perags
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner fa9c3c1973 xfs: convert rmap btree cursor to using a perag
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner be9fb17d88 xfs: add a perag to the btree cursor
Which will eventually completely replace the agno in it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-02 10:48:24 +10:00
Dave Chinner 58d43a7e32 xfs: pass perags around in fsmap data dev functions
Needs a [from, to] ranged AG walk, and the perag to be stuffed into
the info structure for callouts to use.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 30933120ad xfs: push perags through the ag reservation callouts
We currently pass an agno from the AG reservation functions to the
individual feature accounting functions, which in future may have to
do perag lookups to access per-AG state. Instead, pre-emptively
plumb the perag through from the highest AG reservation layer to the
feature callouts so they won't have to look it up again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-02 10:48:24 +10:00
Dave Chinner 45d0662117 xfs: pass perags through to the busy extent code
All of the callers of the busy extent API either have perag
references available to use so we can pass a perag to the busy
extent functions rather than having them have to do unnecessary
lookups.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 7f8d3b3ca6 xfs: convert secondary superblock walk to use perags
Clean up the last external manual AG walk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 6f4118fc64 xfs: convert xfs_iwalk to use perag references
Rather than manually walking the ags and passing agnunbers around,
pass the perag for the AG we are currently working on around in the
iwalk structure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 934933c3ee xfs: convert raw ag walks to use for_each_perag
Convert the raw walks to an iterator, pulling the current AG out of
pag->pag_agno instead of the loop iterator variable.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner f250eedcf7 xfs: make for_each_perag... a first class citizen
for_each_perag_tag() is defined in xfs_icache.c for local use.
Promote this to xfs_ag.h and define equivalent iteration functions
so that we can use them to iterate AGs instead to replace open coded
perag walks and perag lookups.

We also convert as many of the straight forward open coded AG walks
to use these iterators as possible. Anything that is not a direct
conversion to an iterator is ignored and will be updated in future
commits.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 07b6403a68 xfs: move perag structure and setup to libxfs/xfs_ag.[ch]
Move the xfs_perag infrastructure to the libxfs files that contain
all the per AG infrastructure. This helps set up for passing perags
around all the code instead of bare agnos with minimal extra
includes for existing files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 61aa005a5b xfs: prepare for moving perag definitions and support to libxfs
The perag structures really need to be defined with the rest of the
AG support infrastructure. The struct xfs_perag and init/teardown
has been placed in xfs_mount.[ch] because there are differences in
the structure between kernel and userspace. Mainly that userspace
doesn't have a lot of the internal stuff that the kernel has for
caches and discard and other such structures.

However, it makes more sense to move this to libxfs than to keep
this separation because we are now moving to use struct perags
everywhere in the code instead of passing raw agnumber_t values
about. Hence we shoudl really move the support infrastructure to
libxfs/xfs_ag.[ch].

To do this without breaking userspace, first we need to rearrange
the structures and code so that all the kernel specific code is
located together. This makes it simple for userspace to ifdef out
the all the parts it does not need, minimising the code differences
between kernel and userspace. The next commit will do the move...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Dave Chinner 9bbafc7191 xfs: move xfs_perag_get/put to xfs_ag.[ch]
They are AG functions, not superblock functions, so move them to the
appropriate location.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02 10:48:24 +10:00
Andreas Gruenbacher d5b8145455 Revert "gfs2: Fix mmap locking for write faults"
This reverts commit b7f55d928e.

As explained by Linus in [*], write faults on a mmap region are reads
from a filesysten point of view, so taking the inode glock exclusively
on write faults is incorrect.

Instead, when a page is marked writable, the .page_mkwrite vm operation
will be called, which is where the exclusive lock taking needs to
happen.  I got this wrong because of a broken test case that made me
believe .page_mkwrite isn't getting called when it actually is.

[*] https://lore.kernel.org/lkml/CAHk-=wj8EWr_D65i4oRSj2FTbrc6RdNydNNCGxeabRnwtoU=3Q@mail.gmail.com/

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-06-01 23:16:42 +02:00
Darrick J. Wong 20bd8e63f3 xfs: remove unnecessary shifts
The superblock verifier already validates that (1 << blocklog) ==
blocksize, so use the value directly instead of doing math.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-06-01 12:53:59 -07:00
Darrick J. Wong a7bcb147fe xfs: clean up open-coded fs block unit conversions
Replace some open-coded fs block unit conversions with the standard
conversion macro.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-06-01 12:53:59 -07:00
Allison Henderson 4fd084dbbd xfs: Clean up xfs_attr_node_addname_clear_incomplete
We can use the helper function xfs_attr_node_remove_name to reduce
duplicate code in this function

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 10:49:49 -07:00
Allison Henderson 0e6acf29db xfs: Remove xfs_attr_rmtval_set
This function is no longer used, so it is safe to remove

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-06-01 10:49:49 -07:00
Allison Henderson 8f502a4009 xfs: Add delay ready attr set routines
This patch modifies the attr set routines to be delay ready. This means
they no longer roll or commit transactions, but instead return -EAGAIN
to have the calling routine roll and refresh the transaction.  In this
series, xfs_attr_set_args has become xfs_attr_set_iter, which uses a
state machine like switch to keep track of where it was when EAGAIN was
returned. See xfs_attr.h for a more detailed diagram of the states.

Two new helper functions have been added: xfs_attr_rmtval_find_space and
xfs_attr_rmtval_set_blk.  They provide a subset of logic similar to
xfs_attr_rmtval_set, but they store the current block in the delay attr
context to allow the caller to roll the transaction between allocations.
This helps to simplify and consolidate code used by
xfs_attr_leaf_addname and xfs_attr_node_addname. xfs_attr_set_args has
now become a simple loop to refresh the transaction until the operation
is completed.  Lastly, xfs_attr_rmtval_remove is no longer used, and is
removed.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-01 10:49:48 -07:00
Allison Henderson 2b74b03c13 xfs: Add delay ready attr remove routines
This patch modifies the attr remove routines to be delay ready. This
means they no longer roll or commit transactions, but instead return
-EAGAIN to have the calling routine roll and refresh the transaction. In
this series, xfs_attr_remove_args is merged with
xfs_attr_node_removename become a new function, xfs_attr_remove_iter.
This new version uses a sort of state machine like switch to keep track
of where it was when EAGAIN was returned. A new version of
xfs_attr_remove_args consists of a simple loop to refresh the
transaction until the operation is completed. A new XFS_DAC_DEFER_FINISH
flag is used to finish the transaction where ever the existing code used
to.

Calls to xfs_attr_rmtval_remove are replaced with the delay ready
version __xfs_attr_rmtval_remove. We will rename
__xfs_attr_rmtval_remove back to xfs_attr_rmtval_remove when we are
done.

xfs_attr_rmtval_remove itself is still in use by the set routines (used
during a rename).  For reasons of preserving existing function, we
modify xfs_attr_rmtval_remove to call xfs_defer_finish when the flag is
set.  Similar to how xfs_attr_remove_args does here.  Once we transition
the set routines to be delay ready, xfs_attr_rmtval_remove is no longer
used and will be removed.

This patch also adds a new struct xfs_delattr_context, which we will use
to keep track of the current state of an attribute operation. The new
xfs_delattr_state enum is used to track various operations that are in
progress so that we know not to repeat them, and resume where we left
off before EAGAIN was returned to cycle out the transaction. Other
members take the place of local variables that need to retain their
values across multiple function calls.  See xfs_attr.h for a more
detailed diagram of the states.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 10:49:47 -07:00
Allison Henderson 3f562d092b xfs: Hoist node transaction handling
This patch basically hoists the node transaction handling around the
leaf code we just hoisted.  This will helps setup this area for the
state machine since the goto is easily replaced with a state since it
ends with a transaction roll.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-06-01 10:49:46 -07:00
Allison Henderson 83c6e70789 xfs: Hoist xfs_attr_leaf_addname
This patch hoists xfs_attr_leaf_addname into the calling function.  The
goal being to get all the code that will require state management into
the same scope. This isn't particularly aesthetic right away, but it is a
preliminary step to merging in the state machine code.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-01 10:49:45 -07:00
Allison Henderson 5d954cc09f xfs: Hoist xfs_attr_node_addname
This patch hoists the later half of xfs_attr_node_addname into
the calling function.  We do this because it is this area that
will need the most state management, and we want to keep such
code in the same scope as much as possible

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-06-01 10:49:45 -07:00
Allison Henderson 6ca5a4a1f5 xfs: Add helper xfs_attr_node_addname_find_attr
This patch separates the first half of xfs_attr_node_addname into a
helper function xfs_attr_node_addname_find_attr.  It also replaces the
restart goto with an EAGAIN return code driven by a loop in the calling
function.  This looks odd now, but will clean up nicly once we introduce
the state machine.  It will also enable hoisting the last state out of
xfs_attr_node_addname with out having to plumb in a "done" parameter to
know if we need to move to the next state or not.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-06-01 10:49:44 -07:00
Allison Henderson f0f7c502c7 xfs: Separate xfs_attr_node_addname and xfs_attr_node_addname_clear_incomplete
This patch separate xfs_attr_node_addname into two functions.  This will
help to make it easier to hoist parts of xfs_attr_node_addname that need
state management

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 10:49:43 -07:00
Allison Henderson 6286514b63 xfs: Refactor xfs_attr_set_shortform
This patch is actually the combination of patches from the previous
version (v18).  Initially patch 3 hoisted xfs_attr_set_shortform, and
the next added the helper xfs_attr_set_fmt. xfs_attr_set_fmt is similar
the old xfs_attr_set_shortform. It returns 0 when the attr has been set
and no further action is needed. It returns -EAGAIN when shortform has
been transformed to leaf, and the calling function should proceed the
set the attr in leaf form.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-06-01 10:49:42 -07:00
Allison Henderson a8490f699f xfs: Add xfs_attr_node_remove_name
This patch pulls a new helper function xfs_attr_node_remove_name out
of xfs_attr_node_remove_step.  This helps to modularize
xfs_attr_node_remove_step which will help make the delayed attribute
code easier to follow

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-01 10:48:41 -07:00
Allison Henderson 4126c06e25 xfs: Reverse apply 72b97ea40d
Originally we added this patch to help modularize the attr code in
preparation for delayed attributes and the state machine it requires.
However, later reviews found that this slightly alters the transaction
handling as the helper function is ambiguous as to whether the
transaction is diry or clean.  This may cause a dirty transaction to be
included in the next roll, where previously it had not.  To preserve the
existing code flow, we reverse apply this commit.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 10:48:19 -07:00
Dai Ngo f8849e206e NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
Currently if __nfs4_proc_set_acl fails with NFS4ERR_BADOWNER it
re-enables the idmapper by clearing NFS_CAP_UIDGID_NOMAP before
retrying again. The NFS_CAP_UIDGID_NOMAP remains cleared even if
the retry fails. This causes problem for subsequent setattr
requests for v4 server that does not have idmapping configured.

This patch modifies nfs4_proc_set_acl to detect NFS4ERR_BADOWNER
and NFS4ERR_BADNAME and skips the retry, since the kernel isn't
involved in encoding the ACEs, and return -EINVAL.

Steps to reproduce the problem:

 # mount -o vers=4.1,sec=sys server:/export/test /tmp/mnt
 # touch /tmp/mnt/file1
 # chown 99 /tmp/mnt/file1
 # nfs4_setfacl -a A::unknown.user@xyz.com:wrtncy /tmp/mnt/file1
 Failed setxattr operation: Invalid argument
 # chown 99 /tmp/mnt/file1
 chown: changing ownership of ‘/tmp/mnt/file1’: Invalid argument
 # umount /tmp/mnt
 # mount -o vers=4.1,sec=sys server:/export/test /tmp/mnt
 # chown 99 /tmp/mnt/file1
 #

v2: detect NFS4ERR_BADOWNER and NFS4ERR_BADNAME and skip retry
       in nfs4_proc_set_acl.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-06-01 13:16:17 -04:00
Jiapeng Chong 492109333c fs/jfs: Fix missing error code in lmLogInit()
The error code is missing in this code scenario, add the error code
'-EINVAL' to the return value 'rc.

Eliminate the follow smatch warning:

fs/jfs/jfs_logmgr.c:1327 lmLogInit() warn: missing error code 'rc'.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2021-06-01 10:29:12 -05:00
Christoph Hellwig ab4b57057d block: move bd_part_count to struct gendisk
The bd_part_count value only makes sense for whole devices, so move it
to struct gendisk and give it a more descriptive name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01 07:45:27 -06:00
Christoph Hellwig c8276b954d block: split __blkdev_put
Split __blkdev_put into one helper for the whole device, and one for
partitions as well as another shared helper for flushing the block
device inode mapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01 07:44:56 -06:00
Christoph Hellwig e54069acac block: move adjusting bd_part_count out of __blkdev_get
Keep in the callers and thus remove the for_part argument.  This mirrors
what is done on the blkdev_get side and slightly simplifies
blkdev_get_part as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@rehat.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01 07:44:56 -06:00
Christoph Hellwig a8698707a1 block: move bd_mutex to struct gendisk
Replace the per-block device bd_mutex with a per-gendisk open_mutex,
thus simplifying locking wherever we deal with partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01 07:44:32 -06:00
Christoph Hellwig 210a6d756f block: move sync_blockdev from __blkdev_put to blkdev_put
Do the early unlocked syncing even earlier to move more code out of
the recursive path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210525061301.2242282-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01 07:43:32 -06:00
Christoph Hellwig 362529d928 block: split __blkdev_get
Split __blkdev_get into one helper for the whole device, and one for
opening partitions.  This removes the (bounded) recursion when opening
a partition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210525061301.2242282-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01 07:43:32 -06:00
Christian Brauner dd8b477f9a
mount: Support "nosymfollow" in new mount api
Commit dab741e0e0 ("Add a "nosymfollow" mount option.") added support
for the "nosymfollow" mount option allowing to block following symlinks
when resolving paths. The mount option so far was only available in the
old mount api. Make it available in the new mount api as well. Bonus is
that it can be applied to a whole subtree not just a single mount.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Mattias Nissler <mnissler@chromium.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <zwisler@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-06-01 12:09:27 +02:00
Dave Chinner e7d236a6fe xfs: move page freeing into _xfs_buf_free_pages()
Rather than open coding it just before we call
_xfs_buf_free_pages(). Also, rename the function to
xfs_buf_free_pages() as the leading underscore has no useful
meaning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 13:40:36 +10:00
Dave Chinner 02c5117386 xfs: merge _xfs_buf_get_pages()
Only called from one place now, so merge it into
xfs_buf_alloc_pages(). Because page array allocation is dependent on
bp->b_pages being null, always ensure that when the pages array is
freed we always set bp->b_pages to null.

Also convert the page array to use kmalloc() rather than
kmem_alloc() so we can use the gfp flags we've already calculated
for the allocation context instead of hard coding KM_NOFS semantics.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 13:40:36 +10:00
Dave Chinner c9fa563072 xfs: use alloc_pages_bulk_array() for buffers
Because it's more efficient than allocating pages one at a time in a
loop.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 13:40:36 +10:00
Dave Chinner 07b5c5add4 xfs: use xfs_buf_alloc_pages for uncached buffers
Use the newly factored out page allocation code. This adds
automatic buffer zeroing for non-read uncached buffers.

This also allows us to greatly simply the error handling in
xfs_buf_get_uncached(). Because xfs_buf_alloc_pages() cleans up
partial allocation failure, we can just call xfs_buf_free() in all
error cases now to clean up after failures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 13:40:35 +10:00
Dave Chinner 0a683794ac xfs: split up xfs_buf_allocate_memory
Based on a patch from Christoph Hellwig.

This splits out the heap allocation and page allocation portions of
the buffer memory allocation into two separate helper functions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01 13:40:02 +10:00
Linus Torvalds c2131f7e73 Various gfs2 fixes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmC0vkAUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqCGQ/+JiCdfHQao3/W9KsIeA5YO5fbsQXi
 tElY61L4eM7F+gEe1mbMzr8sefbejv73aAMGWJJD06gLPz/wIPeW/fYnC4/gcQEn
 +jLjVb7taGaxOn0fioCqjU+esGW4wstYrAXLp6XZmLnMETmr7PCbOhohRG7sK1TX
 m8si6riMOiNw20MOHhUK9DFZ3rF4Q5Rp/vYaTDwoGpcORIv5bpPoQKYT1FMCTD9h
 5qI6ldOO2E4d9qXQXiCv2RqXElYqQxwxqvGP0Hj+HQLZQBCmJJYZNDqRwDunJTaN
 K9++1/XbTCFKEQz0UWz1x1k5fCDIewbzxX348aQjiLMkkpXr885AGhasAX8gRS3p
 D7Y4q6VCY3J5JzlCDfNWrTBd0abLJAjeJ70R71/kN/hgIY2PbU/CaPcyhUrp7rwH
 B6spZDXb2fBNdfYA5wmuUdSA9BRmw/MDpiGd9aQc5nv25YvZ5Apl9X4QSH2250vo
 MKTrlt90EyTmOgF6vRf28apVr41JO3PIXgMu+svZq769Ox2jSZJQT0UI4vzVThoP
 RGBsTDPtDL67OvNoC6H7Poc7ad+BRqtFxkwNCz7kkcwQlYkmVPUf49UC/pBnBV3M
 HtlkJdlhD7VEWqUPl3T02rTdLRXLuPIGw9Kk6gKiDCikONoD+icJ3fV7rWShMjhD
 O/KT/r3XM1V1sP8=
 =vV+Y
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:
 "Various gfs2 fixes"

* tag 'gfs2-v5.13-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix use-after-free in gfs2_glock_shrink_scan
  gfs2: Fix mmap locking for write faults
  gfs2: Clean up revokes on normal withdraws
  gfs2: fix a deadlock on withdraw-during-mount
  gfs2: fix scheduling while atomic bug in glocks
  gfs2: Fix I_NEW check in gfs2_dinode_in
  gfs2: Prevent direct-I/O write fallback errors from getting lost
2021-05-31 05:57:22 -10:00
Linus Torvalds 36c795513a \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmC0vhsACgkQnJ2qBz9k
 QNlI9ggAjZSqIvNNs1w6VafSRY7XP5vItKAe0jhguD0o1ZtUI1gM1JlOJzbgt2z5
 gpm/4v4485h5JUXNrB5TeQ1woOOvFKzlUcIr+ZgUiyq2UgZj6PzvK599u2TFf1vc
 gLMAUx5YgWafr048orhcSBqaYic04LESQ17op+9UjgBB7ATbNjJmEBb/+WvGh9os
 8c4V9JrCTMdNJ5Rpc5+JsWAksgZKrW9VjTw8mHisWB0NIIPQWGCML8Z4ACzNObCW
 CrXL9xWgaQDov1okJSA0ZNkdatGhh4h/NxIZ2sLGg2F3bDfZwN+kFu6gqpxhTEVV
 v83aTAP3UxbK8bwRj0+lm/LImxULjA==
 =t4P5
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fixes from Jan Kara:
 "A fix for permission checking with fanotify unpriviledged groups.

  Also there's a small update in MAINTAINERS file for fanotify"

* tag 'fsnotify_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: fix permission model of unprivileged group
  MAINTAINERS: Add Matthew Bobrowski as a reviewer
2021-05-31 05:52:22 -10:00
Hillf Danton 1ab19c5de4 gfs2: Fix use-after-free in gfs2_glock_shrink_scan
The GLF_LRU flag is checked under lru_lock in gfs2_glock_remove_from_lru() to
remove the glock from the lru list in __gfs2_glock_put().

On the shrink scan path, the same flag is cleared under lru_lock but because
of cond_resched_lock(&lru_lock) in gfs2_dispose_glock_lru(), progress on the
put side can be made without deleting the glock from the lru list.

Keep GLF_LRU across the race window opened by cond_resched_lock(&lru_lock) to
ensure correct behavior on both sides - clear GLF_LRU after list_del under
lru_lock.

Reported-by: syzbot <syzbot+34ba7ddbf3021981a228@syzkaller.appspotmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-05-31 12:03:28 +02:00
Greg Kroah-Hartman 92722bac5f Merge 5.13-rc4 into driver-core-next
We need the driver core fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-31 09:10:03 +02:00
Linus Torvalds 75b9c727af Fixes for 5.13-rc4:
- Fix a bug where unmapping operations end earlier than expected, which
   can cause chaos on multi-block directory and symlink shrink
   operations.
 - Fix an erroneous assert that can trigger if we try to transition a
   bmap structure from btree format to extents format with zero extents.
   This was exposed by xfs/538.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmCvxs0ACgkQ+H93GTRK
 tOsfAQ//fAtDZkjKYKHhWUFoyG6kYNsIZr7wf+kow8jJgeWUibwtUYYQQV/RCRtJ
 zR+Tiys9ZorAReYpzq69s1LbZg/Zz1GT4bPgq/9Icni9x8EXIS6MVaWJHjnkFtKD
 7IqDztV3tC3XgSuAEsjey5PA1V1xpSgxxVtaT1Q2BcY8zqf2bnEPzM/rpKdmE++x
 jlTYrgLBctI24nbmTX2Y/+Te1UWXjM4QiV/EBHiUPedAqJZhwA0hU7hJJv/9I/EG
 /GOjisxhAonKR7fr7wPE+LwJMaxxK4LAt3aLZmGKpm3smSYX8O6sGnJv9VI/stsS
 wRD9c3wzLvfmqL5MXeAYq83u3s5DuFsfqmYD2U49xHlFF9tvLTT5S0Pdi/Qiq962
 n3wabi0slBCdzeY3xXXr9M4cCLL6utYY8Vfi7KvBiDHdtCRZUU33/SAwxZzvhHQv
 0XN+2sqnIn3jM9xg342+/BZbi4+SX7h28qixmgxCo+hez96GHuwhdN5GUVa5lF0r
 4uRPn+VVaOJPcNRx69/iTkrJ1R4YPqedCkgLShs6lZX5Ct92UtANLzqmm0xCZ6U5
 Pe7WjXO6aVAugEsVd9qnPdx4o/+sabs6CEQ65BbYyKvOBVg1XWH0yUEeKyGRYDt9
 21ol7QSKJ6+HIg473j8A3omM5s2XKUT8NZMmRm4EojNI+j4O3R4=
 =y7+K
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "This week's pile mitigates some decades-old problems in how extent
  size hints interact with realtime volumes, fixes some failures in
  online shrink, and fixes a problem where directory and symlink
  shrinking on extremely fragmented filesystems could fail.

  The most user-notable change here is to point users at our (new) IRC
  channel on OFTC. Freedom isn't free, it costs folks like you and me;
  and if you don't kowtow, they'll expel everyone and take over your
  channel. (Ok, ok, that didn't fit the song lyrics...)

  Summary:

   - Fix a bug where unmapping operations end earlier than expected,
     which can cause chaos on multi-block directory and symlink shrink
     operations.

   - Fix an erroneous assert that can trigger if we try to transition a
     bmap structure from btree format to extents format with zero
     extents. This was exposed by xfs/538"

* tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: bunmapi has unnecessary AG lock ordering issues
  xfs: btree format inode forks can have zero extents
  xfs: add new IRC channel to MAINTAINERS
  xfs: validate extsz hints against rt extent size when rtinherit is set
  xfs: standardize extent size hint validation
  xfs: check free AG space when making per-AG reservations
2021-05-29 17:47:19 -10:00
Pavel Begunkov 216e583596 io_uring: fix misaccounting fix buf pinned pages
As Andres reports "... io_sqe_buffer_register() doesn't initialize imu.
io_buffer_account_pin() does imu->acct_pages++, before calling
io_account_mem(ctx, imu->acct_pages).", leading to evevntual -ENOMEM.

Initialise the field.

Reported-by: Andres Freund <andres@anarazel.de>
Fixes: 41edf1a5ec ("io_uring: keep table of pointers to ubufs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/438a6f46739ae5e05d9c75a0c8fa235320ff367c.1622285901.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-29 19:27:21 -06:00
Linus Torvalds e1a9e3db3b Driver core fixes for 5.13-rc4
Here are 3 small driver core / debugfs fixes for 5.13-rc4:
   - debugfs fix for incorrect "lockdown" mode for selinux accesses
   - 2 device link changes, one bugfix and one cleanup
 
 All of these have been in linux-next for over a week with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYLJMrQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynf8ACgvsZCX7Wi3GYtFovfomHsCRKpZBsAn0sqfSAL
 TXHePEnj2tJ5c22TSqSt
 =Zx6Z
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are three small driver core / debugfs fixes for 5.13-rc4:

   - debugfs fix for incorrect "lockdown" mode for selinux accesses

   - two device link changes, one bugfix and one cleanup

  All of these have been in linux-next for over a week with no reported
  problems"

* tag 'driver-core-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  drivers: base: Reduce device link removal code duplication
  drivers: base: Fix device link removal
  debugfs: fix security_locked_down() call for SELinux
2021-05-29 06:33:28 -10:00
Linus Torvalds b3dbbae609 io_uring-5.13-2021-05-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCxY4wQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqJnD/sEHg2ZVzc3CUtvLI11C+O4nkqzUpetOD8I
 iKtvCYKYNTATOPLGQjsznNTTVcUhN4Mud9XWHjyR3nli98fwRrzLuK3EfJjuq1cL
 v6DZVuYKq4k6s0QN6K8yTMslYBQTmk85l8rvXs06jVqDadnnVc+JdfWWBDducs0e
 56Wtmlse18PhzfDjqtsjAOQBjpv4bhQaJTrYOHcEIqFiih2ZpSvyP3SLED7/nvoe
 Q8MNF0Htff/oVbUEzp/NfhHoOFIZ17wwPV3fRC7zat2Dp4R9ZxpScmozLn8PkdO9
 DW+rKpuCbYTYwY1p11cQ5EhiNWNfPMxX4YXovUP9z+M2cgGUK1IhWQRM83L9bAXt
 r/9Md5WjnNpeDr6/YW6uMe1lOrrEy2ZJfNJ2JJbiXo6CWiz+g2qfHLOxwVsEnfoy
 vZoSbDD8ItZDooaXDFGEp1PLpkka4vt/6Ebg0fUtEeG8QQ48eG5L9xpPMSjm90y9
 /UKZdS1pvSl/x6he+RDPg4aVGBWIhGJhv+Q22hNTO3g5u5QE+hXLvFh0QvoOkDQK
 FGlhIa431EiOdm3rdFCG2I4kH1QzQTO6XLHpoVabGXJULPvS2ztnHCz3pYqOU9w1
 Mh12t1RtWzvcTkyOutfsjVqszV3kTl6O6GkI8CiqqjomnbbfORj6CDsi7h9RFZI+
 HtnY2GbSJg==
 =dfLl
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A few minor fixes:

   - Fix an issue with hashed wait removal on exit (Zqiang, Pavel)

   - Fix a recent data race introduced in this series (Marco)"

* tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-block:
  io_uring: fix data race to avoid potential NULL-deref
  io-wq: Fix UAF when wakeup wqe in hash waitqueue
  io_uring/io-wq: close io-wq full-stop gap
2021-05-28 14:35:55 -10:00
Linus Torvalds 7c0ec89d31 3 SMB3 fixes, two for stable, and the other fixes a problem pointed out with a recently added ioctl
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmCxbv4ACgkQiiy9cAdy
 T1GkPwwAqq0tvNDZnQ6aAur1jwHMiOIAydvpgNKXlKkiXYu+qQbMRgOdUsOtjM00
 Idi8PHKkID2X63HeiwLwoQTfTXGcs6I4UM1iOslEs2ZaX+Fkgo5PG25lIFRTAsqR
 tqYGqGi6yL6TWE7PlVJxr3QuwGeMr7B8X9A0lTZ7YJwslhByK8ymasPdF+jSgPQI
 zDOuAeXiYvlph8sCftWX7gF34aBfKgiH8LhA6M2SY5S16g7LwtXUJjq1PJactoD3
 +nEPyCtRoN6ohScKNVnM8JDpOKIrM+mJ42RG28ZLo6//8so0SFcUdC8VhECBxOWL
 9WkoL2GxRV0LoRnzCZS30EpAi/eQU+QlTrPueGp+n8GjauJMDPoxJ2l6UXox6CLm
 8CqwxKATG6WbrdcGhaVbIxVbAWC7Ze271C/7L61R5K+RmDTXc6jI4vAIw1Pib4o+
 CG6XtxHya5PM0zvyLgU28M6aY+WExbwnkSQKvI2FJZkOVG0xdFCy2O1QLLRBChmn
 a6hsA05a
 =8nZ2
 -----END PGP SIGNATURE-----

Merge tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three SMB3 fixes.

  Two for stable, and the other fixes a problem pointed out with a
  recently added ioctl"

* tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: change format of CIFS_FULL_KEY_DUMP ioctl
  cifs: fix string declarations and assignments in tracepoints
  cifs: set server->cipher_type to AES-128-CCM for SMB3.0
2021-05-28 14:15:47 -10:00
Linus Torvalds 5ff2756afd NFS client bugfixes for Linux 5.13
Highlights include:
 
 Stable fixes
 - Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config
 - Fix Oops in xs_tcp_send_request() when transport is disconnected
 - Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return()
 
 Bugfixes
 - Fix instances where signal_pending() should be fatal_signal_pending()
 - fix an incorrect limit in filelayout_decode_layout()
 - Fixes for the SUNRPC backlogged RPC queue
 - Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce()
 - Revert commit 586a0787ce ("Clean up rpcrdma_prepare_readch()")
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmCvomgACgkQZwvnipYK
 APLg6xAAqlR/HLNYOLAToQ6d9wzrL6Po3x8Lx7VURjCkFaoKB/jMq3Zbu/K8mZ+X
 CC6/XFtB9AikloK7sle6lRPuwwPL6y+vOML0Ais/dkYPNkbhe9ylf0rsYQiPljXT
 8PAcqn8FXTZ9fKpKU8Quw24X1Jfkk6zUEeMy50HYDBfTx+gYEojMEKa6cl4URGzO
 2JpuBO4Ku/vWDOPj7bWBX9wi7wkrJjGSYDnx1A5SOgUdV87H8VJkbTo9vVdEwFoE
 OtE8MQmFhdton0u9+MKImFQdVfxoYLB1Ig1G45NXGHee91dwfYU0U05THj7E/xP9
 RQWtmJcKdvY1w8sRK/PNEHo43Vkow4usffSrIWNBZ6aO5EkbQFn1tmKMSDtsrkZ2
 ONMfKBiEhhQSy+QRXMR/RC86t4dsQ8SApu62qQT4VuuXqzYhrBum2DqkW0X6Zcti
 gi17+PfjRbgWNvul2yegBvDU016H324aCeT9nfWe0D9iwF7tPK4xsuNTYrWwbFOA
 YFAecIXoyBRtbIV6NZ95/+P5HEFBLAYewEVLpdAOBGQ9fjO023ERiC2sitl5P+ku
 v6V+4HAtBgcPfm/8BZwUYUYBpXnnTZFTizqdJdGGydPXeC671gANPJe6e4xOttCK
 frXFGd9OOqPSXdsRZLUVvhczTOFOGa/UVVG0GxIr4ggy8oKjDMk=
 =66ks
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
"Stable fixes:
   - Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config
   - Fix Oops in xs_tcp_send_request() when transport is disconnected
   - Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return()

  Bugfixes:
   - Fix instances where signal_pending() should be fatal_signal_pending()
   - fix an incorrect limit in filelayout_decode_layout()
   - Fixes for the SUNRPC backlogged RPC queue
   - Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce()
   - Revert commit 586a0787ce ("Clean up rpcrdma_prepare_readch()")"

* tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: Remove trailing semicolon in macros
  xprtrdma: Revert 586a0787ce
  NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config
  NFS: Clean up reset of the mirror accounting variables
  NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce()
  NFS: Fix an Oopsable condition in __nfs_pageio_add_request()
  SUNRPC: More fixes for backlog congestion
  SUNRPC: Fix Oops in xs_tcp_send_request() when transport is disconnected
  NFSv4: Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return()
  SUNRPC in case of backlog, hand free slots directly to waiting task
  pNFS/NFSv4: Remove redundant initialization of 'rd_size'
  NFS: fix an incorrect limit in filelayout_decode_layout()
  fs/nfs: Use fatal_signal_pending instead of signal_pending
2021-05-28 08:53:19 -10:00
Christian Brauner cfe80306a0
open: don't silently ignore unknown O-flags in openat2()
The new openat2() syscall verifies that no unknown O-flag values are
set and returns an error to userspace if they are while the older open
syscalls like open() and openat() simply ignore unknown flag values:

  #define O_FLAG_CURRENTLY_INVALID (1 << 31)
  struct open_how how = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID,
          .resolve = 0,
  };

  /* fails */
  fd = openat2(-EBADF, "/dev/null", &how, sizeof(how));

  /* succeeds */
  fd = openat(-EBADF, "/dev/null", O_RDONLY | O_FLAG_CURRENTLY_INVALID);

However, openat2() silently truncates the upper 32 bits meaning:

  #define O_FLAG_CURRENTLY_INVALID_LOWER32 (1 << 31)
  #define O_FLAG_CURRENTLY_INVALID_UPPER32 (1 << 40)

  struct open_how how_lowe32 = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_LOWER32,
  };

  struct open_how how_upper32 = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_UPPER32,
  };

  /* fails */
  fd = openat2(-EBADF, "/dev/null", &how_lower32, sizeof(how_lower32));

  /* succeeds */
  fd = openat2(-EBADF, "/dev/null", &how_upper32, sizeof(how_upper32));

Fix this by preventing the immediate truncation in build_open_flags().

There's a snafu here though stripping FMODE_* directly from flags would
cause the upper 32 bits to be truncated as well due to integer promotion
rules since FMODE_* is unsigned int, O_* are signed ints (yuck).

In addition, struct open_flags currently defines flags to be 32 bit
which is reasonable. If we simply were to bump it to 64 bit we would
need to change a lot of code preemptively which doesn't seem worth it.
So simply add a compile-time check verifying that all currently known
O_* flags are within the 32 bit range and fail to build if they aren't
anymore.

This change shouldn't regress old open syscalls since they silently
truncate any unknown values anyway. It is a tiny semantic change for
openat2() but it is very unlikely people pass ing > 32 bit unknown flags
and the syscall is relatively new too.

Link: https://lore.kernel.org/r/20210528092417.3942079-3-brauner@kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reported-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-05-28 17:44:37 +02:00
Filipe Manana 76a6d5cd74 btrfs: fix deadlock when cloning inline extents and low on available space
There are a few cases where cloning an inline extent requires copying data
into a page of the destination inode. For these cases we are allocating
the required data and metadata space while holding a leaf locked. This can
result in a deadlock when we are low on available space because allocating
the space may flush delalloc and two deadlock scenarios can happen:

1) When starting writeback for an inode with a very small dirty range that
   fits in an inline extent, we deadlock during the writeback when trying
   to insert the inline extent, at cow_file_range_inline(), if the extent
   is going to be located in the leaf for which we are already holding a
   read lock;

2) After successfully starting writeback, for non-inline extent cases,
   the async reclaim thread will hang waiting for an ordered extent to
   complete if the ordered extent completion needs to modify the leaf
   for which the clone task is holding a read lock (for adding or
   replacing file extent items). So the cloning task will wait forever
   on the async reclaim thread to make progress, which in turn is
   waiting for the ordered extent completion which in turn is waiting
   to acquire a write lock on the same leaf.

So fix this by making sure we release the path (and therefore the leaf)
every time we need to copy the inline extent's data into a page of the
destination inode, as by that time we do not need to have the leaf locked.

Fixes: 05a5a7621c ("Btrfs: implement full reflink support for inline extents")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 23:31:52 +02:00
Filipe Manana ea7036de0d btrfs: fix fsync failure and transaction abort after writes to prealloc extents
When doing a series of partial writes to different ranges of preallocated
extents with transaction commits and fsyncs in between, we can end up with
a checksum items in a log tree. This causes an fsync to fail with -EIO and
abort the transaction, turning the filesystem to RO mode, when syncing the
log.

For this to happen, we need to have a full fsync of a file following one
or more fast fsyncs.

The following example reproduces the problem and explains how it happens:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  # Create our test file with 2 preallocated extents. Leave a 1M hole
  # between them to ensure that we get two file extent items that will
  # never be merged into a single one. The extents are contiguous on disk,
  # which will later result in the checksums for their data to be merged
  # into a single checksum item in the csums btree.
  #
  $ xfs_io -f \
           -c "falloc 0 1M" \
           -c "falloc 3M 3M" \
           /mnt/foobar

  # Now write to the second extent and leave only 1M of it as unwritten,
  # which corresponds to the file range [4M, 5M[.
  #
  # Then fsync the file to flush delalloc and to clear full sync flag from
  # the inode, so that a future fsync will use the fast code path.
  #
  # After the writeback triggered by the fsync we have 3 file extent items
  # that point to the second extent we previously allocated:
  #
  # 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
  #    file range [3M, 4M[
  #
  # 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
  #    the file range [4M, 5M[
  #
  # 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
  #    file range [5M, 6M[
  #
  # All these file extent items have a generation of 6, which is the ID of
  # the transaction where they were created. The split of the original file
  # extent item is done at btrfs_mark_extent_written() when ordered extents
  # complete for the file ranges [3M, 4M[ and [5M, 6M[.
  #
  $ xfs_io -c "pwrite -S 0xab 3M 1M" \
           -c "pwrite -S 0xef 5M 1M" \
           -c "fsync" \
           /mnt/foobar

  # Commit the current transaction. This wipes out the log tree created by
  # the previous fsync.
  sync

  # Now write to the unwritten range of the second extent we allocated,
  # corresponding to the file range [4M, 5M[, and fsync the file, which
  # triggers the fast fsync code path.
  #
  # The fast fsync code path sees that there is a new extent map covering
  # the file range [4M, 5M[ and therefore it will log a checksum item
  # covering the range [1M, 2M[ of the second extent we allocated.
  #
  # Also, after the fsync finishes we no longer have the 3 file extent
  # items that pointed to 3 sections of the second extent we allocated.
  # Instead we end up with a single file extent item pointing to the whole
  # extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
  # current transaction ID). This is due to the file extent item merging we
  # do when completing ordered extents into ranges that point to unwritten
  # (preallocated) extents. This merging is done at
  # btrfs_mark_extent_written().
  #
  $ xfs_io -c "pwrite -S 0xcd 4M 1M" \
           -c "fsync" \
           /mnt/foobar

  # Now do some write to our file outside the range of the second extent
  # that we allocated with fallocate() and truncate the file size from 6M
  # down to 5M.
  #
  # The truncate operation sets the full sync runtime flag on the inode,
  # forcing the next fsync to use the slow code path. It also changes the
  # length of the second file extent item so that it represents the file
  # range [3M, 5M[ and not the range [3M, 6M[ anymore.
  #
  # Finally fsync the file. Since this is a fsync that triggers the slow
  # code path, it will remove all items associated to the inode from the
  # log tree and then it will scan for file extent items in the
  # fs/subvolume tree that have a generation matching the current
  # transaction ID, which is 7. This means it will log 2 file extent
  # items:
  #
  # 1) One for the first extent we allocated, covering the file range
  #    [0, 1M[
  #
  # 2) Another for the first 2M of the second extent we allocated,
  #    covering the file range [3M, 5M[
  #
  # When logging the first file extent item we log a single checksum item
  # that has all the checksums for the entire extent.
  #
  # When logging the second file extent item, we also lookup for the
  # checksums that are associated with the range [0, 2M[ of the second
  # extent we allocated (file range [3M, 5M[), and then we log them with
  # btrfs_csum_file_blocks(). However that results in ending up with a log
  # that has two checksum items with ranges that overlap:
  #
  # 1) One for the range [1M, 2M[ of the second extent we allocated,
  #    corresponding to the file range [4M, 5M[, which we logged in the
  #    previous fsync that used the fast code path;
  #
  # 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
  #    extents, respectively, corresponding to the files ranges [0, 1M[
  #    and [3M, 5M[. This one was added during this last fsync that uses
  #    the slow code path and overlaps with the previous one logged by
  #    the previous fast fsync.
  #
  # This happens because when logging the checksums for the second
  # extent, we notice they start at an offset that matches the end of the
  # checksums item that we logged for the first extent, and because both
  # extents are contiguous on disk, btrfs_csum_file_blocks() decides to
  # extend that existing checksums item and append the checksums for the
  # second extent to this item. The end result is we end up with two
  # checksum items in the log tree that have overlapping ranges, as
  # listed before, resulting in the fsync to fail with -EIO and aborting
  # the transaction, turning the filesystem into RO mode.
  #
  $ xfs_io -c "pwrite -S 0xff 0 1M" \
           -c "truncate 5M" \
           -c "fsync" \
           /mnt/foobar
  fsync: Input/output error

After running the example, dmesg/syslog shows the tree checker complained
about the checksum items with overlapping ranges and we aborted the
transaction:

  $ dmesg
  (...)
  [756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
  [756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
  [756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
  [756289.563654] 	item 0 key (257 1 0) itemoff 16123 itemsize 160
  [756289.564649] 		inode generation 6 size 5242880 mode 100600
  [756289.565636] 	item 1 key (257 12 256) itemoff 16107 itemsize 16
  [756289.566694] 	item 2 key (257 108 0) itemoff 16054 itemsize 53
  [756289.567725] 		extent data disk bytenr 13631488 nr 1048576
  [756289.568697] 		extent data offset 0 nr 1048576 ram 1048576
  [756289.569689] 	item 3 key (257 108 1048576) itemoff 16001 itemsize 53
  [756289.570682] 		extent data disk bytenr 0 nr 0
  [756289.571363] 		extent data offset 0 nr 2097152 ram 2097152
  [756289.572213] 	item 4 key (257 108 3145728) itemoff 15948 itemsize 53
  [756289.573246] 		extent data disk bytenr 14680064 nr 3145728
  [756289.574121] 		extent data offset 0 nr 2097152 ram 3145728
  [756289.574993] 	item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
  [756289.576113] 	item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
  [756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
  [756289.578644] ------------[ cut here ]------------
  [756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
  [756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
  [756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G        W         5.12.0-rc8-btrfs-next-87 #1
  [756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
  [756289.595122] Code: 5d c3 e8 76 60 (...)
  [756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
  [756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
  [756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
  [756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
  [756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
  [756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
  [756289.602278] FS:  00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
  [756289.603217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
  [756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [756289.606400] Call Trace:
  [756289.606704]  btree_csum_one_bio+0x244/0x2b0 [btrfs]
  [756289.607313]  btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
  [756289.608040]  submit_one_bio+0x61/0x70 [btrfs]
  [756289.608587]  btree_write_cache_pages+0x587/0x610 [btrfs]
  [756289.609258]  ? free_debug_processing+0x1d5/0x240
  [756289.609812]  ? __module_address+0x28/0xf0
  [756289.610298]  ? lock_acquire+0x1a0/0x3e0
  [756289.610754]  ? lock_acquired+0x19f/0x430
  [756289.611220]  ? lock_acquire+0x1a0/0x3e0
  [756289.611675]  do_writepages+0x43/0xf0
  [756289.612101]  ? __filemap_fdatawrite_range+0xa4/0x100
  [756289.612800]  __filemap_fdatawrite_range+0xc5/0x100
  [756289.613393]  btrfs_write_marked_extents+0x68/0x160 [btrfs]
  [756289.614085]  btrfs_sync_log+0x21c/0xf20 [btrfs]
  [756289.614661]  ? finish_wait+0x90/0x90
  [756289.615096]  ? __mutex_unlock_slowpath+0x45/0x2a0
  [756289.615661]  ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
  [756289.616338]  ? lock_acquire+0x1a0/0x3e0
  [756289.616801]  ? lock_acquired+0x19f/0x430
  [756289.617284]  ? lock_acquire+0x1a0/0x3e0
  [756289.617750]  ? lock_release+0x214/0x470
  [756289.618221]  ? lock_acquired+0x19f/0x430
  [756289.618704]  ? dput+0x20/0x4a0
  [756289.619079]  ? dput+0x20/0x4a0
  [756289.619452]  ? lockref_put_or_lock+0x9/0x30
  [756289.619969]  ? lock_release+0x214/0x470
  [756289.620445]  ? lock_release+0x214/0x470
  [756289.620924]  ? lock_release+0x214/0x470
  [756289.621415]  btrfs_sync_file+0x46a/0x5b0 [btrfs]
  [756289.621982]  do_fsync+0x38/0x70
  [756289.622395]  __x64_sys_fsync+0x10/0x20
  [756289.622907]  do_syscall_64+0x33/0x80
  [756289.623438]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [756289.624063] RIP: 0033:0x7f08b27fbb7b
  [756289.624588] Code: 0f 05 48 3d 00 (...)
  [756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
  [756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
  [756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
  [756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
  [756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
  [756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
  [756289.631819] irq event stamp: 0
  [756289.632188] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  [756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
  [756289.633893] softirqs last  enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
  [756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
  [756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
  [756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
  [756289.637082] BTRFS info (device sdc): forced readonly

Having checksum items covering ranges that overlap is dangerous as in some
cases it can lead to having extent ranges for which we miss checksums
after log replay or getting the wrong checksum item. There were some fixes
in the past for bugs that resulted in this problem, and were explained and
fixed by the following commits:

  27b9a8122f ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
  b84b8390d6 ("Btrfs: fix file read corruption after extent cloning and fsync")
  40e046acbd ("Btrfs: fix missing data checksums after replaying a log tree")
  e289f03ea7 ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")

Fix the issue by making btrfs_csum_file_blocks() taking into account the
start offset of the next checksum item when it decides to extend an
existing checksum item, so that it never extends the checksum to end at a
range that goes beyond the start range of the next checksum item.

When we can not access the next checksum item without releasing the path,
simply drop the optimization of extending the previous checksum item and
fallback to inserting a new checksum item - this happens rarely and the
optimization is not significant enough for a log tree in order to justify
the extra complexity, as it would only save a few bytes (the size of a
struct btrfs_item) of leaf space.

This behaviour is only needed when inserting into a log tree because
for the regular checksums tree we never have a case where we try to
insert a range of checksums that overlap with a range that was previously
inserted.

A test case for fstests will follow soon.

Reported-by: Philipp Fent <fent@in.tum.de>
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
CC: stable@vger.kernel.org # 5.4+
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 23:31:36 +02:00
Josef Bacik dc09ef3562 btrfs: abort in rename_exchange if we fail to insert the second ref
Error injection stress uncovered a problem where we'd leave a dangling
inode ref if we failed during a rename_exchange.  This happens because
we insert the inode ref for one side of the rename, and then for the
other side.  If this second inode ref insert fails we'll leave the first
one dangling and leave a corrupt file system behind.  Fix this by
aborting if we did the insert for the first inode ref.

CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 23:31:16 +02:00
Josef Bacik f96d44743a btrfs: check error value from btrfs_update_inode in tree log
Error injection testing uncovered a case where we ended up with invalid
link counts on an inode.  This happened because we failed to notice an
error when updating the inode while replaying the tree log, and
committed the transaction with an invalid file system.

Fix this by checking the return value of btrfs_update_inode.  This
resolved the link count errors I was seeing, and we already properly
handle passing up the error values in these paths.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 23:31:13 +02:00
Josef Bacik 011b28acf9 btrfs: fixup error handling in fixup_inode_link_counts
This function has the following pattern

	while (1) {
		ret = whatever();
		if (ret)
			goto out;
	}
	ret = 0
out:
	return ret;

However several places in this while loop we simply break; when there's
a problem, thus clearing the return value, and in one case we do a
return -EIO, and leak the memory for the path.

Fix this by re-arranging the loop to deal with ret == 1 coming from
btrfs_search_slot, and then simply delete the

	ret = 0;
out:

bit so everybody can break if there is an error, which will allow for
proper error handling to occur.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 23:31:08 +02:00
Josef Bacik d61bec08b9 btrfs: mark ordered extent and inode with error if we fail to finish
While doing error injection testing I saw that sometimes we'd get an
abort that wouldn't stop the current transaction commit from completing.
This abort was coming from finish ordered IO, but at this point in the
transaction commit we should have gotten an error and stopped.

It turns out the abort came from finish ordered io while trying to write
out the free space cache.  It occurred to me that any failure inside of
finish_ordered_io isn't actually raised to the person doing the writing,
so we could have any number of failures in this path and think the
ordered extent completed successfully and the inode was fine.

Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and
marking the mapping of the inode with mapping_set_error, so any callers
that simply call fdatawait will also get the error.

With this we're seeing the IO error on the free space inode when we fail
to do the finish_ordered_io.

CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 23:31:01 +02:00
Josef Bacik 856bd270dc btrfs: return errors from btrfs_del_csums in cleanup_ref_head
We are unconditionally returning 0 in cleanup_ref_head, despite the fact
that btrfs_del_csums could fail.  We need to return the error so the
transaction gets aborted properly, fix this by returning ret from
btrfs_del_csums in cleanup_ref_head.

Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 23:30:55 +02:00
Josef Bacik b86652be7c btrfs: fix error handling in btrfs_del_csums
Error injection stress would sometimes fail with checksums on disk that
did not have a corresponding extent.  This occurred because the pattern
in btrfs_del_csums was

	while (1) {
		ret = btrfs_search_slot();
		if (ret < 0)
			break;
	}
	ret = 0;
out:
	btrfs_free_path(path);
	return ret;

If we got an error from btrfs_search_slot we'd clear the error because
we were breaking instead of goto out.  Instead of using goto out, simply
handle the cases where we may leave a random value in ret, and get rid
of the

	ret = 0;
out:

pattern and simply allow break to have the proper error reporting.  With
this fix we properly abort the transaction and do not commit thinking we
successfully deleted the csum.

Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 23:30:49 +02:00
Qu Wenruo 4c80a97d7b btrfs: fix compressed writes that cross stripe boundary
[BUG]
When running btrfs/027 with "-o compress" mount option, it always
crashes with the following call trace:

  BTRFS critical (device dm-4): mapping failed logical 298901504 bio len 12288 len 8192
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/volumes.c:6651!
  invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 5 PID: 31089 Comm: kworker/u24:10 Tainted: G           OE     5.13.0-rc2-custom+ #26
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
  RIP: 0010:btrfs_map_bio.cold+0x58/0x5a [btrfs]
  Call Trace:
   btrfs_submit_compressed_write+0x2d7/0x470 [btrfs]
   submit_compressed_extents+0x3b0/0x470 [btrfs]
   ? mark_held_locks+0x49/0x70
   btrfs_work_helper+0x131/0x3e0 [btrfs]
   process_one_work+0x28f/0x5d0
   worker_thread+0x55/0x3c0
   ? process_one_work+0x5d0/0x5d0
   kthread+0x141/0x160
   ? __kthread_bind_mask+0x60/0x60
   ret_from_fork+0x22/0x30
  ---[ end trace 63113a3a91f34e68 ]---

[CAUSE]
The critical message before the crash means we have a bio at logical
bytenr 298901504 length 12288, but only 8192 bytes can fit into one
stripe, the remaining 4096 bytes go to another stripe.

In btrfs, all bios are properly split to avoid cross stripe boundary,
but commit 764c7c9a46 ("btrfs: zoned: fix parallel compressed writes")
changed the behavior for compressed writes.

Previously if we find our new page can't be fitted into current stripe,
ie. "submit == 1" case, we submit current bio without adding current
page.

       submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);

   page->mapping = NULL;
   if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
       PAGE_SIZE) {

But after the modification, we will add the page no matter if it crosses
stripe boundary, leading to the above crash.

       submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);

   if (pg_index == 0 && use_append)
           len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
   else
           len = bio_add_page(bio, page, PAGE_SIZE, 0);

   page->mapping = NULL;
   if (submit || len < PAGE_SIZE) {

[FIX]
It's no longer possible to revert to the original code style as we have
two different bio_add_*_page() calls now.

The new fix is to skip the bio_add_*_page() call if @submit is true.

Also to avoid @len to be uninitialized, always initialize it to zero.

If @submit is true, @len will not be checked.
If @submit is not true, @len will be the return value of
bio_add_*_page() call.
Either way, the behavior is still the same as the old code.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: 764c7c9a46 ("btrfs: zoned: fix parallel compressed writes")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-27 23:30:38 +02:00
Aurelien Aptel 1bb5681067 cifs: change format of CIFS_FULL_KEY_DUMP ioctl
Make CIFS_FULL_KEY_DUMP ioctl able to return variable-length keys.

* userspace needs to pass the struct size along with optional
  session_id and some space at the end to store keys
* if there is enough space kernel returns keys in the extra space and
  sets the length of each key via xyz_key_length fields

This also fixes the build error for get_user() on ARM.

Sample program:

	#include <stdlib.h>
	#include <stdio.h>
	#include <stdint.h>
	#include <sys/fcntl.h>
	#include <sys/ioctl.h>

	struct smb3_full_key_debug_info {
	        uint32_t   in_size;
	        uint64_t   session_id;
	        uint16_t   cipher_type;
	        uint8_t    session_key_length;
	        uint8_t    server_in_key_length;
	        uint8_t    server_out_key_length;
	        uint8_t    data[];
	        /*
	         * return this struct with the keys appended at the end:
	         * uint8_t session_key[session_key_length];
	         * uint8_t server_in_key[server_in_key_length];
	         * uint8_t server_out_key[server_out_key_length];
	         */
	} __attribute__((packed));

	#define CIFS_IOCTL_MAGIC 0xCF
	#define CIFS_DUMP_FULL_KEY _IOWR(CIFS_IOCTL_MAGIC, 10, struct smb3_full_key_debug_info)

	void dump(const void *p, size_t len) {
	        const char *hex = "0123456789ABCDEF";
	        const uint8_t *b = p;
	        for (int i = 0; i < len; i++)
	                printf("%c%c ", hex[(b[i]>>4)&0xf], hex[b[i]&0xf]);
	        putchar('\n');
	}

	int main(int argc, char **argv)
	{
	        struct smb3_full_key_debug_info *keys;
	        uint8_t buf[sizeof(*keys)+1024] = {0};
	        size_t off = 0;
	        int fd, rc;

	        keys = (struct smb3_full_key_debug_info *)&buf;
	        keys->in_size = sizeof(buf);

	        fd = open(argv[1], O_RDONLY);
	        if (fd < 0)
	                perror("open"), exit(1);

	        rc = ioctl(fd, CIFS_DUMP_FULL_KEY, keys);
	        if (rc < 0)
	                perror("ioctl"), exit(1);

	        printf("SessionId      ");
	        dump(&keys->session_id, 8);
	        printf("Cipher         %04x\n", keys->cipher_type);

	        printf("SessionKey     ");
	        dump(keys->data+off, keys->session_key_length);
	        off += keys->session_key_length;

	        printf("ServerIn Key   ");
	        dump(keys->data+off, keys->server_in_key_length);
	        off += keys->server_in_key_length;

	        printf("ServerOut Key  ");
	        dump(keys->data+off, keys->server_out_key_length);

	        return 0;
	}

Usage:

	$ gcc -o dumpkeys dumpkeys.c

Against Windows Server 2020 preview (with AES-256-GCM support):

	# mount.cifs //$ip/test /mnt -o "username=administrator,password=foo,vers=3.0,seal"
	# ./dumpkeys /mnt/somefile
	SessionId      0D 00 00 00 00 0C 00 00
	Cipher         0002
	SessionKey     AB CD CC 0D E4 15 05 0C 6F 3C 92 90 19 F3 0D 25
	ServerIn Key   73 C6 6A C8 6B 08 CF A2 CB 8E A5 7D 10 D1 5B DC
	ServerOut Key  6D 7E 2B A1 71 9D D7 2B 94 7B BA C4 F0 A5 A4 F8
	# umount /mnt

	With 256 bit keys:

	# echo 1 > /sys/module/cifs/parameters/require_gcm_256
	# mount.cifs //$ip/test /mnt -o "username=administrator,password=foo,vers=3.11,seal"
	# ./dumpkeys /mnt/somefile
	SessionId      09 00 00 00 00 0C 00 00
	Cipher         0004
	SessionKey     93 F5 82 3B 2F B7 2A 50 0B B9 BA 26 FB 8C 8B 03
	ServerIn Key   6C 6A 89 B2 CB 7B 78 E8 04 93 37 DA 22 53 47 DF B3 2C 5F 02 26 70 43 DB 8D 33 7B DC 66 D3 75 A9
	ServerOut Key  04 11 AA D7 52 C7 A8 0F ED E3 93 3A 65 FE 03 AD 3F 63 03 01 2B C0 1B D7 D7 E5 52 19 7F CC 46 B4

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-27 15:26:32 -05:00
Shyam Prasad N eb06881805 cifs: fix string declarations and assignments in tracepoints
We missed using the variable length string macros in several
tracepoints. Fixed them in this change.

There's probably more useful macros that we can use to print
others like flags etc. But I'll submit sepawrate patches for
those at a future date.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: <stable@vger.kernel.org> # v5.12
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-27 14:04:32 -05:00
Aurelien Aptel 6d2fcfe6b5 cifs: set server->cipher_type to AES-128-CCM for SMB3.0
SMB3.0 doesn't have encryption negotiate context but simply uses
the SMB2_GLOBAL_CAP_ENCRYPTION flag.

When that flag is present in the neg response cifs.ko uses AES-128-CCM
which is the only cipher available in this context.

cipher_type was set to the server cipher only when parsing encryption
negotiate context (SMB3.1.1).

For SMB3.0 it was set to 0. This means cipher_type value can be 0 or 1
for AES-128-CCM.

Fix this by checking for SMB3.0 and encryption capability and setting
cipher_type appropriately.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-27 14:03:47 -05:00
David Howells f610a5a29c afs: Fix the nlink handling of dir-over-dir rename
Fix rename of one directory over another such that the nlink on the deleted
directory is cleared to 0 rather than being decremented to 1.

This was causing the generic/035 xfstest to fail.

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/162194384460.3999479.7605572278074191079.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-27 06:23:58 -10:00
Dave Chinner 0fe0bbe00a xfs: bunmapi has unnecessary AG lock ordering issues
large directory block size operations are assert failing because
xfs_bunmapi() is not completely removing fragmented directory blocks
like so:

XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 677
....
Call Trace:
 xfs_dir2_shrink_inode+0x1a8/0x210
 xfs_dir2_block_to_sf+0x2ae/0x410
 xfs_dir2_block_removename+0x21a/0x280
 xfs_dir_removename+0x195/0x1d0
 xfs_rename+0xb79/0xc50
 ? avc_has_perm+0x8d/0x1a0
 ? avc_has_perm_noaudit+0x9a/0x120
 xfs_vn_rename+0xdb/0x150
 vfs_rename+0x719/0xb50
 ? __lookup_hash+0x6a/0xa0
 do_renameat2+0x413/0x5e0
 __x64_sys_rename+0x45/0x50
 do_syscall_64+0x3a/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xae

We are aborting the bunmapi() pass because of this specific chunk of
code:

                /*
                 * Make sure we don't touch multiple AGF headers out of order
                 * in a single transaction, as that could cause AB-BA deadlocks.
                 */
                if (!wasdel && !isrt) {
                        agno = XFS_FSB_TO_AGNO(mp, del.br_startblock);
                        if (prev_agno != NULLAGNUMBER && prev_agno > agno)
                                break;
                        prev_agno = agno;
                }

This is designed to prevent deadlocks in AGF locking when freeing
multiple extents by ensuring that we only ever lock in increasing
AG number order. Unfortunately, this also violates the "bunmapi will
always succeed" semantic that some high level callers depend on,
such as xfs_dir2_shrink_inode(), xfs_da_shrink_inode() and
xfs_inactive_symlink_rmt().

This AG lock ordering was introduced back in 2017 to fix deadlocks
triggered by generic/299 as reported here:

https://lore.kernel.org/linux-xfs/800468eb-3ded-9166-20a4-047de8018582@gmail.com/

This codebase is old enough that it was before we were defering all
AG based extent freeing from within xfs_bunmapi(). THat is, we never
actually lock AGs in xfs_bunmapi() any more - every non-rt based
extent free is added to the defer ops list, as is all BMBT block
freeing. And RT extents are not RT based, so there's no lock
ordering issues associated with them.

Hence this AGF lock ordering code is both broken and dead. Let's
just remove it so that the large directory block code works reliably
again.

Tested against xfs/538 and generic/299 which is the original test
that exposed the deadlocks that this code fixed.

Fixes: 5b094d6dac ("xfs: fix multi-AG deadlock in xfs_bunmapi")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-27 08:11:24 -07:00
Dave Chinner 991c2c5980 xfs: btree format inode forks can have zero extents
xfs/538 is assert failing with this trace when testing with
directory block sizes of 64kB:

XFS: Assertion failed: !xfs_need_iread_extents(ifp), file: fs/xfs/libxfs/xfs_bmap.c, line: 608
....
Call Trace:
 xfs_bmap_btree_to_extents+0x2a9/0x470
 ? kmem_cache_alloc+0xe7/0x220
 __xfs_bunmapi+0x4ca/0xdf0
 xfs_bunmapi+0x1a/0x30
 xfs_dir2_shrink_inode+0x71/0x210
 xfs_dir2_block_to_sf+0x2ae/0x410
 xfs_dir2_block_removename+0x21a/0x280
 xfs_dir_removename+0x195/0x1d0
 xfs_remove+0x244/0x460
 xfs_vn_unlink+0x53/0xa0
 ? selinux_inode_unlink+0x13/0x20
 vfs_unlink+0x117/0x220
 do_unlinkat+0x1a2/0x2d0
 __x64_sys_unlink+0x42/0x60
 do_syscall_64+0x3a/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xae

This is a check to ensure that the extents have been read into
memory before we are doing a ifork btree manipulation. This assert
is bogus in the above case.

We have a fragmented directory block that has more extents in it
than can fit in extent format, so the inode data fork is in btree
format. xfs_dir2_shrink_inode() asks to remove all remaining 16
filesystem blocks from the inode so it can convert to short form,
and __xfs_bunmapi() removes all the extents. We now have a data fork
in btree format but have zero extents in the fork. This incorrectly
trips the xfs_need_iread_extents() assert because it assumes that an
empty extent btree means the extent tree has not been read into
memory yet. This is clearly not the case with xfs_bunmapi(), as it
has an explicit call to xfs_iread_extents() in it to pull the
extents into memory before it starts unmapping.

Also, the assert directly after this bogus one is:

	ASSERT(ifp->if_format == XFS_DINODE_FMT_BTREE);

Which covers the context in which it is legal to call
xfs_bmap_btree_to_extents just fine. Hence we should just remove the
bogus assert as it is clearly wrong and causes a regression.

The returns the test behaviour to the pre-existing assert failure in
xfs_dir2_shrink_inode() that indicates xfs_bunmapi() has failed to
remove all the extents in the range it was asked to unmap.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-27 08:11:24 -07:00
Marco Elver b16ef427ad io_uring: fix data race to avoid potential NULL-deref
Commit ba5ef6dc8a ("io_uring: fortify tctx/io_wq cleanup") introduced
setting tctx->io_wq to NULL a bit earlier. This has caused KCSAN to
detect a data race between accesses to tctx->io_wq:

  write to 0xffff88811d8df330 of 8 bytes by task 3709 on cpu 1:
   io_uring_clean_tctx                  fs/io_uring.c:9042 [inline]
   __io_uring_cancel                    fs/io_uring.c:9136
   io_uring_files_cancel                include/linux/io_uring.h:16 [inline]
   do_exit                              kernel/exit.c:781
   do_group_exit                        kernel/exit.c:923
   get_signal                           kernel/signal.c:2835
   arch_do_signal_or_restart            arch/x86/kernel/signal.c:789
   handle_signal_work                   kernel/entry/common.c:147 [inline]
   exit_to_user_mode_loop               kernel/entry/common.c:171 [inline]
   ...
  read to 0xffff88811d8df330 of 8 bytes by task 6412 on cpu 0:
   io_uring_try_cancel_iowq             fs/io_uring.c:8911 [inline]
   io_uring_try_cancel_requests         fs/io_uring.c:8933
   io_ring_exit_work                    fs/io_uring.c:8736
   process_one_work                     kernel/workqueue.c:2276
   ...

With the config used, KCSAN only reports data races with value changes:
this implies that in the case here we also know that tctx->io_wq was
non-NULL. Therefore, depending on interleaving, we may end up with:

              [CPU 0]                 |        [CPU 1]
  io_uring_try_cancel_iowq()          | io_uring_clean_tctx()
    if (!tctx->io_wq) // false        |   ...
    ...                               |   tctx->io_wq = NULL
    io_wq_cancel_cb(tctx->io_wq, ...) |   ...
      -> NULL-deref                   |

Note: It is likely that thus far we've gotten lucky and the compiler
optimizes the double-read into a single read into a register -- but this
is never guaranteed, and can easily change with a different config!

Fix the data race by restoring the previous behaviour, where both
setting io_wq to NULL and put of the wq are _serialized_ after
concurrent io_uring_try_cancel_iowq() via acquisition of the uring_lock
and removal of the node in io_uring_del_task_file().

Fixes: ba5ef6dc8a ("io_uring: fortify tctx/io_wq cleanup")
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: syzbot+bf2b3d0435b9b728946c@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20210527092547.2656514-1-elver@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-27 07:44:49 -06:00
Huilong Deng a799b68a7c nfs: Remove trailing semicolon in macros
Macros should not use a trailing semicolon.

Signed-off-by: Huilong Deng <denghuilong@cdjrlc.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-27 09:19:33 -04:00
Zhang Xiaoxu e67afa7ee4 NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config
Since commit bdcc2cd14e ("NFSv4.2: handle NFS-specific llseek errors"),
nfs42_proc_llseek would return -EOPNOTSUPP rather than -ENOTSUPP when
SEEK_DATA on NFSv4.0/v4.1.

This will lead xfstests generic/285 not run on NFSv4.0/v4.1 when set the
CONFIG_NFS_V4_2, rather than run failed.

Fixes: bdcc2cd14e ("NFSv4.2: handle NFS-specific llseek errors")
Cc: <stable.vger.kernel.org> # 4.2
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-27 08:46:19 -04:00
Gustavo A. R. Silva 53004ee78d xfs: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix
the following warnings by replacing /* fall through */ comments,
and its variants, with the new pseudo-keyword macro fallthrough:

fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/scrub/agheader.c:89:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]

Notice that Clang doesn't recognize /* fall through */ comments as
implicit fall-through markings, so in order to globally enable
-Wimplicit-fallthrough for Clang, these comments need to be
replaced with fallthrough; in the whole codebase.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-26 14:51:26 -05:00
Zqiang 3743c1723b io-wq: Fix UAF when wakeup wqe in hash waitqueue
BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650
Read of size 8 at addr ffff8880304250d8 by task iou-wrk-28796/28802

Call Trace:
 __dump_stack [inline]
 dump_stack+0x141/0x1d7
 print_address_description.constprop.0.cold+0x5b/0x2c6
 __kasan_report [inline]
 kasan_report.cold+0x7c/0xd8
 __wake_up_common+0x637/0x650
 __wake_up_common_lock+0xd0/0x130
 io_worker_handle_work+0x9dd/0x1790
 io_wqe_worker+0xb2a/0xd40
 ret_from_fork+0x1f/0x30

Allocated by task 28798:
 kzalloc_node [inline]
 io_wq_create+0x3c4/0xdd0
 io_init_wq_offload [inline]
 io_uring_alloc_task_context+0x1bf/0x6b0
 __io_uring_add_task_file+0x29a/0x3c0
 io_uring_add_task_file [inline]
 io_uring_install_fd [inline]
 io_uring_create [inline]
 io_uring_setup+0x209a/0x2bd0
 do_syscall_64+0x3a/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 28798:
 kfree+0x106/0x2c0
 io_wq_destroy+0x182/0x380
 io_wq_put [inline]
 io_wq_put_and_exit+0x7a/0xa0
 io_uring_clean_tctx [inline]
 __io_uring_cancel+0x428/0x530
 io_uring_files_cancel
 do_exit+0x299/0x2a60
 do_group_exit+0x125/0x310
 get_signal+0x47f/0x2150
 arch_do_signal_or_restart+0x2a8/0x1eb0
 handle_signal_work[inline]
 exit_to_user_mode_loop [inline]
 exit_to_user_mode_prepare+0x171/0x280
 __syscall_exit_to_user_mode_work [inline]
 syscall_exit_to_user_mode+0x19/0x60
 do_syscall_64+0x47/0xb0
 entry_SYSCALL_64_after_hwframe

There are the following scenarios, hash waitqueue is shared by
io-wq1 and io-wq2. (note: wqe is worker)

io-wq1:worker2     | locks bit1
io-wq2:worker1     | waits bit1
io-wq1:worker3     | waits bit1

io-wq1:worker2     | completes all wqe bit1 work items
io-wq1:worker2     | drop bit1, exit

io-wq2:worker1     | locks bit1
io-wq1:worker3     | can not locks bit1, waits bit1 and exit
io-wq1             | exit and free io-wq1
io-wq2:worker1     | drops bit1
io-wq1:worker3     | be waked up, even though wqe is freed

After all iou-wrk belonging to io-wq1 have exited, remove wqe
form hash waitqueue, it is guaranteed that there will be no more
wqe belonging to io-wq1 in the hash waitqueue.

Reported-by: syzbot+6cb11ade52aa17095297@syzkaller.appspotmail.com
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Link: https://lore.kernel.org/r/20210526050826.30500-1-qiang.zhang@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-26 09:03:56 -06:00
Colin Ian King 7d3848c03e fs: dlm: Fix spelling mistake "stucked" -> "stuck"
There are spelling mistake in log messages. Fix these.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-26 09:49:04 -05:00
Colin Ian King f6089981d0 fs: dlm: Fix memory leak of object mh
There is an error return path that is not kfree'ing mh after
it has been successfully allocates.  Fix this by moving the
call to create_rcom to after the check on rc_in->rc_id check
to avoid this.

Thanks to Alexander Ahring Oder Aring for suggesting the
correct way to fix this.

Addresses-Coverity: ("Resource leak")
Fixes: a070a91cf1 ("fs: dlm: add more midcomms hooks")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-26 09:49:04 -05:00
Chao Yu 8939a8489c f2fs: atgc: export entries for better tunability via sysfs
This patch export below sysfs entries for better ATGC tunability.

/sys/fs/f2fs/<disk>/atgc_candidate_ratio
/sys/fs/f2fs/<disk>/atgc_candidate_count
/sys/fs/f2fs/<disk>/atgc_age_weight
/sys/fs/f2fs/<disk>/atgc_age_threshold

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-26 07:00:36 -07:00
Chao Yu 4a67d9b07a f2fs: compress: fix to disallow temp extension
This patch restricts to configure compress extension as format of:

 [filename + '.' + extension]

rather than:

 [filename + '.' + extension + (optional: '.' + temp extension)]

in order to avoid to enable compression incorrectly:

1. compress_extension=so
2. touch file.soa
3. touch file.so.tmp

Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-26 07:00:36 -07:00
Jaegeuk Kim e3c548323d f2fs: let's allow compression for mmap files
This patch allows to compress mmap files. E.g., for so files.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-26 07:00:36 -07:00
Chao Yu 0dd571785d f2fs: add MODULE_SOFTDEP to ensure crc32 is included in the initramfs
As marcosfrm reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=213089

Initramfs generators rely on "pre" softdeps (and "depends") to include
additional required modules.

F2FS does not declare "pre: crc32" softdep. Then every generator (dracut,
mkinitcpio...) has to maintain a hardcoded list for this purpose.

Hence let's use MODULE_SOFTDEP("pre: crc32") in f2fs code.

Fixes: 43b6573bac ("f2fs: use cryptoapi crc32 functions")
Reported-by: marcosfrm <marcosfrm@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-26 07:00:36 -07:00
Tom Rix 4f55dc2a98 f2fs: return success if there is no work to do
Static analysis reports this problem
file.c:3206:2: warning: Undefined or garbage value returned to caller
        return err;
        ^~~~~~~~~~

err is only set if there is some work to do.  Because the loop returns
immediately on an error, if all the work was done, a 0 would be returned.
Instead of checking the unlikely case that there was no work to do,
change the return of err to 0.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-26 07:00:35 -07:00
Trond Myklebust 70536bf4eb NFS: Clean up reset of the mirror accounting variables
Now that nfs_pageio_do_add_request() resets the pg_count, we don't need
these other inlined resets.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-26 06:36:13 -04:00
Trond Myklebust 0d0ea30935 NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce()
The value of mirror->pg_bytes_written should only be updated after a
successful attempt to flush out the requests on the list.

Fixes: a7d42ddb30 ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-26 06:36:13 -04:00
Trond Myklebust 56517ab958 NFS: Fix an Oopsable condition in __nfs_pageio_add_request()
Ensure that nfs_pageio_error_cleanup() resets the mirror array contents,
so that the structure reflects the fact that it is now empty.
Also change the test in nfs_pageio_do_add_request() to be more robust by
checking whether or not the list is empty rather than relying on the
value of pg_count.

Fixes: a7d42ddb30 ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-26 06:36:13 -04:00
Pavel Begunkov 17a91051fe io_uring/io-wq: close io-wq full-stop gap
There is an old problem with io-wq cancellation where requests should be
killed and are in io-wq but are not discoverable, e.g. in @next_hashed
or @linked vars of io_worker_handle_work(). It adds some unreliability
to individual request canellation, but also may potentially get
__io_uring_cancel() stuck. For instance:

1) An __io_uring_cancel()'s cancellation round have not found any
   request but there are some as desribed.
2) __io_uring_cancel() goes to sleep
3) Then workers wake up and try to execute those hidden requests
   that happen to be unbound.

As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT
in advance, so preventing 3) from executing unbound requests. The
workers will initially break looping because of getting a signal as they
are threads of the dying/exec()'ing user task.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-25 19:39:58 -06:00
Dai Ngo f4e44b3933 NFSD: delay unmount source's export after inter-server copy completed.
Currently the source's export is mounted and unmounted on every
inter-server copy operation. This patch is an enhancement to delay
the unmount of the source export for a certain period of time to
eliminate the mount and unmount overhead on subsequent copy operations.

After a copy operation completes, a work entry is added to the
delayed unmount list with an expiration time. This list is serviced
by the laundromat thread to unmount the export of the expired entries.
Each time the export is being used again, its expiration time is
extended and the entry is re-inserted to the tail of the list.

The unmount task and the mount operation of the copy request are
synced to make sure the export is not unmounted while it's being
used.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-25 17:06:51 -04:00
Olga Kornievskaia eac0b17a77 NFSD add vfs_fsync after async copy is done
Currently, the server does all copies as NFS_UNSTABLE. For synchronous
copies linux client will append a COMMIT to the COPY compound but for
async copies it does not (because COMMIT needs to be done after all
bytes are copied and not as a reply to the COPY operation).

However, in order to save the client doing a COMMIT as a separate
rpc, the server can reply back with NFS_FILE_SYNC copy. This patch
proposed to add vfs_fsync() call at the end of the async copy.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-25 17:06:51 -04:00
J. Bruce Fields eeeadbb9bd nfsd: move some commit_metadata()s outside the inode lock
The commit may be time-consuming and there's no need to hold the lock
for it.

More of these are possible, these were just some easy ones.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-25 17:06:51 -04:00
Yu Hsiang Huang e5d74a2d0e nfsd: Prevent truncation of an unlinked inode from blocking access to its directory
Truncation of an unlinked inode may take a long time for I/O waiting, and
it doesn't have to prevent access to the directory. Thus, let truncation
occur outside the directory's mutex, just like do_unlinkat() does.

Signed-off-by: Yu Hsiang Huang <nickhuang@synology.com>
Signed-off-by: Bing Jing Chang <bingjingc@synology.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-25 17:06:51 -04:00
Kees Cook bfb819ea20 proc: Check /proc/$pid/attr/ writes against file opener
Fix another "confused deputy" weakness[1]. Writes to /proc/$pid/attr/
files need to check the opener credentials, since these fds do not
transition state across execve(). Without this, it is possible to
trick another process (which may have different credentials) to write
to its own /proc/$pid/attr/ files, leading to unexpected and possibly
exploitable behaviors.

[1] https://www.kernel.org/doc/html/latest/security/credentials.html?highlight=confused#open-file-credentials

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-25 10:24:41 -10:00
Linus Torvalds ad9f25d338 netfslib fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCs8qYACgkQ+7dXa6fL
 C2s3QhAAkRSWoZQEmisGMcRKOhDJQh8Qc7lV4aKXFIa4EaLdeBPvhJhnJqMld2KY
 m35g4bSU/RsUjzSCLXVnEiHa9jdFKyK0C/XWshyidzrTDUk0HN6NXsBpp3ztWKlq
 iMOvQYnKWKoWr4seIdC1fAKSFcQ3uRlVnDnmm0GtB5ahu5ThNQtqf8nYMSuULZbo
 K9SybNUVCrDsORqDu2595gfK63MCOVn72Hj066s8owHrbD8Io52Kf6Q7jP1CkMGL
 x6Kl0pwjql6usUsaDEaqmNT3ck7UjlLp5h1EZnt/7SWbgInpNzk6BLP33DwCis+4
 rUpu+Zf8TEeOYDU5if8QpVszwsMyoKtkp9AjgjZkvxbedCqHkXJjxrnkk6/H7yJc
 4Zvi8sIU52D9PcZO0bD8zP/8eYm/ZTVjMjDt8PvIbTA583oGNWsfRBbvJYi1huxB
 i3G0PNVbqH0U3Z78XH4dmrkE1oMxbq2O5fg9ZNCuStxqD2vrZyyo/CcfidElLnCq
 vcT+obEI+NYFphMzk7rwSL4pH4OPwPziJfiudKANmUDei8rOejQ8nrw18CVF7neN
 Ewj1XiHOdi4JgGq92owpmCmTvle7GG9KNuCvfd4U67S9KOJAPT5UrSD696PrJJN7
 YpcBHJMqS9XLXwrGuKD7oDroxEEpvJEunRH+yt3YPa5OQtX3wIA=
 =poNo
 -----END PGP SIGNATURE-----

Merge tag 'netfs-lib-fixes-20200525' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull netfs fixes from David Howells:
 "A couple of fixes to the new netfs lib:

   - Pass the AOP flags through from netfs_write_begin() into
     grab_cache_page_write_begin().

   - Automatically enable in Kconfig netfs lib rather than presenting an
     option for manual enablement"

* tag 'netfs-lib-fixes-20200525' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  netfs: Make CONFIG_NETFS_SUPPORT auto-selected rather than manual
  netfs: Pass flags through to grab_cache_page_write_begin()
2021-05-25 07:31:49 -10:00
Gustavo A. R. Silva b2db6c35ba afs: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple
warnings by explicitly adding multiple fallthrough pseudo-keywords in
places where the code is intended to fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/51150b54e0b0431a2c401cd54f2c4e7f50e94601.1605896059.git.gustavoars@kernel.org/ # v1
Link: https://lore.kernel.org/r/20210420211615.GA51432@embeddedor/ # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-25 07:30:34 -10:00
Alexander Aring 706474fbc5 fs: dlm: don't allow half transmitted messages
This patch will clean a dirty page buffer if a reconnect occurs. If a page
buffer was half transmitted we cannot start inside the middle of a dlm
message if a node connects again. I observed invalid length receptions
errors and was guessing that this behaviour occurs, after this patch I
never saw an invalid message length again. This patch might drops more
messages for dlm version 3.1 but 3.1 can't deal with half messages as
well, for 3.2 it might trigger more re-transmissions but will not leave dlm
in a broken state.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring 5b2f981fde fs: dlm: add midcomms debugfs functionality
This patch adds functionality to debug midcomms per connection state
inside a comms directory which is similar like dlm configfs. Currently
there exists the possibility to read out two attributes which is the
send queue counter and the version of each midcomms node state.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring 489d8e559c fs: dlm: add reliable connection if reconnect
This patch introduce to make a tcp lowcomms connection reliable even if
reconnects occurs. This is done by an application layer re-transmission
handling and sequence numbers in dlm protocols. There are three new dlm
commands:

DLM_OPTS:

This will encapsulate an existing dlm message (and rcom message if they
don't have an own application side re-transmission handling). As optional
handling additional tlv's (type length fields) can be appended. This can
be for example a sequence number field. However because in DLM_OPTS the
lockspace field is unused and a sequence number is a mandatory field it
isn't made as a tlv and we put the sequence number inside the lockspace
id. The possibility to add optional options are still there for future
purposes.

DLM_ACK:

Just a dlm header to acknowledge the receive of a DLM_OPTS message to
it's sender.

DLM_FIN:

This provides a 4 way handshake for connection termination inclusive
support for half-closed connections. It's provided on application layer
because SCTP doesn't support half-closed sockets, the shutdown() call
can interrupted by e.g. TCP resets itself and a hard logic to implement
it because the othercon paradigm in lowcomms. The 4-way termination
handshake also solve problems to synchronize peer EOF arrival and that
the cluster manager removes the peer in the node membership handling of
DLM. In some cases messages can be still transmitted in this time and we
need to wait for the node membership event.

To provide a reliable connection the node will retransmit all
unacknowledges message to it's peer on reconnect. The receiver will then
filtering out the next received message and drop all messages which are
duplicates.

As RCOM_STATUS and RCOM_NAMES messages are the first messages which are
exchanged and they have they own re-transmission handling, there exists
logic that these messages must be first. If these messages arrives we
store the dlm version field. This handling is on DLM 3.1 and after this
patch 3.2 the same. A backwards compatibility handling has been added
which seems to work on tests without tcpkill, however it's not recommended
to use DLM 3.1 and 3.2 at the same time, because DLM 3.2 tries to fix long
term bugs in the DLM protocol.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring 8e2e40860c fs: dlm: add union in dlm header for lockspace id
This patch adds union inside the lockspace id to handle it also for
another use case for a different dlm command.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring 37a247da51 fs: dlm: move out some hash functionality
This patch moves out some lowcomms hash functionality into lowcomms
header to provide them to other layers like midcomms as well.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring 2874d1a68c fs: dlm: add functionality to re-transmit a message
This patch introduces a retransmit functionality for a lowcomms message
handle. It's just allocates a new buffer and transmit it again, no
special handling about prioritize it because keeping bytestream in order.

To avoid another connection look some refactor was done to make a new
buffer allocation with a preexisting connection pointer.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring 8f2dc78dbc fs: dlm: make buffer handling per msg
This patch makes the void pointer handle for lowcomms functionality per
message and not per page allocation entry. A refcount handling for the
handle was added to keep the message alive until the user doesn't need
it anymore.

There exists now a per message callback which will be called when
allocating a new buffer. This callback will be guaranteed to be called
according the order of the sending buffer, which can be used that the
caller increments a sequence number for the dlm message handle.

For transition process we cast the dlm_mhandle to dlm_msg and vice versa
until the midcomms layer will implement a specific dlm_mhandle structure.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring a070a91cf1 fs: dlm: add more midcomms hooks
This patch prepares hooks to redirect to the midcomms layer which will
be used by the midcomms re-transmit handling.

There exists the new concept of stateless buffers allocation and
commits. This can be used to bypass the midcomms re-transmit handling. It
is used by RCOM_STATUS and RCOM_NAMES messages, because they have their
own ping-like re-transmit handling. As well these two messages will be
used to determine the DLM version per node, because these two messages
are per observation the first messages which are exchanged.

Cluster manager events for node membership are added to add support for
half-closed connections in cases that the peer connection get to
an end of file but DLM still holds membership of the node. In
this time DLM can still trigger new message which we should allow. After
the cluster manager node removal event occurs it safe to close the
connection.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring 6fb5cf9d42 fs: dlm: public header in out utility
This patch allows to use header_out() and header_in() outside of dlm
util functionality.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring 8aa31cbf20 fs: dlm: fix connection tcp EOF handling
This patch fixes the EOF handling for TCP that if and EOF is received we
will close the socket next time the writequeue runs empty. This is a
half-closed socket functionality which doesn't exists in SCTP. The
midcomms layer will do a half closed socket functionality on DLM side to
solve this problem for the SCTP case. However there is still the last ack
flying around but other reset functionality will take care of it if it got
lost.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring c6aa00e3d2 fs: dlm: cancel work sync othercon
These rx tx flags arguments are for signaling close_connection() from
which worker they are called. Obviously the receive worker cannot cancel
itself and vice versa for swork. For the othercon the receive worker
should only be used, however to avoid deadlocks we should pass the same
flags as the original close_connection() was called.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring ba868d9dea fs: dlm: reconnect if socket error report occurs
This patch will change the reconnect handling that if an error occurs
if a socket error callback is occurred. This will also handle reconnects
in a non blocking connecting case which is currently missing. If error
ECONNREFUSED is reported we delay the reconnect by one second.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring 7443bc9625 fs: dlm: set is othercon flag
There is a is othercon flag which is never used, this patch will set it
and printout a warning if the othercon ever sends a dlm message which
should never be the case.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring b38bc9c2b3 fs: dlm: fix srcu read lock usage
This patch holds the srcu connection read lock in cases where we lookup
the connections and accessing it. We don't hold the srcu lock in workers
function where the scheduled worker is part of the connection itself.
The connection should not be freed if any worker is scheduled or
pending.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring 2df6b7627a fs: dlm: add dlm macros for ratelimit log
This patch add ratelimit macro to dlm subsystem and will set the
connecting log message to ratelimit. In non blocking connecting cases it
will print out this message a lot.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
Alexander Aring c937aabbd7 fs: dlm: always run complete for possible waiters
This patch changes the ping_members() result that we always run
complete() for possible waiters. We handle the -EINTR error code as
successful. This error code is returned if the recovery is stopped which
is likely that a new recovery is triggered with a new members
configuration and ping_members() runs again.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-05-25 09:22:20 -05:00
David Howells b71c791254 netfs: Make CONFIG_NETFS_SUPPORT auto-selected rather than manual
Make the netfs helper library selected automatically by the things that use
it rather than being manually configured, even though it's required[1].

Fixes: 3a5829fefd ("netfs: Make a netfs helper module")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/CAMuHMdXJZ7iNQE964CdBOU=vRKVMFzo=YF_eiwsGgqzuvZ+TuA@mail.gmail.com [1]
Link: https://lore.kernel.org/r/162090298141.3166007.2971118149366779916.stgit@warthog.procyon.org.uk # v1
2021-05-25 13:48:04 +01:00
David Howells 19dee61381 netfs: Pass flags through to grab_cache_page_write_begin()
In netfs_write_begin(), pass the AOP flags through to
grab_cache_page_write_begin() so that a request to use GFP_NOFS is
honoured.

Fixes: e1b1240c1f ("netfs: Add write_begin helper")
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/162090295383.3165945.13595101698295243662.stgit@warthog.procyon.org.uk # v1
2021-05-25 13:46:32 +01:00
Amir Goldstein a8b98c808e fanotify: fix permission model of unprivileged group
Reporting event->pid should depend on the privileges of the user that
initialized the group, not the privileges of the user reading the
events.

Use an internal group flag FANOTIFY_UNPRIV to record the fact that the
group was initialized by an unprivileged user.

To be on the safe side, the premissions to setup filesystem and mount
marks now require that both the user that initialized the group and
the user setting up the mark have CAP_SYS_ADMIN.

Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxiA77_P5vtv7e83g0+9d7B5W9ZTE4GfQEYbWmfT1rA=VA@mail.gmail.com/
Fixes: 7cea2a3c50 ("fanotify: support limited functionality for unprivileged users")
Cc: <Stable@vger.kernel.org> # v5.12+
Link: https://lore.kernel.org/r/20210524135321.2190062-1-amir73il@gmail.com
Reviewed-by: Matthew Bobrowski <repnop@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-05-25 12:21:14 +02:00
Bart Van Assche 7fe1e79b59 configfs: implement the .read_iter and .write_iter methods
Configfs is one of the few filesystems that does not yet support the
.read_iter and .write_iter callbacks. This patch adds support for these
methods in configfs.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[hch: split out a separate fix]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-25 10:30:56 +02:00
Christoph Hellwig 44b9a000df configfs: drop pointless kerneldoc comments
file.c has a bunch of kerneldoc comments for static functions that do not
document any API but just list what is done.  Drop them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-25 10:30:31 +02:00
Bart Van Assche dd33f1f7aa configfs: fix the kerneldoc comment for configfs_create_bin_file
Mention the correct argument name.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-25 10:30:31 +02:00
Darrick J. Wong 603f000b15 xfs: validate extsz hints against rt extent size when rtinherit is set
The RTINHERIT bit can be set on a directory so that newly created
regular files will have the REALTIME bit set to store their data on the
realtime volume.  If an extent size hint (and EXTSZINHERIT) are set on
the directory, the hint will also be copied into the new file.

As pointed out in previous patches, for realtime files we require the
extent size hint be an integer multiple of the realtime extent, but we
don't perform the same validation on a directory with both RTINHERIT and
EXTSZINHERIT set, even though the only use-case of that combination is
to propagate extent size hints into new realtime files.  This leads to
inode corruption errors when the bad values are propagated.

Because there may be existing filesystems with such a configuration, we
cannot simply amend the inode verifier to trip on these directories and
call it a day because that will cause previously "working" filesystems
to start throwing errors abruptly.  Note that it's valid to have
directories with rtinherit set even if there is no realtime volume, in
which case the problem does not manifest because rtinherit is ignored if
there's no realtime device; and it's possible that someone set the flag,
crashed, repaired the filesystem (which clears the hint on the realtime
file) and continued.

Therefore, mitigate this issue in several ways: First, if we try to
write out an inode with both rtinherit/extszinherit set and an unaligned
extent size hint, turn off the hint to correct the error.  Second, if
someone tries to misconfigure a directory via the fssetxattr ioctl, fail
the ioctl.  Third, reverify both extent size hint values when we
propagate heritable inode attributes from parent to child, to prevent
misconfigurations from spreading.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-05-24 18:01:04 -07:00
Darrick J. Wong 6b69e48589 xfs: standardize extent size hint validation
While chasing a bug involving invalid extent size hints being propagated
into newly created realtime files, I noticed that the xfs_ioctl_setattr
checks for the extent size hints weren't the same as the ones now
encoded in libxfs and used for validation in repair and mkfs.

Because the checks in libxfs are more stringent than the ones in the
ioctl, it's possible for a live system to set inode flags that
immediately result in corruption warnings.  Specifically, it's possible
to set an extent size hint on an rtinherit directory without checking if
the hint is aligned to the realtime extent size, which makes no sense
since that combination is used only to seed new realtime files.

Replace the open-coded and inadequate checks with the libxfs verifier
versions and update the code comments a bit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-05-24 18:01:04 -07:00
Darrick J. Wong 0f9342513c xfs: check free AG space when making per-AG reservations
The new online shrink code exposed a gap in the per-AG reservation
code, which is that we only return ENOSPC to callers if the entire fs
doesn't have enough free blocks.  Except for debugging mode, the
reservation init code doesn't ever check that there's enough free space
in that AG to cover the reservation.

Not having enough space is not considered an immediate fatal error that
requires filesystem offlining because (a) it's shouldn't be possible to
wind up in that state through normal file operations and (b) even if
one did, freeing data blocks would recover the situation.

However, online shrink now needs to know if shrinking would not leave
enough space so that it can abort the shrink operation.  Hence we need
to promote this assertion into an actual error return.

Observed by running xfs/168 with a 1k block size, though in theory this
could happen with any configuration.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-05-24 18:01:04 -07:00
zhangyi (F) 12e0613715 block_dump: remove block_dump feature in mark_inode_dirty()
block_dump is an old debugging interface, one of it's functions is used
to print the information about who write which file on disk. If we
enable block_dump through /proc/sys/vm/block_dump and turn on debug log
level, we can gather information about write process name, target file
name and disk from kernel message. This feature is realized in
block_dump___mark_inode_dirty(), it print above information into kernel
message directly when marking inode dirty, so it is noisy and can easily
trigger log storm. At the same time, get the dentry refcount is also not
safe, we found it will lead to deadlock on ext4 file system with
data=journal mode.

After tracepoints has been introduced into the kernel, we got a
tracepoint in __mark_inode_dirty(), which is a better replacement of
block_dump___mark_inode_dirty(). The only downside is that it only trace
the inode number and not a file name, but it probably doesn't matter
because the original printed file name in block_dump is not accurate in
some cases, and we can still find it through the inode number and device
id. So this patch delete the dirting inode part of block_dump feature.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210313030146.2882027-2-yi.zhang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-24 06:47:21 -06:00
YueHaibing 21e4e15a84 reiserfs: Remove unneed check in reiserfs_write_full_page()
Condition !A || A && B is equivalent to !A || B.

Generated by: scripts/coccinelle/misc/excluded_middle.cocci

Link: https://lore.kernel.org/r/20210523090258.27696-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-05-24 11:00:56 +02:00
Mike Kravetz e32905e573 userfaultfd: hugetlbfs: fix new flag usage in error path
In commit d6995da311 ("hugetlb: use page.private for hugetlb specific
page flags") the use of PagePrivate to indicate a reservation count
should be restored at free time was changed to the hugetlb specific flag
HPageRestoreReserve.  Changes to a userfaultfd error path as well as a
VM_BUG_ON() in remove_inode_hugepages() were overlooked.

Users could see incorrect hugetlb reserve counts if they experience an
error with a UFFDIO_COPY operation.  Specifically, this would be the
result of an unlikely copy_huge_page_from_user error.  There is not an
increased chance of hitting the VM_BUG_ON.

Link: https://lkml.kernel.org/r/20210521233952.236434-1-mike.kravetz@oracle.com
Fixes: d6995da311 ("hugetlb: use page.private for hugetlb specific page flags")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mina Almasry <almasry.mina@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-22 15:09:07 -10:00
Linus Torvalds 4ff2473bdb block-5.13-2021-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCpPO4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgI0EACitV5OwfX+saZdQEj3LF4dAo7uZkMV0cZK
 GJ3m1NWsMDXJJofcczyVTEs0iNT4fpb1dKE9cyOVjAFDoH8Dn7C+UZ163QWu+SCk
 WGgyiY+Qdwr7cyl6+2+WQkLBeLcyuFVjGtYHTxYWY2O+DpyhRw94Oiih1bfnI/6i
 KZTpaA3z+pZs/KFIE7eUnkI/iWC39VShZ1T8/gXO9vmIhUkA67j1o9i3LYpGYnXx
 Awza8Lpql7s3tfWcDL6FNHQmFPUjiowCSUNupzdnHgjggWwUCosJTTcL+mfdTHOJ
 YuYM3qRuzTbIeXXy/5JTZUt5AOkS8SCre7BpclSDrhZBiL/dkvAndN43ce/6vc7i
 FrgvnbY/Ik2PWQwcbxiXZzcEKxT9dzXbsyJG08ePZwQ5s+8M5KVZv+ElrV+T7/nJ
 DYjnWahQ674tHv2Z7Bp4hAjnchwiypxqie8OnOKBI+WseT2D8Pjs2sinUHSYKYDk
 3m2e0BVsw+FAYt3bcdhocDQnrJwMNrhSuA9Rtyh6qeMG34yxOXJmZvrHNrbg2fG/
 a/xgVewn/P4sDxGCwS3XH/zILYgvJAwTFWIfDeRXE4epqsPZ9h8FBq3Fzl5asL7V
 yl9iQlWuE1+Ks8IQMjunbJfQSTEghPCjJWHVQQVJm+rT33qI80Ac4a0vdd99TaXh
 8P58LE+0jg==
 =ADzj
 -----END PGP SIGNATURE-----

Merge tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Fix BLKRRPART and deletion race (Gulam, Christoph)

 - NVMe pull request (Christoph):
      - nvme-tcp corruption and timeout fixes (Sagi Grimberg, Keith
        Busch)
      - nvme-fc teardown fix (James Smart)
      - nvmet/nvme-loop memory leak fixes (Wu Bo)"

* tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block:
  block: fix a race between del_gendisk and BLKRRPART
  block: prevent block device lookups at the beginning of del_gendisk
  nvme-fc: clear q_live at beginning of association teardown
  nvme-tcp: rerun io_work if req_list is not empty
  nvme-tcp: fix possible use-after-completion
  nvme-loop: fix memory leak in nvme_loop_create_ctrl()
  nvmet: fix memory leak in nvmet_alloc_ctrl()
2021-05-22 07:40:34 -10:00
Linus Torvalds b9231dfbcb io_uring-5.13-2021-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCpPQcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptBfD/0XV2+UXT7GTq3zfAgeCRojxzjH8YSh/fu1
 uSYA2fME0J7B2lpgxwHmDwgW/JkxkQ9oal+QxoNUJnmTF4CN3c7edQYxaA+QAnb/
 XiEY6s5slpBtopJCXQqPlE6dUnn1yjc0wNIm3EjvmmMaFjb6MVfMZWqWn9AANNTd
 yiRtk8a7KQuYdBeQMPVQG4v1ue37VTL5B9D9tM3p03W2ngNhtWw2Yxy5k/ePseip
 HYhPm+SKcbpmSFS+KN/a4aBLHyW89FRnhBWZF50sBmdUD+HLgz09IyFdSqyo9s2f
 wb7h3u3FbzTk3JofcJhYfqoQXkwmYhHrwNGCMjhK/zy+qloCIOu8Nw4jkcH+VYwK
 Rf7cFu+CZDRgcIu4Op/W5CPHNPY680Rxd/yBKlG/n4aZ/zxuuOu08992Z5BSaxfw
 UpIFMOWMuDvbBRUk71R34ME0o1wNhWL75Rljh97dAMRZLez1h8CmGdktT9g4keuo
 71Swq51AQk7fWXW0yQK2kIpbTjazfh6+AEvdF4c/Njss83K7PHCq00xeKI9PeNXN
 aQvPBpFifTeN1B1IENH5wEHO8F7e38eU45WHPwgNJUuSEpuBoXQGoLBlf6WXzNUS
 sIt6UjGDCFTZddIYwVfVISl7+DLBLCxYRYnw0Mx99x1shUpH+6q0HdpPgOxiKmoH
 ZgdG/q8rVg==
 =tipG
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.13-2021-05-22' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "One fix for a regression with poll in this merge window, and another
  just hardens the io-wq exit path a bit"

* tag 'io_uring-5.13-2021-05-22' of git://git.kernel.dk/linux-block:
  io_uring: fortify tctx/io_wq cleanup
  io_uring: don't modify req->poll for rw
2021-05-22 07:36:36 -10:00
Linus Torvalds a3969ef463 Fixes for 5.13-rc3:
- Fix some math errors in the realtime allocator when extent size hints
   are applied.
 - Fix unnecessary short writes to realtime files when free space is
   fragmented.
 - Fix a crash when using scrub tracepoints.
 - Restore ioctl uapi definitions that were accidentally removed in
   5.13-rc1.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmCmgSoACgkQ+H93GTRK
 tOvInhAAj9JbQLz/BPt14qVdteNDNBpdFdl/SC5Bow5ABOWi8FSOu9N/F32nMy9y
 fGPXP2G//sYzfW5jwE+ZPEEq7e+K55rLZHxHsbtWXjCL96t9edEDGFS5p6LmgnRW
 aJwE1QxhFHFJZEX1+Y08zmB+fSnwHJ4HzZihYb1f9sTI5cJRh7pvvj26HiqfUk9I
 FUT5Oo/Dy8gSSRmNCPWlLWhXJurrSwAYHmoE44vNHoEYHcodVwAK+ZR/Dj/5DQaS
 DTYgLeHZQmPijcW7/B/RKcEz96hMQ9afg7vBdgZwhG/oFCRKV2m3rYjjjo/AgKv4
 4FkmUoP5+CTT1vt9UKrdyl9uLTMqxiHuU1qvb82DM5XbbsLEFBBa+e3QrWmJqR8D
 3lBp6ogtOmbNEJpgLxCdbVl80HOjB+yaIWUB536nauz4USZvCcRGvdYDQ922q6ig
 1eT5Q6KCNgO3e6WIQ3W5kNJsM+/gXlNvwwhN/jHCQKy//bgVRnNYbODC6K46mjp/
 H8+NWyzQXpFjRWjmQ60LD2/RlyJVbbLbMkPTO1g/vjyZWjgp0fV2wtG6Ag/FlwVF
 DERUdp/0m6V+lPRcKbWCasjEc+pis6TDnvn+tCftXthh3fXGcalC3bnmv3yH9rkP
 YAejnDvuuLVtgPyVARA9DTQR5ch5CSRuFc6sU9SFL5nv2p59RtU=
 =IvUb
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.13-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - Fix some math errors in the realtime allocator when extent size hints
   are applied.

 - Fix unnecessary short writes to realtime files when free space is
   fragmented.

 - Fix a crash when using scrub tracepoints.

 - Restore ioctl uapi definitions that were accidentally removed in
   5.13-rc1.

* tag 'xfs-5.13-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: restore old ioctl definitions
  xfs: fix deadlock retry tracepoint arguments
  xfs: retry allocations when locality-based search fails
  xfs: adjust rt allocation minlen when extszhint > rtextsize
2021-05-21 18:45:09 -10:00
Linus Torvalds 45af60e7ce for-5.13-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCoEQkACgkQxWXV+ddt
 WDsn6Q//XXQVextL6g6Wjx0SR9b5C1ndSV841jNY+KQ0drBPSOBs+0SXI+nIWAK1
 iTpmj3s2qrRElZZ6DT4fKP28KnbUJed9+CcirNnN3IMOeauI760CLobXZLsw1wGH
 o0HKKgcPhw/v9o9jqX22rSfzDZ2Rx2KhZ8iEb1ZXIG5iJNFcnXCCoFOqk4I+UEvH
 /5734KU8RI3sCRhziSf/vDCF50p+BIWr8VilQkmZUzi0oa6Y1wXm0qd9j0unhICR
 NxcBk1NYdOosAvVRhSqync1BNLhXSctg4rwhLlSI5SDvt/Ivz5tguNr9HcizOvmW
 zyb0g1c3Pq0p2wQJLybbs1zn67d0+7Q23UPWx1C+IKU3nmX5mGWzToxjVOQASYaZ
 8UbzYAjUHtJpLDB4dp6+k5Pv/yfVGyhxXI+qLMWow77qRPPf7/vw5nEwTXmjcPRH
 9st0TopZVXI4IEpZP+HeNFdNONuPL3CqV0t1+MnC73WMhmUfXR5E8Yq5H3MscuFl
 smkrWUq/g+cmkiOw5r4MyadFuN1MsXGw4rOdbYjY4JqVht6gPkOp3P73Hme5rD3H
 Txw/1WKEl+w3I6wS0Dl/NFcMGOyl8gEv4rATDyRWkxfmCue2mcTGS/3jjjWWguu4
 +Q7e6p1390PLAvMV/rEDoYmFCoPSYp6trvupW+5fkZdOyei1SZM=
 =98LW
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes:

   - fix unaligned compressed writes in zoned mode

   - fix false positive lockdep warning when cloning inline extent

   - remove wrong BUG_ON in tree-log error handling"

* tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix parallel compressed writes
  btrfs: zoned: pass start block to btrfs_use_zone_append
  btrfs: do not BUG_ON in link_to_fixup_dir
  btrfs: release path before starting transaction when cloning inline extent
2021-05-21 13:24:12 -10:00
Linus Torvalds 8bb14ca171 7 SMB3 fixes, one for stable, 3 others fix problems found in testing handle leases, and a compounded request fix
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmCn2x8ACgkQiiy9cAdy
 T1HjnQv+M87Xx++VVaJzeLQQlKGA/vfkhM7YLEkIwxmbUpt8JURORoK91xVa/RZA
 eS/K2tYOilAuuV7VXXw6ng6WNCWE/l+BNT5FHZ4WJt71pE1/tN/NIACtOhBB01GO
 r+JhAE08zYLu8vA1Ax1EBtSSBjTLUjDX0fWMfwD4C/BBABw5VZISnkSEj2lC6wT9
 vovEalU9amMRrvlhK9Z+MRJRJFzxY4LingiEVlFIdLczCGia5PgSl3NXRY1//rNO
 wc//34cCGxBNc5Su5Bvn1kTZT5mdBFR98mLOuD+Dw55LlIlShKDnhZHGQDGPyQGT
 ey2w2b+pNAr3rwVNtU6JNmI7AiUllNHiDu5UsyB0ctDWJljzrILd4uPaWofcNXAh
 5qPRvuGsqjo3D/10DPshla1pJtmFr8eKXy8o6UVfMYQSHDo1LbqMll7ArGgV3Fxn
 B2g5N+ax1+DXZlykKJGhYBBkvGANuUBU/tq810i5BvLhfrc1dx+pJlZAeO5OxCSA
 SBUiirq4
 =neWC
 -----END PGP SIGNATURE-----

Merge tag '5.13-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Seven smb3 fixes: one for stable, three others fix problems found in
  testing handle leases, and a compounded request fix"

* tag '5.13-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  Fix KASAN identified use-after-free issue.
  Defer close only when lease is enabled.
  Fix kernel oops when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
  cifs: Fix inconsistent indenting
  cifs: fix memory leak in smb2_copychunk_range
  SMB3: incorrect file id in requests compounded with open
  cifs: remove deadstore in cifs_close_all_deferred_files()
2021-05-21 13:12:51 -10:00
Greg Kroah-Hartman fb05b14c5b debugfs: remove return value of debugfs_create_ulong()
No one checks the return value of debugfs_create_ulong(), as it's not
needed, so make the return value void, so that no one tries to do so in
the future.

Link: https://lore.kernel.org/r/20210521184340.1348539-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-21 20:59:55 +02:00
Greg Kroah-Hartman 393b06383f debugfs: remove return value of debugfs_create_bool()
No one checks the return value of debugfs_create_bool(), as it's not
needed, so make the return value void, so that no one tries to do so in
the future.

Link: https://lore.kernel.org/r/20210521184519.1356639-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-21 20:59:03 +02:00
Linus Torvalds a0e31f3a38 Merge branch 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo fix from Eric Biederman:
 "During the merge window an issue with si_perf and the siginfo ABI came
  up. The alpha and sparc siginfo structure layout had changed with the
  addition of SIGTRAP TRAP_PERF and the new field si_perf.

  The reason only alpha and sparc were affected is that they are the
  only architectures that use si_trapno.

  Looking deeper it was discovered that si_trapno is used for only a few
  select signals on alpha and sparc, and that none of the other
  _sigfault fields past si_addr are used at all. Which means technically
  no regression on alpha and sparc.

  While the alignment concerns might be dismissed the abuse of si_errno
  by SIGTRAP TRAP_PERF does have the potential to cause regressions in
  existing userspace.

  While we still have time before userspace starts using and depending
  on the new definition siginfo for SIGTRAP TRAP_PERF this set of
  changes cleans up siginfo_t.

   - The si_trapno field is demoted from magic alpha and sparc status
     and made an ordinary union member of the _sigfault member of
     siginfo_t. Without moving it of course.

   - si_perf is replaced with si_perf_data and si_perf_type ending the
     abuse of si_errno.

   - Unnecessary additions to signalfd_siginfo are removed"

* 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo
  signal: Deliver all of the siginfo perf data in _perf
  signal: Factor force_sig_perf out of perf_sigtrap
  signal: Implement SIL_FAULT_TRAPNO
  siginfo: Move si_trapno inside the union inside _si_fault
2021-05-21 06:12:52 -10:00
Huilong Deng cf1031ed47 jfs: Remove trailing semicolon in macros
Macros should not use a trailing semicolon.

Signed-off-by: Huilong Deng <denghuilong@cdjrlc.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2021-05-21 10:36:22 -05:00
zuoqilin 577ebd195f fs: Fix typo issue
Change 'inacitve' to 'inactive'.

Signed-off-by: zuoqilin <zuoqilin@yulong.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2021-05-21 10:36:00 -05:00
Phillip Potter a8867f4e38 ext4: fix memory leak in ext4_mb_init_backend on error path.
Fix a memory leak discovered by syzbot when a file system is corrupted
with an illegally large s_log_groups_per_flex.

Reported-by: syzbot+aa12d6106ea4ca1b6aae@syzkaller.appspotmail.com
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20210412073837.1686-1-phil@philpotter.co.uk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-05-20 23:29:32 -04:00
Andreas Gruenbacher b7f55d928e gfs2: Fix mmap locking for write faults
When a write fault occurs, we need to take the inode glock of the underlying
inode in exclusive mode.  Otherwise, there's no guarantee that the dirty page
will be written back to disk.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-05-21 05:16:38 +02:00
Rohith Surabattula 9687c85dfb Fix KASAN identified use-after-free issue.
[  612.157429] ==================================================================
[  612.158275] BUG: KASAN: use-after-free in process_one_work+0x90/0x9b0
[  612.158801] Read of size 8 at addr ffff88810a31ca60 by task kworker/2:9/2382

[  612.159611] CPU: 2 PID: 2382 Comm: kworker/2:9 Tainted: G
OE     5.13.0-rc2+ #98
[  612.159623] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.14.0-1.fc33 04/01/2014
[  612.159640] Workqueue:  0x0 (deferredclose)
[  612.159669] Call Trace:
[  612.159685]  dump_stack+0xbb/0x107
[  612.159711]  print_address_description.constprop.0+0x18/0x140
[  612.159733]  ? process_one_work+0x90/0x9b0
[  612.159743]  ? process_one_work+0x90/0x9b0
[  612.159754]  kasan_report.cold+0x7c/0xd8
[  612.159778]  ? lock_is_held_type+0x80/0x130
[  612.159789]  ? process_one_work+0x90/0x9b0
[  612.159812]  kasan_check_range+0x145/0x1a0
[  612.159834]  process_one_work+0x90/0x9b0
[  612.159877]  ? pwq_dec_nr_in_flight+0x110/0x110
[  612.159914]  ? spin_bug+0x90/0x90
[  612.159967]  worker_thread+0x3b6/0x6c0
[  612.160023]  ? process_one_work+0x9b0/0x9b0
[  612.160038]  kthread+0x1dc/0x200
[  612.160051]  ? kthread_create_worker_on_cpu+0xd0/0xd0
[  612.160092]  ret_from_fork+0x1f/0x30

[  612.160399] Allocated by task 2358:
[  612.160757]  kasan_save_stack+0x1b/0x40
[  612.160768]  __kasan_kmalloc+0x9b/0xd0
[  612.160778]  cifs_new_fileinfo+0xb0/0x960 [cifs]
[  612.161170]  cifs_open+0xadf/0xf20 [cifs]
[  612.161421]  do_dentry_open+0x2aa/0x6b0
[  612.161432]  path_openat+0xbd9/0xfa0
[  612.161441]  do_filp_open+0x11d/0x230
[  612.161450]  do_sys_openat2+0x115/0x240
[  612.161460]  __x64_sys_openat+0xce/0x140

When mod_delayed_work is called to modify the delay of pending work,
it might return false and queue a new work when pending work is
already scheduled or when try to grab pending work failed.

So, Increase the reference count when new work is scheduled to
avoid use-after-free.

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-20 12:20:42 -05:00
Linus Torvalds 50f09a3dd5 Char/misc driver fixes for 5.13-rc3
Here is a big set of char/misc/other driver fixes for 5.13-rc3.
 
 The majority here is the fallout of the umn.edu re-review of all prior
 submissions.  That resulted in a bunch of reverts along with the
 "correct" changes made, such that there is no regression of any of the
 potential fixes that were made by those individuals.  I would like to
 thank the over 80 different developers who helped with the review and
 fixes for this mess.
 
 Other than that, there's a few habanna driver fixes for reported issues,
 and some dyndbg fixes for reported problems.
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYKZCBg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynhRQCdGk6ri4oluyn/Z/2KAjvXDOmTmvgAn12VP42d
 S1Zmh4qRH2OWaLOBg7c2
 =qtxj
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "Here is a big set of char/misc/other driver fixes for 5.13-rc3.

  The majority here is the fallout of the umn.edu re-review of all prior
  submissions. That resulted in a bunch of reverts along with the
  "correct" changes made, such that there is no regression of any of the
  potential fixes that were made by those individuals. I would like to
  thank the over 80 different developers who helped with the review and
  fixes for this mess.

  Other than that, there's a few habanna driver fixes for reported
  issues, and some dyndbg fixes for reported problems.

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'char-misc-5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (82 commits)
  misc: eeprom: at24: check suspend status before disable regulator
  uio_hv_generic: Fix another memory leak in error handling paths
  uio_hv_generic: Fix a memory leak in error handling paths
  uio/uio_pci_generic: fix return value changed in refactoring
  Revert "Revert "ALSA: usx2y: Fix potential NULL pointer dereference""
  dyndbg: drop uninformative vpr_info
  dyndbg: avoid calling dyndbg_emit_prefix when it has no work
  binder: Return EFAULT if we fail BINDER_ENABLE_ONEWAY_SPAM_DETECTION
  cdrom: gdrom: initialize global variable at init time
  brcmfmac: properly check for bus register errors
  Revert "brcmfmac: add a check for the status of usb_register"
  video: imsttfb: check for ioremap() failures
  Revert "video: imsttfb: fix potential NULL pointer dereferences"
  net: liquidio: Add missing null pointer checks
  Revert "net: liquidio: fix a NULL pointer dereference"
  media: gspca: properly check for errors in po1030_probe()
  Revert "media: gspca: Check the return value of write_bridge for timeout"
  media: gspca: mt9m111: Check write_bridge for timeout
  Revert "media: gspca: mt9m111: Check write_bridge for timeout"
  media: dvb: Add check on sp8870_readreg return
  ...
2021-05-20 06:31:52 -10:00
Linus Torvalds 7ac177143c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCmN9AACgkQnJ2qBz9k
 QNn5ZwgAwnLdgBuILDqJwPaYpXOzvMhjjG8AwBDzhMYhhpt+OOCUevoRm7mDU7J2
 t/DlwWGMhpp80ku+x+AURR/ltOfFvw4QAHeIXPWjkoieFKcLOEvAjWWZP6oIFC12
 5e/QVXqK58fuRJwveYp4jZ+AXvDMoHJrDXsoTFezjBDIQQgzlIlrMzPavS/6UzUN
 mAF2sapE9lcQoRMfU8kktBWPVM/GpFkus2Q48EYFCZ1rp3aRyw/aahTVuvSUZCV0
 XiY6f2F7qgFLtomK6UurlxTc7rPsrG+UmNvGWuXf3R81UawegmKQeG5zcaMGrZs1
 kHyJQcP9nGYPLDXt/4kW9cY0s8oOKg==
 =RbOE
 -----END PGP SIGNATURE-----

Merge tag 'quota_for_v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull quota fixes from Jan Kara:
 "The most important part in the pull is disablement of the new syscall
  quotactl_path() which was added in rc1.

  The reason is some people at LWN discussion pointed out dirfd would be
  useful for this path based syscall and Christian Brauner agreed.

  Without dirfd it may be indeed problematic for containers. So let's
  just disable the syscall for now when it doesn't have users yet so
  that we have more time to mull over how to best specify the filesystem
  we want to work on"

* tag 'quota_for_v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Disable quotactl_path syscall
  quota: Use 'hlist_for_each_entry' to simplify code
2021-05-20 06:20:15 -10:00
Anna Schumaker a421d21860 NFSv4: Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return()
Commit de144ff423 changes _pnfs_return_layout() to call
pnfs_mark_matching_lsegs_return() passing NULL as the struct
pnfs_layout_range argument. Unfortunately,
pnfs_mark_matching_lsegs_return() doesn't check if we have a value here
before dereferencing it, causing an oops.

I'm able to hit this crash consistently when running connectathon basic
tests on NFS v4.1/v4.2 against Ontap.

Fixes: de144ff423 ("NFSv4: Don't discard segments marked for return in _pnfs_return_layout()")
Cc: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-20 12:17:08 -04:00
Yang Li d1d973950a pNFS/NFSv4: Remove redundant initialization of 'rd_size'
Variable 'rd_size' is being initialized however
this value is never read as 'rd_size' is assigned
a new value in for statement. Remove the redundant
assignment.

Clean up clang warning:

fs/nfs/pnfs.c:2681:6: warning: Value stored to 'rd_size' during its
initialization is never read [clang-analyzer-deadcode.DeadStores]

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-20 12:17:08 -04:00
Dan Carpenter 769b01ea68 NFS: fix an incorrect limit in filelayout_decode_layout()
The "sizeof(struct nfs_fh)" is two bytes too large and could lead to
memory corruption.  It should be NFS_MAXFHSIZE because that's the size
of the ->data[] buffer.

I reversed the size of the arguments to put the variable on the left.

Fixes: 16b374ca43 ("NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-20 12:17:08 -04:00
zhouchuangao bb00238890 fs/nfs: Use fatal_signal_pending instead of signal_pending
We set the state of the current process to TASK_KILLABLE via
prepare_to_wait(). Should we use fatal_signal_pending() to detect
the signal here?

Fixes: b4868b44c5 ("NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE")
Signed-off-by: zhouchuangao <zhouchuangao@vivo.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-20 12:15:35 -04:00
Darrick J. Wong e3c2b04747 xfs: restore old ioctl definitions
These ioctl definitions in xfs_fs.h are part of the userspace ABI and
were mistakenly removed during the 5.13 merge window.

Fixes: 9fefd5db08 ("xfs: convert to fileattr")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-05-20 08:31:22 -07:00
Darrick J. Wong 16c9de54dc xfs: fix deadlock retry tracepoint arguments
sc->ip is the inode that's being scrubbed, which means that it's not set
for scrub types that don't involve inodes.  If one of those scrubbers
(e.g. inode btrees) returns EDEADLOCK, we'll trip over the null pointer.
Fix that by reporting either the file being examined or the file that
was used to call scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-05-20 08:31:22 -07:00
Darrick J. Wong 676a659b60 xfs: retry allocations when locality-based search fails
If a realtime allocation fails because we can't find a sufficiently
large free extent satisfying locality rules, relax the locality rules
and try again.  This reduces the occurrence of short writes to realtime
files when the write size is large and the free space is fragmented.

This was originally discovered by running generic/186 with the realtime
reflink patchset and a 128k cow extent size hint, but the short write
symptoms can manifest with a 128k extent size hint and no reflink, so
apply the fix now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-05-20 08:28:34 -07:00
Gulam Mohamed bc6a385132 block: fix a race between del_gendisk and BLKRRPART
When BLKRRPART is called concurrently with del_gendisk, the partitions
rescan can create a stale partition that will never be be cleaned up.

Fix this by checking the the disk is up before rescanning partitions
while under bd_mutex.

Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514131842.1600568-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-20 07:59:35 -06:00
Christoph Hellwig 6c60ff048c block: prevent block device lookups at the beginning of del_gendisk
As an artifact of how gendisk lookup used to work in earlier kernels,
GENHD_FL_UP is only cleared very late in del_gendisk, and a global lock
is used to prevent opens from succeeding while del_gendisk is tearing
down the gendisk.  Switch to clearing the flag early and under bd_mutex
so that callers can use bd_mutex to stabilize the flag, which removes
the need for the global mutex.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514131842.1600568-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-20 07:59:35 -06:00
Johannes Thumshirn 764c7c9a46 btrfs: zoned: fix parallel compressed writes
When multiple processes write data to the same block group on a
compressed zoned filesystem, the underlying device could report I/O
errors and data corruption is possible.

This happens because on a zoned file system, compressed data writes
where sent to the device via a REQ_OP_WRITE instead of a
REQ_OP_ZONE_APPEND operation. But with REQ_OP_WRITE and parallel
submission it cannot be guaranteed that the data is always submitted
aligned to the underlying zone's write pointer.

The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a
zoned filesystem is non intrusive on a regular file system or when
submitting to a conventional zone on a zoned filesystem, as it is
guarded by btrfs_use_zone_append.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: 9d294a685f ("btrfs: zoned: enable to mount ZONED incompat flag")
CC: stable@vger.kernel.org # 5.12.x: e380adfc213a13: btrfs: zoned: pass start block to btrfs_use_zone_append
CC: stable@vger.kernel.org # 5.12.x
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-20 15:51:07 +02:00
Johannes Thumshirn e380adfc21 btrfs: zoned: pass start block to btrfs_use_zone_append
btrfs_use_zone_append only needs the passed in extent_map's block_start
member, so there's no need to pass in the full extent map.

This also enables the use of btrfs_use_zone_append in places where we only
have a start byte but no extent_map.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-20 15:50:49 +02:00
Pavel Begunkov ba5ef6dc8a io_uring: fortify tctx/io_wq cleanup
We don't want anyone poking into tctx->io_wq awhile it's being destroyed
by io_wq_put_and_exit(), and even though it shouldn't even happen, if
buggy would be preferable to get a NULL-deref instead of subtle delayed
failure or UAF.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/827b021de17926fd807610b3e53a5a5fa8530856.1621513214.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-20 07:29:11 -06:00
Bob Peterson f5456b5d67 gfs2: Clean up revokes on normal withdraws
Before this patch, the system ail lists were cleaned up if the logd
process withdrew, but on other withdraws, they were not cleaned up.
This included the cleaning up of the revokes as well.

This patch reorganizes things a bit so that all withdraws (not just logd)
clean up the ail lists, including any pending revokes.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-05-20 13:31:37 +02:00
Bob Peterson 865cc3e9cc gfs2: fix a deadlock on withdraw-during-mount
Before this patch, gfs2 would deadlock because of the following
sequence during mount:

mount
   gfs2_fill_super
      gfs2_make_fs_rw <--- Detects IO error with glock
         kthread_stop(sdp->sd_quotad_process);
            <--- Blocked waiting for quotad to finish

logd
   Detects IO error and the need to withdraw
   calls gfs2_withdraw
      gfs2_make_fs_ro
         kthread_stop(sdp->sd_quotad_process);
            <--- Blocked waiting for quotad to finish

gfs2_quotad
   gfs2_statfs_sync
      gfs2_glock_wait <---- Blocked waiting for statfs glock to be granted

glock_work_func
   do_xmote <---Detects IO error, can't release glock: blocked on withdraw
      glops->go_inval
      glock_blocked_by_withdraw
         requeue glock work & exit <--- work requeued, blocked by withdraw

This patch makes a special exception for the statfs system inode glock,
which allows the statfs glock UNLOCK to proceed normally. That allows the
quotad daemon to exit during the withdraw, which allows the logd daemon
to exit during the withdraw, which allows the mount to exit.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-05-20 13:31:37 +02:00
Bob Peterson 20265d9a67 gfs2: fix scheduling while atomic bug in glocks
Before this patch, in the unlikely event that gfs2_glock_dq encountered
a withdraw, it would do a wait_on_bit to wait for its journal to be
recovered, but it never released the glock's spin_lock, which caused a
scheduling-while-atomic error.

This patch unlocks the lockref spin_lock before waiting for recovery.

Fixes: 601ef0d52e ("gfs2: Force withdraw to replay journals and wait for it to finish")
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-05-20 13:31:37 +02:00
Bob Peterson 4194dec4b4 gfs2: Fix I_NEW check in gfs2_dinode_in
Patch 4a378d8a0d added a new check for I_NEW inodes, but unfortunately
it used the wrong variable, i_flags. This caused GFS2 to withdraw when
gfs2_lookup_by_inum needed to refresh an I_NEW inode. This patch switches
to use the correct variable, i_state.

Fixes: 4a378d8a0d ("gfs2: be careful with inode refresh")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-05-20 13:31:37 +02:00
Andreas Gruenbacher 43a511c44e gfs2: Prevent direct-I/O write fallback errors from getting lost
When a direct I/O write falls entirely and falls back to buffered I/O and the
buffered I/O fails, the write failed with return value 0 instead of the error
number reported by the buffered I/O. Fix that.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-05-20 13:31:36 +02:00
Arturo Giusti fa236c2b2d udf: Fix NULL pointer dereference in udf_symlink function
In function udf_symlink, epos.bh is assigned with the value returned
by udf_tgetblk. The function udf_tgetblk is defined in udf/misc.c
and returns the value of sb_getblk function that could be NULL.
Then, epos.bh is used without any check, causing a possible
NULL pointer dereference when sb_getblk fails.

This fix adds a check to validate the value of epos.bh.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=213083
Signed-off-by: Arturo Giusti <koredump@protonmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-05-20 12:14:44 +02:00
Rohith Surabattula 0ab95c2510 Defer close only when lease is enabled.
When smb2 lease parameter is disabled on server. Server grants
batch oplock instead of RHW lease by default on open, inode page cache
needs to be zapped immediatley upon close as cache is not valid.

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-19 21:11:28 -05:00
Rohith Surabattula 860b69a9d7 Fix kernel oops when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
Removed oplock_break_received flag which was added to achieve
synchronization between oplock handler and open handler by earlier commit.

It is not needed because there is an existing lock open_file_lock to achieve
the same. find_readable_file takes open_file_lock and then traverses the
openFileList. Similarly, cifs_oplock_break while closing the deferred
handle (i.e cifsFileInfo_put) takes open_file_lock and then sends close
to the server.

Added comments for better readability.

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-19 21:11:26 -05:00
Jiapeng Chong e83aa3528a cifs: Fix inconsistent indenting
Eliminate the follow smatch warning:

fs/cifs/fs_context.c:1148 smb3_fs_context_parse_param() warn:
inconsistent indenting.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-19 21:11:09 -05:00
Ronnie Sahlberg d201d7631c cifs: fix memory leak in smb2_copychunk_range
When using smb2_copychunk_range() for large ranges we will
run through several iterations of a loop calling SMB2_ioctl()
but never actually free the returned buffer except for the final
iteration.
This leads to memory leaks everytime a large copychunk is requested.

Fixes: 9bf0c9cd43 ("CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files")
Cc: <stable@vger.kernel.org>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-19 19:19:20 -05:00
Linus Torvalds c3d0e3fd41 fs.idmapped.mount_setattr.v5.13-rc3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYKT8wgAKCRCRxhvAZXjc
 op2mAP9hyc3sp2/HvEuTYDc6LmljPNCqdKeCP1eiX5SZR0yMVwEAwR5xym7YeqYZ
 LRj+xjuvOmrSNhcxpZqLpXuPOcY78wM=
 =E4a/
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.mount_setattr.v5.13-rc3' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux

Pull mount_setattr fix from Christian Brauner:
 "This makes an underlying idmapping assumption more explicit.

  We currently don't have any filesystems that support idmapped mounts
  which are mountable inside a user namespace, i.e. where s_user_ns !=
  init_user_ns. That was a deliberate decision for now as userns root
  can just mount the filesystem themselves.

  Express this restriction explicitly and enforce it until there's a
  real use-case for this. This way we can notice it and will have a
  chance to adapt and audit our translation helpers and fstests
  appropriately if we need to support such filesystems"

* tag 'fs.idmapped.mount_setattr.v5.13-rc3' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
  fs/mount_setattr: tighten permission checks
2021-05-19 06:12:31 -10:00
Steve French c0d46717b9 SMB3: incorrect file id in requests compounded with open
See MS-SMB2 3.2.4.1.4, file ids in compounded requests should be set to
0xFFFFFFFFFFFFFFFF (we were treating it as u32 not u64 and setting
it incorrectly).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
2021-05-19 10:10:58 -05:00
Al Viro e4b2755318 getcwd(2): clean up error handling
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:15:58 -04:00
Al Viro cf4febc1ad d_path: prepend_path() is unlikely to return non-zero
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:15:58 -04:00
Al Viro 008673ff74 d_path: prepend_path(): lift the inner loop into a new helper
... and leave the rename_lock/mount_lock handling in prepend_path()
itself

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:15:57 -04:00
Al Viro 2dac0ad175 d_path: prepend_path(): lift resetting b in case when we'd return 3 out of loop
preparation to extracting the loop into helper (next commit)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:15:56 -04:00
Al Viro 7c0d552fd5 d_path: prepend_path(): get rid of vfsmnt
it's kept equal to &mnt->mnt all along.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:15:56 -04:00
Al Viro ad08ae5865 d_path: introduce struct prepend_buffer
We've a lot of places where we have pairs of form (pointer to end
of buffer, amount of space left in front of that).  These sit in pairs of
variables located next to each other and usually passed by reference.
Turn those into instances of new type (struct prepend_buffer) and pass
reference to the pair instead of pairs of references to its fields.

Declared and initialized by DECLARE_BUFFER(name, buf, buflen).

extract_string(prepend_buffer) returns the buffer contents if
no overflow has happened, ERR_PTR(ENAMETOOLONG) otherwise.
All places where we used to have that boilerplate converted to use
of that helper.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:14:53 -04:00
Al Viro 95b55c42f6 d_path: make prepend_name() boolean
It returns only 0 or -ENAMETOOLONG and both callers only check if
the result is negative.  Might as well return true on success and
false on failure...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:08:12 -04:00
Al Viro 01a4428ee7 d_path: lift -ENAMETOOLONG handling into callers of prepend_path()
The only negative value ever returned by prepend_path() is -ENAMETOOLONG
and callers can recognize that situation (overflow) by looking at the
sign of buflen.  Lift that into the callers; we already have the
same logics (buf if buflen is non-negative, ERR_PTR(-ENAMETOOLONG) otherwise)
in several places and that'll become a new primitive several commits down
the road.

Make prepend_path() return 0 instead of -ENAMETOOLONG.  That makes for
saner calling conventions (0/1/2/3/-ENAMETOOLONG is obnoxious) and
callers actually get simpler, especially once the aforementioned
primitive gets added.

In prepend_path() itself we switch prepending the / (in case of
empty path) to use of prepend() - no need to open-code that, compiler
will do the right thing.  It's exactly the same logics as in
__dentry_path().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:08:12 -04:00
Al Viro d8548232ea d_path: don't bother with return value of prepend()
Only simple_dname() checks it, and there we can simply do those
calls and check for overflow (by looking of negative buflen)
in the end.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:08:11 -04:00
Al Viro a0378fb9b3 getcwd(2): saner logics around prepend_path() call
The only negative value that might get returned by prepend_path() is
-ENAMETOOLONG, and that happens only on overflow.  The same goes for
prepend_unreachable().  Overflow is detectable by observing negative
buflen, so we can simplify the control flow around the prepend_path()
call.  Expand prepend_unreachable(), while we are at it - that's the
only caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:08:11 -04:00
Al Viro 9024348f53 d_path: get rid of path_with_deleted()
expand in the sole caller; transform the initial prepends similar to
what we'd done in dentry_path() (prepend_path() will fail the right
way if we call it with negative buflen, same as __dentry_path() does).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:08:10 -04:00
Al Viro 3acca04326 d_path: regularize handling of root dentry in __dentry_path()
All path-forming primitives boil down to sequence of prepend_name()
on dentries encountered along the way toward root.  Each time we prepend
/ + dentry name to the buffer.  Normally that does exactly what we want,
but there's a corner case when we don't call prepend_name() at all (in case
of __dentry_path() that happens if we are given root dentry).  We obviously
want to end up with "/", rather than "", so this corner case needs to be
handled.

__dentry_path() used to manually put '/' in the end of buffer before
doing anything else, to be overwritten by the first call of prepend_name()
if one happens and to be left in place if we don't call prepend_name() at
all.  That required manually checking that we had space in the buffer
(prepend_name() and prepend() take care of such checks themselves) and lead
to clumsy keeping track of return value.

A better approach is to check if the main loop has added anything
into the buffer and prepend "/" if it hasn't.  A side benefit of using prepend()
is that it does the right thing if we'd already run out of buffer, making
the overflow-handling logics simpler.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-05-18 20:06:36 -04:00
Eric W. Biederman 922e301304 signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo
With the addition of ssi_perf_data and ssi_perf_type struct signalfd_siginfo
is dangerously close to running out of space.  All that remains is just
enough space for two additional 64bit fields.  A practice of adding all
possible siginfo_t fields into struct singalfd_siginfo can not be supported
as adding the missing fields ssi_lower, ssi_upper, and ssi_pkey would
require two 64bit fields and one 32bit fields.  In practice the fields
ssi_perf_data and ssi_perf_type can never be used by signalfd as the signal
that generates them always delivers them synchronously to the thread that
triggers them.

Therefore until someone actually needs the fields ssi_perf_data and
ssi_perf_type in signalfd_siginfo remove them.  This leaves a bit more room
for future expansion.

v1: https://lkml.kernel.org/r/20210503203814.25487-12-ebiederm@xmission.com
v2: https://lkml.kernel.org/r/20210505141101.11519-12-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-5-ebiederm@xmission.com
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-05-18 16:20:54 -05:00
Eric W. Biederman 0683b53197 signal: Deliver all of the siginfo perf data in _perf
Don't abuse si_errno and deliver all of the perf data in _perf member
of siginfo_t.

Note: The data field in the perf data structures in a u64 to allow a
pointer to be encoded without needed to implement a 32bit and 64bit
version of the same structure.  There already exists a 32bit and 64bit
versions siginfo_t, and the 32bit version can not include a 64bit
member as it only has 32bit alignment.  So unsigned long is used in
siginfo_t instead of a u64 as unsigned long can encode a pointer on
all architectures linux supports.

v1: https://lkml.kernel.org/r/m11rarqqx2.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210503203814.25487-10-ebiederm@xmission.com
v3: https://lkml.kernel.org/r/20210505141101.11519-11-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-4-ebiederm@xmission.com
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-05-18 16:20:54 -05:00
Eric W. Biederman 9abcabe311 signal: Implement SIL_FAULT_TRAPNO
Now that si_trapno is part of the union in _si_fault and available on
all architectures, add SIL_FAULT_TRAPNO and update siginfo_layout to
return SIL_FAULT_TRAPNO when the code assumes si_trapno is valid.

There is room for future changes to reduce when si_trapno is valid but
this is all that is needed to make si_trapno and the other members of
the the union in _sigfault mutually exclusive.

Update the code that uses siginfo_layout to deal with SIL_FAULT_TRAPNO
and have the same code ignore si_trapno in in all other cases.

v1: https://lkml.kernel.org/r/m1o8dvs7s7.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210505141101.11519-6-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-2-ebiederm@xmission.com
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-05-18 16:20:34 -05:00
Chuck Lever d6cbe98ff3 NFSD: Update nfsd_cb_args tracepoint
Clean-up: Re-order the display of IP address and client ID to be
consistent with other _cb_ tracepoints.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever 1d2bf65983 NFSD: Remove the nfsd_cb_work and nfsd_cb_done tracepoints
Clean up: These are noise in properly working systems. If you really
need to observe the operation of the callback mechanism, use the
sunrpc:rpc\* tracepoints along with the workqueue tracepoints.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever 4ade892ae1 NFSD: Add an nfsd_cb_probe tracepoint
Record a tracepoint event when the server performs a callback
probe. This event can be enabled as a group with other nfsd_cb
tracepoints.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever 17d76ddf76 NFSD: Replace the nfsd_deleg_break tracepoint
Renamed so it can be enabled as a set with the other nfsd_cb_
tracepoints. And, consistent with those tracepoints, report the
address of the client, the client ID the server has given it, and
the state ID being recalled.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever 87512386e9 NFSD: Add an nfsd_cb_offload tracepoint
Record the arguments of CB_OFFLOAD callbacks so we can better
observe asynchronous copy-offload behavior. For example:

nfsd-995   [008]  7721.934222: nfsd_cb_offload:
        addr=192.168.2.51:0 client 6092a47c:35a43fc1 fh_hash=0x8739113a
        count=116528 status=0

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever 2cde7f8118 NFSD: Add an nfsd_cb_lm_notify tracepoint
When the server kicks off a CB_LM_NOTIFY callback, record its
arguments so we can better observe asynchronous locking behavior.
For example:

            nfsd-998   [002]  1471.705873: nfsd_cb_notify_lock:  addr=192.168.2.51:0 client 6092a47c:35a43fc1 fh_hash=0x8950b23a

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever 3c92fba557 NFSD: Enhance the nfsd_cb_setup tracepoint
Display the transport protocol and authentication flavor so admins
can see what they might be getting wrong.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever 9f57c6062b NFSD: Remove spurious cb_setup_err tracepoint
This path is not really an error path, so the tracepoint I added
there is just noise.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever b200f0e353 NFSD: Adjust cb_shutdown tracepoint
Show when the upper layer requested a shutdown. RPC tracepoints can
already show when rpc_shutdown_client() is called.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever 806d65b617 NFSD: Add cb_lost tracepoint
Provide more clarity about when the callback channel is in trouble.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever 167145cc64 NFSD: Drop TRACE_DEFINE_ENUM for NFSD4_CB_<state> macros
TRACE_DEFINE_ENUM() is necessary for enum {} but not for C macros.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:04 -04:00
Chuck Lever 8476c69a7f NFSD: Capture every CB state transition
We were missing one.

As a clean-up, add a helper that sets the new CB state and fires
a tracepoint. The tracepoint fires only when the state changes, to
help reduce trace log noise.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:03 -04:00
Chuck Lever 1736aec82a NFSD: Constify @fh argument of knfsd_fh_hash()
Enable knfsd_fh_hash() to be invoked in functions where the
filehandle pointer is a const.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:03 -04:00
Chuck Lever e8f80c5545 NFSD: Add tracepoints for EXCHANGEID edge cases
Some of the most common cases are traced. Enough infrastructure is
now in place that more can be added later, as needed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:03 -04:00
Chuck Lever 237f91c85a NFSD: Add tracepoints for SETCLIENTID edge cases
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-05-18 13:44:03 -04:00