Commit graph

62832 commits

Author SHA1 Message Date
Frank Sorenson f52aa79df4 cifs: Fix mode output in debugging statements
A number of the debug statements output file or directory mode
in hex.  Change these to print using octal.
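
To illustrate why octal is the more readable choice for modes, here is a minimal user-space sketch. The helper `format_mode()` is hypothetical and is not taken from the cifs code; it only contrasts the two format strings.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from fs/cifs): render a mode either in octal,
 * the way permission bits are normally read, or in hex, the way the old
 * debug statements printed it.
 */
static int format_mode(char *buf, size_t len, unsigned int mode, int octal)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, octal ? "mode=0%o" : "mode=0x%x", mode);
}
```

For a mode of 0755 the octal form gives `mode=0755`, while the hex form gives `mode=0x1ed`, which has to be converted by hand before the permission bits can be read.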

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-02-12 22:24:26 -06:00
Jens Axboe 7563439adf io-wq: don't call kXalloc_node() with non-online node
Glauber reports a crash on init on a box he has:

 RIP: 0010:__alloc_pages_nodemask+0x132/0x340
 Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
 RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
 RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002
 R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000
 R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0
 FS:  00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  alloc_slab_page+0x46/0x320
  new_slab+0x9d/0x4e0
  ___slab_alloc+0x507/0x6a0
  ? io_wq_create+0xb4/0x2a0
  __slab_alloc+0x1c/0x30
  kmem_cache_alloc_node_trace+0xa6/0x260
  io_wq_create+0xb4/0x2a0
  io_uring_setup+0x97f/0xaa0
  ? io_remove_personalities+0x30/0x30
  ? io_poll_trigger_evfd+0x30/0x30
  do_syscall_64+0x5b/0x1c0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f4d116cb1ed

which is due to the 'wqe' and 'worker' allocation being node affine.
But it isn't valid to call the node affine allocation if the node isn't
online.

Setup structures for even offline nodes, as usual, but skip them in
terms of thread setup to not waste resources. If the node isn't online,
just alloc memory with NUMA_NO_NODE.

Reported-by: Glauber Costa <glauber@scylladb.com>
Tested-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-12 17:43:22 -07:00
Trond Myklebust efeda80da3 NFSv4: Fix revalidation of dentries with delegations
If a dentry was not initially looked up while we were holding a
delegation, then we do still need to revalidate that it still holds
the same name. If there are multiple hard links to the same file,
then all the hard links need validation.

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
[Anna: Put nfs_unset_verifier_delegated() under CONFIG_NFS_V4]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-12 13:55:25 -05:00
Anand Jain 1b9867eb61 btrfs: sysfs, move device id directories to UUID/devinfo
Originally it was planned to create device id directories under
UUID/devinfo, but they ended up under UUID/devices by mistake. We really
want them under devinfo so the bare device node names are not mixed with
device ids and are easy to enumerate.

Fixes: 668e48af7a ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 18:28:22 +01:00
Anand Jain a013d141ec btrfs: sysfs, add UUID/devinfo kobject
Create directory /sys/fs/btrfs/UUID/devinfo to hold device directories
named by device id (unlike /devices).

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 18:28:18 +01:00
Filipe Manana 28553fa992 Btrfs: fix race between shrinking truncate and fiemap
When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).

Consider the following example:

1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
   and a 1MiB prealloc extent at file offset 128KiB;

2) Task A starts doing a shrinking truncate of our inode to reduce it to
   a size of 64KiB. Before it searches the subvolume tree for file
   extent items to delete, it drops all the extent maps in the range
   from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();

3) Task B starts doing a fiemap against our inode. When looking up for
   the inode's extent maps in the range from 128KiB to (u64)-1, it
   doesn't find any in the inode's extent map tree, since they were
   removed by task A.  Because it didn't find any in the extent map
   tree, it scans the inode's subvolume tree for file extent items, and
   it finds the 1MiB prealloc extent at file offset 128KiB, then it
   creates an extent map based on that file extent item and adds it to
   inode's extent map tree (this ends up being done by
   btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
   get_extent_skip_holes());

4) Task A then drops the prealloc extent at file offset 128KiB and
   shrinks the 128KiB extent at file offset 0 to a length of 64KiB. The
   truncation operation finishes and we end up with an extent map
   representing a 1MiB prealloc extent at file offset 128KiB, even
   though that extent no longer exists.

After this the two types of problems we have are:

1) Future calls to fiemap always report that a 1MiB prealloc extent
   exists at file offset 128KiB. This is stale information, no longer
   correct;

2) If the size of the file is increased, by a truncate operation that
   increases the file size or by a write into a file offset > 64KiB for
   example, we end up not inserting file extent items to represent holes
   for any range between 128KiB and 128KiB + 1MiB, since the hole
   expansion function, btrfs_cont_expand() will skip hole insertion for
   any range for which an extent map exists that represents a prealloc
   extent. This causes fsck to complain about missing file extent items
   when not using the NO_HOLES feature.

The second issue could often be triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.

Essentially the problem happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.
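
Why locking that range serializes the two tasks can be seen with a small range-overlap sketch. The types and helpers below (`struct byte_range`, `ranges_conflict()`, `truncate_lock_range()`) are hypothetical and only model inclusive byte ranges; they are not the btrfs iotree API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive byte range, as a stand-in for a locked file range. */
struct byte_range {
    uint64_t start;
    uint64_t end;
};

/* Two range locks must wait for each other only if the ranges overlap. */
static bool ranges_conflict(struct byte_range a, struct byte_range b)
{
    assert(a.start <= a.end && b.start <= b.end);
    return a.start <= b.end && b.start <= a.end;
}

/* Range a shrinking truncate would lock: new size up to (u64)-1. */
static struct byte_range truncate_lock_range(uint64_t new_size)
{
    struct byte_range r = { new_size, UINT64_MAX };
    return r;
}
```

In the example from the commit message, a truncate to 64KiB covers 64KiB to (u64)-1, which overlaps a fiemap over 128KiB to (u64)-1, so the two can no longer interleave.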

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 17:17:10 +01:00
David Sterba 10a3a3edc5 btrfs: log message when rw remount is attempted with unclean tree-log
A remount to a read-write filesystem is not safe when there's a tree-log
to be replayed. Files that could be opened until now might be affected
by the changes in the tree-log.

A regular mount is needed to replay the log so the filesystem presents
a consistent view with the pending changes included.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 17:17:00 +01:00
David Sterba e8294f2f6a btrfs: print message when tree-log replay starts
There's no logged information about tree-log replay although this is
something that points to a previous unclean unmount. Other filesystems
report that as well.

Suggested-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 17:16:57 +01:00
Filipe Manana ac05ca913e Btrfs: fix race between using extent maps and merging them
We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as a consequence see an inconsistent state
of the extent map - for example, it sees the new length but has already
seen the old start offset.

With luck this triggers a BUG_ON(), and not some silent bug, such as the
following one in __do_readpage():

  $ cat -n fs/btrfs/extent_io.c
  3061  static int __do_readpage(struct extent_io_tree *tree,
  3062                           struct page *page,
  (...)
  3127                  em = __get_extent_map(inode, page, pg_offset, cur,
  3128                                        end - cur + 1, get_extent, em_cached);
  3129                  if (IS_ERR_OR_NULL(em)) {
  3130                          SetPageError(page);
  3131                          unlock_extent(tree, cur, end);
  3132                          break;
  3133                  }
  3134                  extent_offset = cur - em->start;
  3135                  BUG_ON(extent_map_end(em) <= cur);
  (...)

Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().

We have an inode with a size of 8KiB and 2 extent maps:

  extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
            a previous transaction

  extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
            persisted but writeback started for it already. The extent map
            is pinned since there's writeback and an ordered extent in
            progress, so it can not be merged with extent map A yet

The following sequence of steps leads to the BUG_ON():

1) The ordered extent for extent B completes, the respective page gets its
   writeback bit cleared and the extent map is unpinned, at that point it
   is not yet merged with extent map A because it's in the list of modified
   extents;

2) Due to memory pressure, or some other reason, the MM subsystem releases
   the page corresponding to extent B - btrfs_releasepage() is called and
   returns 1, meaning the page can be released as it's not dirty, not under
   writeback anymore and the extent range is not locked in the inode's
   iotree. However the extent map is not released, either because we are
   not in a context that allows memory allocations to block or because the
   inode's size is smaller than 16MiB - in this case our inode has a size
   of 8KiB;

3) Task B needs to read extent B and ends up in __do_readpage() through
   the btrfs_readpage() callback. At __do_readpage() it gets a reference
   to extent map B;

4) Task A, doing a fast fsync, calls clear_em_logging() against extent map B
   while holding the write lock on the inode's extent map tree - this
   results in try_merge_map() being called and since it's possible to merge
   extent map B with extent map A now (the extent map B was removed from
   the list of modified extents), the merging begins - it sets extent map
   B's start offset to 0 (was 4KiB), but before it increments the map's
   length to 8KiB (4KiB + 4KiB), task B is at:

   BUG_ON(extent_map_end(em) <= cur);

   The call to extent_map_end() sees the extent map has a start of 0
   and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
   the BUG_ON() is triggered.

So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still be using it,
and needs to see a consistent map while using it. Generally this is very
rare since most paths that look up and use extent maps also have the file
range locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it to avoid serialization with concurrent
reads.

Fix this by not allowing an extent map to be merged if it's being used
by tasks other than the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).

Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 17:16:46 +01:00
Wenwen Wang f311ade3a7 btrfs: ref-verify: fix memory leaks
In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
kmalloc(), respectively. In the following code, if an error occurs, the
execution will be redirected to 'out' or 'out_unlock' and the function will
be exited. However, on some of the paths, 'ref' and 'ra' are not
deallocated, leading to memory leaks. For example, if 'action' is
BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
value indicates an error, the execution will be redirected to 'out'. But,
'ref' is not deallocated on this path, causing a memory leak.

To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
the function when an error is encountered.
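
The cleanup pattern can be sketched in user-space C. Everything here is hypothetical (`record_ref()`, the counting allocator, the `fail_step` switch that simulates a failing later step); it only shows both allocations being released on every error path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_allocs; /* counts outstanding allocations for the demo */

static void *sketch_alloc(size_t n)
{
    void *p = calloc(1, n);
    if (p)
        live_allocs++;
    return p;
}

static void sketch_free(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

/* Hypothetical function: allocate two objects, free both on any error. */
static int record_ref(int fail_step, void **out_ref, void **out_ra)
{
    void *ref = sketch_alloc(32);
    void *ra = sketch_alloc(32);
    int ret = 0;

    assert(out_ref != NULL && out_ra != NULL);
    if (!ref || !ra) {
        ret = -ENOMEM;
        goto out_free;
    }
    if (fail_step) { /* simulated failure of a later step */
        ret = -EINVAL;
        goto out_free;
    }
    *out_ref = ref; /* success: ownership passes to the caller */
    *out_ra = ra;
    return 0;

out_free:
    sketch_free(ra);
    sketch_free(ref);
    return ret;
}
```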

CC: stable@vger.kernel.org # 4.15+
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 17:16:31 +01:00
Linus Torvalds 359c92c02b dax fixes 5.6-rc1
- Fix RWF_NOWAIT writes to properly return -EAGAIN
 
 - Clean up an unused helper
 
 - Update dax_writeback_mapping_range to not need a block_device argument

Merge tag 'dax-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull dax fixes from Dan Williams:
 "A fix for an xfstest failure, a cleanup, and an update that removes an
  fsdax dependency on block devices.

  Summary:

   - Fix RWF_NOWAIT writes to properly return -EAGAIN

   - Clean up an unused helper

   - Update dax_writeback_mapping_range to not need a block_device
     argument"

* tag 'dax-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: pass NOWAIT flag to iomap_apply
  dax: Get rid of fs_dax_get_by_host() helper
  dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()
2020-02-11 16:52:08 -08:00
Xiubo Li 3b20bc2fe4 ceph: noacl mount option is effectively ignored
For the old mount API, the module parameter parsing function is
called in ceph_mount(), just after the default posix acl flag is set,
so the mount option can enable or disable it.

But for the new mount API, the module parameter parsing function is
called before ceph_get_tree(), so the posix acl will always
be enabled.

Fixes: 82995cc6c5 ("libceph, rbd, ceph: convert to use the new mount API")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-02-11 17:04:40 +01:00
Ilya Dryomov b27a939e83 ceph: canonicalize server path in place
syzbot reported that 4fbc0c711b ("ceph: remove the extra slashes in
the server path") had caused a regression where an allocation could be
done under a spinlock -- compare_mount_options() is called by sget_fc()
with sb_lock held.

We don't really need the supplied server path, so canonicalize it
in place and compare it directly.  To make this work, the leading
slash is kept around and the logic in ceph_real_mount() to skip it
is restored.  CEPH_MSG_CLIENT_SESSION now reports the same (i.e.
canonicalized) path, with the leading slash of course.

Fixes: 4fbc0c711b ("ceph: remove the extra slashes in the server path")
Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-02-11 17:04:40 +01:00
Xiubo Li 8e4473bb50 ceph: do not execute direct write in parallel if O_APPEND is specified
In O_APPEND & O_DIRECT mode, data from different writers may overlap,
since the writers only take the shared lock.

For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT
mode:

          Writer1                         Writer2

     shared_lock()                   shared_lock()
     getattr(CAP_SIZE)               getattr(CAP_SIZE)
     iocb->ki_pos = EOF              iocb->ki_pos = EOF
     write(data1)
                                     write(data2)
     shared_unlock()                 shared_unlock()

Both writes start at the same file offset, the old EOF, so data2
overlaps data1.

Switch to exclusive lock instead when O_APPEND is specified.
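
The difference between the two locking modes can be simulated sequentially. This is a hypothetical model (`struct sketch_file` and the helpers are not ceph code): under the shared lock both writers sample EOF before either write lands, under the exclusive lock the second writer samples EOF only after the first has finished.

```c
#include <assert.h>
#include <stddef.h>

/* Only the file size matters for this model. */
struct sketch_file {
    size_t size;
};

static void write_at(struct sketch_file *f, size_t pos, size_t len)
{
    assert(f != NULL);
    if (pos + len > f->size)
        f->size = pos + len;
}

/* Shared lock: both writers read the same EOF, so the writes overlap. */
static size_t append_two_shared(struct sketch_file *f, size_t len1, size_t len2)
{
    size_t pos1 = f->size;
    size_t pos2 = f->size;

    write_at(f, pos1, len1);
    write_at(f, pos2, len2);
    return f->size;
}

/* Exclusive lock: the second writer sees the EOF left by the first. */
static size_t append_two_exclusive(struct sketch_file *f, size_t len1, size_t len2)
{
    size_t pos;

    pos = f->size;
    write_at(f, pos, len1);
    pos = f->size;
    write_at(f, pos, len2);
    return f->size;
}
```

Starting from a 100-byte file, two 10-byte appends leave it at 110 bytes in the shared model and at 120 bytes in the exclusive model.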

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-02-11 17:04:04 +01:00
Trond Myklebust cf5b4059ba NFSv4: Fix races between open and dentry revalidation
We want to make sure that we revalidate the dentry if and only if
we've done an OPEN by filename.
In order to avoid races with remote changes to the directory on the
server, we want to save the verifier before calling OPEN. The exception
is if the server returned a delegation with our OPEN, as we then
know that the filename can't have changed on the server.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@gmail.com>
Tested-by: Benjamin Coddington <bcodding@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-10 10:50:59 -05:00
Trond Myklebust a1147b8281 NFS: Fix up directory verifier races
In order to avoid having our dentry revalidation race with an update
of the directory on the server, we need to store the verifier before
the RPC calls to LOOKUP and READDIR.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@gmail.com>
Tested-by: Benjamin Coddington <bcodding@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-10 10:38:48 -05:00
Petr Pavlu 3f6166aaf1 cifs: fix mount option display for sec=krb5i
Fix display for sec=krb5i which was wrongly interleaved by cruid,
resulting in string "sec=krb5,cruid=<...>i" instead of
"sec=krb5i,cruid=<...>".

Fixes: 96281b9e46 ("smb3: for kerberos mounts display the credential uid used")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-02-10 08:32:30 -06:00
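
For the sec=krb5i display fix above, the intended ordering of the pieces can be sketched with a plain string formatter. `show_security()` is a hypothetical helper, not the cifs code that prints mount options; the point is only that the integrity suffix is emitted before the cruid field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: emit the security flavor completely, including
 * the optional "i" suffix, before appending the credential uid.
 */
static int show_security(char *buf, size_t len, bool integrity,
                         unsigned int cruid)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "sec=krb5%s,cruid=%u",
                    integrity ? "i" : "", cruid);
}
```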
Linus Torvalds 89a47dd1af Kbuild updates for v5.6 (2nd)
- fix randconfig to generate a sane .config
 
  - rename hostprogs-y / always to hostprogs / always-y, which are
    more natural syntax.
 
  - optimize scripts/kallsyms
 
  - fix yes2modconfig and mod2yesconfig
 
  - make multiple directory targets ('make foo/ bar/') work

Merge tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

 - fix randconfig to generate a sane .config

 - rename hostprogs-y / always to hostprogs / always-y, which are more
   natural syntax.

 - optimize scripts/kallsyms

 - fix yes2modconfig and mod2yesconfig

 - make multiple directory targets ('make foo/ bar/') work

* tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: make multiple directory targets work
  kconfig: Invalidate all symbols after changing to y or m.
  kallsyms: fix type of kallsyms_token_table[]
  scripts/kallsyms: change table to store (strcut sym_entry *)
  scripts/kallsyms: rename local variables in read_symbol()
  kbuild: rename hostprogs-y/always to hostprogs/always-y
  kbuild: fix the document to use extra-y for vmlinux.lds
  kconfig: fix broken dependency in randconfig-generated .config
2020-02-09 16:05:50 -08:00
Linus Torvalds 380a129eb2 fs: New zonefs file system
Zonefs is a very simple file system exposing each zone of a zoned block
 device as a file.
 
 Unlike a regular file system with native zoned block device support
 (e.g. f2fs or the on-going btrfs effort), zonefs does not hide the
 sequential write constraint of zoned block devices from the user. As a
 result, zonefs is not a POSIX compliant file system. Its goal is to
 simplify the implementation of zoned block devices support in
 applications by replacing raw block device file accesses with a richer
 file based API, avoiding relying on direct block device file ioctls
 which may be more obscure to developers.
 
 One example of this approach is the implementation of LSM
 (log-structured merge) tree structures (such as used in RocksDB and
 LevelDB) on zoned block devices by allowing SSTables to be stored in a
 zone file similarly to a regular file system rather than as a range of
 sectors of a zoned device. The introduction of the higher level
 construct "one file is one zone" can help reduce the amount of changes
 needed in the application while at the same time allowing the use of
 zoned block devices with various programming languages other than C.
 
 Zonefs IO management implementation uses the new iomap generic code.
 Zonefs has been successfully tested using a functional test suite
 (available with zonefs userland format tool on github) and a prototype
 implementation of LevelDB on top of zonefs.
 
 Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>

Merge tag 'zonefs-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull new zonefs file system from Damien Le Moal:
 "Zonefs is a very simple file system exposing each zone of a zoned
  block device as a file.

  Unlike a regular file system with native zoned block device support
  (e.g. f2fs or the on-going btrfs effort), zonefs does not hide the
  sequential write constraint of zoned block devices from the user. As a
  result, zonefs is not a POSIX compliant file system. Its goal is to
  simplify the implementation of zoned block devices support in
  applications by replacing raw block device file accesses with a richer
  file based API, avoiding relying on direct block device file ioctls
  which may be more obscure to developers.

  One example of this approach is the implementation of LSM
  (log-structured merge) tree structures (such as used in RocksDB and
  LevelDB) on zoned block devices by allowing SSTables to be stored in a
  zone file similarly to a regular file system rather than as a range of
  sectors of a zoned device. The introduction of the higher level
  construct "one file is one zone" can help reduce the amount of
  changes needed in the application while at the same time allowing the
  use of zoned block devices with various programming languages other
  than C.

  Zonefs IO management implementation uses the new iomap generic code.
  Zonefs has been successfully tested using a functional test suite
  (available with zonefs userland format tool on github) and a prototype
  implementation of LevelDB on top of zonefs"

* tag 'zonefs-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: Add documentation
  fs: New zonefs file system
2020-02-09 15:51:46 -08:00
Linus Torvalds d1ea35f4cd 13 cifs/smb3 patches most from testing at the SMB3 plugfest this week

Merge tag '5.6-rc-smb3-plugfest-patches' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "13 cifs/smb3 patches, most from testing at the SMB3 plugfest this week:

   - Important fix for multichannel and for modefromsid mounts.

   - Two reconnect fixes

   - Addition of SMB3 change notify support

   - Backup tools fix

   - A few additional minor debug improvements (tracepoints and
     additional logging found useful during testing this week)"

* tag '5.6-rc-smb3-plugfest-patches' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: Add defines for new information level, FileIdInformation
  smb3: print warning once if posix context returned on open
  smb3: add one more dynamic tracepoint missing from strict fsync path
  cifs: fix mode bits from dir listing when mounted with modefromsid
  cifs: fix channel signing
  cifs: add SMB3 change notification support
  cifs: make multichannel warning more visible
  cifs: fix soft mounts hanging in the reconnect code
  cifs: Add tracepoints for errors on flush or fsync
  cifs: log warning message (once) if out of disk space
  cifs: fail i/o on soft mounts if sessionsetup errors out
  smb3: fix problem with null cifs super block with previous patch
  SMB3: Backup intent flag missing from some more ops
2020-02-09 13:27:17 -08:00
Linus Torvalds 5586c3c1e0 Merge branch 'work.vboxsf' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vboxfs from Al Viro:
 "This is the VirtualBox guest shared folder support by Hans de Goede,
  with fixups for fs_parse folded in to avoid bisection hazards from
  those API changes..."

* 'work.vboxsf' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Add VirtualBox guest shared folder (vboxsf) support
2020-02-09 12:41:00 -08:00
Jens Axboe b537916ca5 io_uring: retain sockaddr_storage across send/recvmsg async punt
Jonas reports that he sometimes sees -97/-22 error returns from
sendmsg, if it gets punted async. This is due to not retaining the
sockaddr_storage between calls. Include that in the state we copy when
going async.
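
The idea of carrying the address along with the deferred request can be sketched as follows. The structures are hypothetical stand-ins (`struct sketch_msg`, `struct sketch_async_msg`), and the 128-byte buffer is an assumption standing in for `sockaddr_storage`.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for a message header that points at a caller-owned address. */
struct sketch_msg {
    const void *name;
    size_t namelen;
};

/* State kept for the deferred request, including its own address copy. */
struct sketch_async_msg {
    struct sketch_msg msg;
    unsigned char addr[128]; /* assumed size, stand-in for sockaddr_storage */
};

/* Copy the address and repoint the header at the private copy. */
static void save_msg_for_async(struct sketch_async_msg *dst,
                               const struct sketch_msg *src)
{
    assert(dst != NULL && src != NULL);
    assert(src->namelen <= sizeof(dst->addr));
    if (src->name != NULL)
        memcpy(dst->addr, src->name, src->namelen);
    dst->msg.name = src->name ? dst->addr : NULL;
    dst->msg.namelen = src->namelen;
}
```

Without the copy, the deferred request would still point at memory that may be gone by the time it runs, which is consistent with the sporadic errors described above.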

Cc: stable@vger.kernel.org # 5.3+
Reported-by: Jonas Bonn <jonas@norrbonn.se>
Tested-by: Jonas Bonn <jonas@norrbonn.se>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-09 11:32:10 -07:00
Jens Axboe 6ab231448f io_uring: cancel pending async work if task exits
Normally we cancel all work we track, but for untracked work we could
leave the async worker behind until that work completes. This is totally
fine, but does leave resources pending after the task is gone until that
work completes.

Cancel work that this task queued up when it goes away.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-09 09:55:38 -07:00
Jens Axboe 36282881a7 io-wq: add io_wq_cancel_pid() to cancel based on a specific pid
Add a helper that allows the caller to cancel work based on what mm
it belongs to. This allows io_uring to cancel work from a given
task or thread when it exits.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-09 09:55:38 -07:00
Jens Axboe 00bcda13dc io-wq: make io_wqe_cancel_work() take a match handler
We want to use the cancel functionality for canceling based on not
just the work itself. Instead of matching on the work address
manually, allow a match handler to tell us if we found the right work
item or not.
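
A match handler of this kind can be sketched with a function pointer. The names below (`struct sketch_work`, `work_match_fn`, `cancel_matching()`) are hypothetical and the sketch cancels every matching item in an array, which may differ from what the io-wq code does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_work {
    int pid;
    bool cancelled;
};

/* Callback deciding whether a work item is the one being looked for. */
typedef bool (*work_match_fn)(const struct sketch_work *work, const void *data);

/* The old behavior expressed as a matcher: compare the work address. */
static bool match_by_address(const struct sketch_work *work, const void *data)
{
    return work == data;
}

/* A different criterion, e.g. the pid that queued the work. */
static bool match_by_pid(const struct sketch_work *work, const void *data)
{
    return work->pid == *(const int *)data;
}

/* Cancel every pending item accepted by the matcher; return the count. */
static size_t cancel_matching(struct sketch_work *items, size_t n,
                              work_match_fn match, const void *data)
{
    size_t cancelled = 0;

    assert(items != NULL && match != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!items[i].cancelled && match(&items[i], data)) {
            items[i].cancelled = true;
            cancelled++;
        }
    }
    return cancelled;
}
```

The cancel loop stays the same for every criterion; only the matcher and its `data` argument change.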

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-09 09:55:37 -07:00
Hans de Goede 0fd1695766 fs: Add VirtualBox guest shared folder (vboxsf) support
VirtualBox hosts can share folders with guests; this commit adds a
VFS driver implementing the Linux-guest side of this, allowing folders
exported by the host to be mounted under Linux.

This driver depends on the guest <-> host IPC functions exported by
the vboxguest driver.

Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-08 17:34:58 -05:00
Linus Torvalds b85080c106 compat-ioctl fix for v5.6
One patch in the compat-ioctl series broke 32-bit rootfs for multiple
 people testing on 64-bit kernels. Let's fix it in -rc1 before others
 run into the same issue.

Merge tag 'compat-ioctl-fix' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull compat-ioctl fix from Arnd Bergmann:
 "One patch in the compat-ioctl series broke 32-bit rootfs for multiple
  people testing on 64-bit kernels. Let's fix it in -rc1 before others
  run into the same issue"

* tag 'compat-ioctl-fix' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
  compat_ioctl: fix FIONREAD on devices
2020-02-08 13:44:41 -08:00
Linus Torvalds c9d35ee049 Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs file system parameter updates from Al Viro:
 "Saner fs_parser.c guts and data structures. The system-wide registry
  of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
  the horror switch() in fs_parse() that would have to grow another case
  every time something got added to that system-wide registry.

  New syntax types can be added by filesystems easily now, and their
  namespace is that of functions - not of system-wide enum members. IOW,
  they can be shared or kept private and if some turn out to be widely
  useful, we can make them common library helpers, etc., without having
  to do anything whatsoever to fs_parse() itself.

  And we already get that kind of requests - the thing that finally
  pushed me into doing that was "oh, and let's add one for timeouts -
  things like 15s or 2h". If some filesystem really wants that, let them
  do it. Without somebody having to play gatekeeper for the variants
  blessed by direct support in fs_parse(), TYVM.

  Quite a bit of boilerplate is gone. And IMO the data structures make a
  lot more sense now. -200LoC, while we are at it"
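
As a rough illustration of "syntax types are functions", the sketch below is a userspace analogue and not the kernel's fs_parser API: param_parser_fn, struct param_spec, parse_timeout() and parse_param() are hypothetical names invented for this example. It shows how a filesystem-private type such as the "15s or 2h" timeout mentioned above can be just another function referenced from a parameter table, with no change to the generic lookup code.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace analogue; none of these names are kernel API. */
typedef int (*param_parser_fn)(const char *value, long *result);

struct param_spec {
	const char *name;
	param_parser_fn parse;	/* the "syntax type" is just a function */
};

/* Parse "15s", "10m" or "2h" into seconds; returns 0 or a negative errno. */
static int parse_timeout(const char *value, long *result)
{
	char *end;
	long n;

	if (!value || !*value)
		return -EINVAL;
	n = strtol(value, &end, 10);
	/* Require digits followed by exactly one unit character. */
	if (end == value || n < 0 || end[0] == '\0' || end[1] != '\0')
		return -EINVAL;
	if (n > LONG_MAX / 3600)
		return -ERANGE;
	switch (end[0]) {
	case 's': *result = n; return 0;
	case 'm': *result = n * 60; return 0;
	case 'h': *result = n * 3600; return 0;
	default:  return -EINVAL;
	}
}

/* A filesystem-private table, terminated by a NULL name. */
static const struct param_spec example_params[] = {
	{ "timeout", parse_timeout },
	{ NULL, NULL }
};

/* Generic lookup: it never needs to know what parse_timeout() does. */
static int parse_param(const struct param_spec *specs, const char *name,
		       const char *value, long *result)
{
	for (; specs->name; specs++)
		if (strcmp(specs->name, name) == 0)
			return specs->parse(value, result);
	return -ENOENT;
}
```

Adding another private type in this model means writing one more function and one more table row; parse_param() stays untouched.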

* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
  tmpfs: switch to use of invalfc()
  cgroup1: switch to use of errorfc() et.al.
  procfs: switch to use of invalfc()
  hugetlbfs: switch to use of invalfc()
  cramfs: switch to use of errorfc() et.al.
  gfs2: switch to use of errorfc() et.al.
  fuse: switch to use errorfc() et.al.
  ceph: use errorfc() and friends instead of spelling the prefix out
  prefix-handling analogues of errorf() and friends
  turn fs_param_is_... into functions
  fs_parse: handle optional arguments sanely
  fs_parse: fold fs_parameter_desc/fs_parameter_spec
  fs_parser: remove fs_parameter_description name field
  add prefix to fs_context->log
  ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
  new primitive: __fs_parse()
  switch rbd and libceph to p_log-based primitives
  struct p_log, variants of warnf() et.al. taking that one instead
  teach logfc() to handle prefices, give it saner calling conventions
  get rid of cg_invalf()
  ...
2020-02-08 13:26:41 -08:00
Linus Torvalds 236f453294 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:

 - bmap series from cmaiolino

 - getting rid of convolutions in copy_mount_options() (use a couple of
   copy_from_user() instead of the __get_user() crap)

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  saner copy_mount_options()
  fibmap: Reject negative block numbers
  fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
  ecryptfs: drop direct calls to ->bmap
  cachefiles: drop direct usage of ->bmap method.
  fs: Enable bmap() function to properly return errors
2020-02-08 13:04:49 -08:00
Pavel Begunkov 0bdbdd08a8 io_uring: fix openat/statx's filename leak
As in the previous patch, make openat*_prep() and statx_prep() handle
double preparation to avoid resource leakage.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:07:00 -07:00
Pavel Begunkov 5f798beaf3 io_uring: fix double prep iovec leak
Requests may be prepared multiple times with ->io allocated (i.e. async
prepared). Preparation functions don't handle it and forget about
previously allocated resources. This may happen in case of:
- spurious defer_check
- non-head (i.e. async prepared) request executed in sync (via nxt).

Make the handlers check whether they have already allocated resources, which
is true IFF REQ_F_NEED_CLEANUP is set.
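
The guard described above can be sketched in userspace roughly as follows. Only the flag name REQ_F_NEED_CLEANUP comes from the commit message; struct fake_req, fake_prep(), fake_cleanup() and the flag's numeric value are hypothetical stand-ins, not the io_uring implementation.

```c
#include <stdlib.h>

#define REQ_F_NEED_CLEANUP 0x1u	/* numeric value is illustrative only */

/* Hypothetical stand-in for a request with one allocated resource. */
struct fake_req {
	unsigned int flags;
	void *iov;		/* resource allocated during preparation */
};

/* Prepare a request; safe to call more than once. */
static int fake_prep(struct fake_req *req, size_t len)
{
	/* Already prepared: the resource is owned, do not allocate again. */
	if (req->flags & REQ_F_NEED_CLEANUP)
		return 0;

	req->iov = malloc(len);
	if (!req->iov)
		return -1;
	req->flags |= REQ_F_NEED_CLEANUP;
	return 0;
}

/* Release whatever preparation allocated, exactly once. */
static void fake_cleanup(struct fake_req *req)
{
	if (req->flags & REQ_F_NEED_CLEANUP) {
		free(req->iov);
		req->iov = NULL;
		req->flags &= ~REQ_F_NEED_CLEANUP;
	}
}
```

Without the early return, a second fake_prep() call would overwrite req->iov and leak the first allocation, which is the class of bug the commit describes.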

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:07:00 -07:00
Pavel Begunkov a93b33312f io_uring: fix async close() with f_op->flush()
First, io_close() misses filp_close() and io_cqring_add_event() when
f_op->flush is defined. That's because in this case it calls
io_queue_async_work() itself without grabbing files, so the corresponding
chunk in io_close_finish() won't be executed.

Second, when submitted through io_wq_submit_work(), it will do
filp_close() and *_add_event() twice: first inline in io_close(),
and then again in the call to io_close_finish() from io_close().
The second one will also fire, because it was submitted async through
the generic path, and so has grabbed files.

And the last nice thing is to remove this weird pilgrimage with checking
work/old_work and casting it to nxt. Just use a helper instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:07:00 -07:00
Jens Axboe 0b5faf6ba7 io_uring: allow AT_FDCWD for non-file openat/openat2/statx
Don't just check for dirfd == -1, we should allow AT_FDCWD as well for
relative lookups.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:07:00 -07:00
Jens Axboe ff002b3018 io_uring: grab ->fs as part of async preparation
This passes it in to io-wq, so it assumes the right fs_struct when
executing async work that may need to do lookups.

Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:07:00 -07:00
Jens Axboe 9392a27d88 io-wq: add support for inheriting ->fs
Some work items need this for relative path lookup, make it available
like the other inherited credentials/mm/etc.

Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Jens Axboe faac996ccd io_uring: retry raw bdev writes if we hit -EOPNOTSUPP
For non-blocking issue, we set IOCB_NOWAIT in the kiocb. However, on a
raw block device, this yields an -EOPNOTSUPP return, as non-blocking
writes aren't supported. Turn this -EOPNOTSUPP into -EAGAIN, so we retry
from blocking context with IOCB_NOWAIT cleared.
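
The error translation described above amounts to something like the following sketch. fixup_nowait_result() is a hypothetical helper written for illustration; the actual change lives inside the io_uring write path and is not reproduced here.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: decide what a non-blocking attempt should report. */
static long fixup_nowait_result(long ret, bool nowait)
{
	/*
	 * A raw block device rejects non-blocking writes with -EOPNOTSUPP.
	 * Report -EAGAIN instead, so the caller retries from a context
	 * that is allowed to block.
	 */
	if (nowait && ret == -EOPNOTSUPP)
		return -EAGAIN;
	return ret;
}
```

The translation is applied only to the non-blocking attempt, so a genuine -EOPNOTSUPP from the blocking retry still reaches the caller.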

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Pavel Begunkov 8fef80bf56 io_uring: add cleanup for openat()/statx()
openat() and statx() may have allocated ->open.filename, which should
be put. Add cleanup handlers for them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Pavel Begunkov 99bc4c3853 io_uring: fix iovec leaks
Allocated iovec is freed only in io_{read,write,send,recv}(), and is
leaked if an error occurred. There are plenty of such cases:
- cancellation of non-head requests
- failure to grab files in __io_queue_sqe()
- setting REQ_F_NOWAIT and returning in __io_queue_sqe()

Add REQ_F_NEED_CLEANUP, which will force such requests with custom
allocated resources to go through cleanup handlers on put.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Pavel Begunkov e96e977992 io_uring: remove unused struct io_async_open
struct io_async_open is unused, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Stefano Garzarella 63e5d81f72 io_uring: flush overflowed CQ events in the io_uring_poll()
In io_uring_poll() we must flush overflowed CQ events before
checking whether any CQ events are available, to avoid missing events.

We call io_cqring_events(), which checks and flushes any overflow
and returns the number of CQ events available.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Jens Axboe cf3040ca55 io_uring: statx/openat/openat2 don't support fixed files
All of these opcodes take a directory file descriptor. We can't easily
support fixed files for these operations, and the use case for that
probably isn't all that clear (or sensible) anyway.

Disable IOSQE_FIXED_FILE for these operations.

Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:33 -07:00
Linus Torvalds 995933305e Merge branch 'pipe-exclusive-wakeup'
Merge thundering herd avoidance on pipe IO.

This would have been applied for 5.5 already, but got delayed because of
a user-space race condition in the GNU make jobserver code.  Now that
there's a new GNU make 4.3 release, and most distributions seem to have
at least applied the (almost three year old) fix for the problem, let's
see if people notice.

And it might have been just bad random timing luck on my machine.

If you do hit the race condition, things will still work, but the
symptom is that you don't get nearly the expected parallelism when using
"make -j<N>".

The jobserver bug can definitely happen without this patch too, but
seems to be easier to trigger when we no longer wake up pipe waiters
unnecessarily.

* pipe-exclusive-wakeup:
  pipe: use exclusive waits when reading or writing
2020-02-08 11:44:02 -08:00
Linus Torvalds 0ddad21d3e pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).

While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.

In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.

This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
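
The token pattern described above can be sketched as two small helpers. token_acquire() and token_release() are hypothetical names used for illustration; GNU make's actual jobserver code is more involved and is not reproduced here.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helpers; GNU make's real jobserver code differs. */

/* Block until a token is available, then consume it. Returns 0 or -1. */
static int token_acquire(int rfd)
{
	char tok;
	ssize_t n;

	do {
		n = read(rfd, &tok, 1);
	} while (n < 0 && errno == EINTR);
	return n == 1 ? 0 : -1;
}

/* Put a token back into the pool. Returns 0 or -1. */
static int token_release(int wfd)
{
	char tok = '+';
	ssize_t n;

	do {
		n = write(wfd, &tok, 1);
	} while (n < 0 && errno == EINTR);
	return n == 1 ? 0 : -1;
}
```

Each worker would call token_acquire() before starting a job and token_release() when it finishes. With many workers blocked in read(), every release is a point where the kernel has to decide how many sleepers to wake.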

A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):

    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd[2], counters[2];

        pipe(fd);
        counters[0] = 0;
        counters[1] = -1;
        write(fd[1], counters, sizeof(counters));

        /* 64 processes */
        fork(); fork(); fork(); fork(); fork(); fork();

        do {
                int i;
                read(fd[0], &i, sizeof(i));
                if (i < 0)
                        continue;
                counters[0] = i+1;
                write(fd[1], counters, (1+(i & 1)) *sizeof(int));
        } while (counters[0] < 1000000);
        return 0;
    }

and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.

But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows

        231,852.37 msec task-clock                #   15.857 CPUs utilized
        11,250,961      context-switches          #    0.049 M/sec
           616,304      cpu-migrations            #    0.003 M/sec
             1,648      page-faults               #    0.007 K/sec
 1,097,903,998,514      cycles                    #    4.735 GHz
   120,781,778,352      instructions              #    0.11  insn per cycle
    27,997,056,043      branches                  #  120.754 M/sec
       283,581,233      branch-misses             #    1.01% of all branches

      14.621273891 seconds time elapsed

       0.018243000 seconds user
       3.611468000 seconds sys

before this commit.

After this commit, I get

          5,229.55 msec task-clock                #    3.072 CPUs utilized
         1,212,233      context-switches          #    0.232 M/sec
           103,951      cpu-migrations            #    0.020 M/sec
             1,328      page-faults               #    0.254 K/sec
    21,307,456,166      cycles                    #    4.074 GHz
    12,947,819,999      instructions              #    0.61  insn per cycle
     2,881,985,678      branches                  #  551.096 M/sec
        64,267,015      branch-misses             #    2.23% of all branches

       1.702148350 seconds time elapsed

       0.004868000 seconds user
       0.110786000 seconds sys

instead. Much better.

[ Note! This kernel improvement seems to be very good at triggering a
  race condition in the make jobserver (in GNU make 4.2.1) for me. It's
  a long known bug that was fixed back in June 2017 by GNU make commit
  b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
  avoid hangs.").

  But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
  so a number of distributions may still have the buggy version. Some
  have backported the fix to their 4.2.1 release, though, and even
  without the fix it's quite timing-dependent whether the bug actually
  is hit. ]

Josh Triplett says:
 "I've been hammering on your pipe fix patch (switching to exclusive
  wait queues) for a month or so, on several different systems, and I've
  run into no issues with it. The patch *substantially* improves
  parallel build times on large (~100 CPU) systems, both with parallel
  make and with other things that use make's pipe-based jobserver.

  All current distributions (including stable and long-term stable
  distributions) have versions of GNU make that no longer have the
  jobserver bug"

Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-08 11:39:19 -08:00
Arnd Bergmann 0a061743af compat_ioctl: fix FIONREAD on devices
My final cleanup patch for sys_compat_ioctl() introduced a regression on
the FIONREAD ioctl command, which is used for both regular and special
files, but only works on regular files after my patch, as I had missed
the warning that Al Viro put into a comment right above it.

Change it back so it can work on any file again by moving the implementation
to do_vfs_ioctl() instead.
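
For context, this is roughly how userspace issues FIONREAD on something that is not a regular file. bytes_pending() is a hypothetical helper, and a pipe merely stands in for "special file" here; the sketch does not reproduce the regression itself, which affected 32-bit userspace on 64-bit kernels.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Return the number of bytes readable from fd, or -1 on error. */
static int bytes_pending(int fd)
{
	int n = 0;

	if (ioctl(fd, FIONREAD, &n) < 0)
		return -1;
	return n;
}
```

Programs commonly use this call on pipes, sockets and terminals to size a buffer before reading, which is why it needs to work on more than regular files.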

Fixes: 77b9040195 ("compat_ioctl: simplify the implementation")
Reported-and-tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Reported-and-tested-by: youling257 <youling257@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-02-08 18:02:54 +01:00
Linus Torvalds f757165705 fuse fixes for 5.6-rc1

Merge tag 'fuse-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fixes from Miklos Szeredi:

 - Fix a regression introduced in v5.1 that triggers WARNINGs for some
   fuse filesystems

 - Fix an xfstest failure

 - Allow overlayfs to be used on top of fuse/virtiofs

 - Code and documentation cleanups

* tag 'fuse-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: use true,false for bool variable
  Documentation: filesystems: convert fuse to RST
  fuse: Support RENAME_WHITEOUT flag
  fuse: don't overflow LLONG_MAX with end offset
  fix up iter on short count in fuse_direct_io()
2020-02-07 17:59:07 -08:00
Linus Torvalds 175787e011 Changes in gfs2:
 - Fix a bug in Abhi Das's journal head lookup improvements that can cause a
   valid journal to be rejected.
 - Fix an O_SYNC write handling bug reported by Christoph Hellwig.

Merge tag 'gfs2-for-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:

 - Fix a bug in Abhi Das's journal head lookup improvements that can
   cause a valid journal to be rejected.

 - Fix an O_SYNC write handling bug reported by Christoph Hellwig.

* tag 'gfs2-for-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: fix O_SYNC write handling
  gfs2: move setting current->backing_dev_info
  gfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0
2020-02-07 17:54:46 -08:00
Linus Torvalds 60ea27e936 orangefs: a debugfs fix
Vasily Averin noticed that "if seq_file .next function does not change
 position index, read after some lseek can generate unexpected output."
 and sent in this fix.

Merge tag 'for-linus-5.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs fix from Mike Marshall:
 "Debugfs fix for orangefs.

  Vasily Averin noticed that 'if seq_file .next function does not change
  position index, read after some lseek can generate unexpected output'
  and sent in this fix"

* tag 'for-linus-5.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  help_next should increase position index
2020-02-07 17:52:38 -08:00
Linus Torvalds 08dffcc7d9 Highlights:
 - Server-to-server copy code from Olga. To use it, client and both
   servers must have support, the target server must be able to access
   the source server over NFSv4.2, and the target server must have the
   inter_copy_offload_enable module parameter set.
 - Improvements and bugfixes for the new filehandle cache, especially
   in the container case, from Trond.
 - Also from Trond, better reporting of write errors.
 - Y2038 work from Arnd.

Merge tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Highlights:

   - Server-to-server copy code from Olga.

     To use it, client and both servers must have support, the target
     server must be able to access the source server over NFSv4.2, and
     the target server must have the inter_copy_offload_enable module
     parameter set.

   - Improvements and bugfixes for the new filehandle cache, especially
     in the container case, from Trond

   - Also from Trond, better reporting of write errors.

   - Y2038 work from Arnd"

* tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux: (55 commits)
  sunrpc: expiry_time should be seconds not timeval
  nfsd: make nfsd_filecache_wq variable static
  nfsd4: fix double free in nfsd4_do_async_copy()
  nfsd: convert file cache to use over/underflow safe refcount
  nfsd: Define the file access mode enum for tracing
  nfsd: Fix a perf warning
  nfsd: Ensure sampling of the write verifier is atomic with the write
  nfsd: Ensure sampling of the commit verifier is atomic with the commit
  sunrpc: clean up cache entry add/remove from hashtable
  sunrpc: Fix potential leaks in sunrpc_cache_unhash()
  nfsd: Ensure exclusion between CLONE and WRITE errors
  nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()
  nfsd: Update the boot verifier on stable writes too.
  nfsd: Fix stable writes
  nfsd: Allow nfsd_vfs_write() to take the nfsd_file as an argument
  nfsd: Fix a soft lockup race in nfsd_file_mark_find_or_create()
  nfsd: Reduce the number of calls to nfsd_file_gc()
  nfsd: Schedule the laundrette regularly irrespective of file errors
  nfsd: Remove unused constant NFSD_FILE_LRU_RESCAN
  nfsd: Containerise filecache laundrette
  ...
2020-02-07 17:50:21 -08:00
Linus Torvalds f43574d0ac NFS Client Updates for Linux 5.6
Stable bugfixes:
 - Fix memory leaks and corruption in readdir # v2.6.37+
 - Directory page cache needs to be locked when read # v2.6.37+
 
 New features:
 - Convert NFS to use the new mount API
 - Add "softreval" mount option to let clients use cache if server goes down
 - Add a config option to compile without UDP support
 - Limit the number of inactive delegations the client can cache at once
 - Improved readdir concurrency using iterate_shared()
 
 Other bugfixes and cleanups:
 - More 64-bit time conversions
 - Add additional diagnostic tracepoints
 - Check for holes in swapfiles, and add dependency on CONFIG_SWAP
 - Various xprtrdma cleanups to prepare for 5.7's changes
 - Several fixes for NFS writeback and commit handling
 - Fix acls over krb5i/krb5p mounts
 - Recover from premature loss of openstateids
 - Fix NFS v3 chacl and chmod bug
 - Compare creds using cred_fscmp()
 - Use kmemdup_nul() in more places
 - Optimize readdir cache page invalidation
 - Lease renewal and recovery fixes

Merge tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - Fix memory leaks and corruption in readdir # v2.6.37+
   - Directory page cache needs to be locked when read # v2.6.37+

  New features:
   - Convert NFS to use the new mount API
   - Add "softreval" mount option to let clients use cache if server goes down
   - Add a config option to compile without UDP support
   - Limit the number of inactive delegations the client can cache at once
   - Improved readdir concurrency using iterate_shared()

  Other bugfixes and cleanups:
   - More 64-bit time conversions
   - Add additional diagnostic tracepoints
   - Check for holes in swapfiles, and add dependency on CONFIG_SWAP
   - Various xprtrdma cleanups to prepare for 5.7's changes
   - Several fixes for NFS writeback and commit handling
   - Fix acls over krb5i/krb5p mounts
   - Recover from premature loss of openstateids
   - Fix NFS v3 chacl and chmod bug
   - Compare creds using cred_fscmp()
   - Use kmemdup_nul() in more places
   - Optimize readdir cache page invalidation
   - Lease renewal and recovery fixes"

* tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (93 commits)
  NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals
  NFSv4: try lease recovery on NFS4ERR_EXPIRED
  NFS: Fix memory leaks
  nfs: optimise readdir cache page invalidation
  NFS: Switch readdir to using iterate_shared()
  NFS: Use kmemdup_nul() in nfs_readdir_make_qstr()
  NFS: Directory page cache pages need to be locked when read
  NFS: Fix memory leaks and corruption in readdir
  SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id()
  NFS: Replace various occurrences of kstrndup() with kmemdup_nul()
  NFSv4: Limit the total number of cached delegations
  NFSv4: Add accounting for the number of active delegations held
  NFSv4: Try to return the delegation immediately when marked for return on close
  NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
  NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
  NFS: nfs_find_open_context() should use cred_fscmp()
  NFS: nfs_access_get_cached_rcu() should use cred_fscmp()
  NFSv4: pnfs_roc() must use cred_fscmp() to compare creds
  NFS: remove unused macros
  nfs: Return EINVAL rather than ERANGE for mount parse errors
  ...
2020-02-07 17:39:56 -08:00
Al Viro bf45f7fcc4 procfs: switch to use of invalfc()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:42 -05:00
Al Viro b5db30cfb9 hugetlbfs: switch to use of invalfc()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:42 -05:00
Al Viro e1ee7d8511 cramfs: switch to use of errorfc() et.al.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:41 -05:00
Al Viro 77cb271e6a gfs2: switch to use of errorfc() et.al.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:41 -05:00
Al Viro 2e28c49ea6 fuse: switch to use errorfc() et.al.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:40 -05:00
Al Viro d53d0f7461 ceph: use errorfc() and friends instead of spelling the prefix out
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:39 -05:00
Al Viro 328de5287b turn fs_param_is_... into functions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:38 -05:00
Al Viro 48ce73b1be fs_parse: handle optional arguments sanely
Don't bother with "mixed" options that would allow both the
form with and without argument (i.e. both -o foo and -o foo=bar).
Rather than trying to shove both into a single fs_parameter_spec,
allow having with-argument and no-argument specs with the same
name and teach fs_parse to handle that.

There are very few options of that sort, and they are actually
easier to handle that way - callers end up with less postprocessing.
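
A minimal userspace sketch of the idea, assuming a simplified table layout: struct opt_spec, example_specs and opt_lookup() are hypothetical names, not the kernel's fs_parameter_spec or fs_parse(). The same option name appears twice, once without and once with an argument, and the lookup picks the entry whose form matches what the user supplied.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace analogue of a parameter table. */
struct opt_spec {
	const char *name;
	bool takes_arg;
	int opt;		/* value handed back to the caller */
};

enum { OPT_FOO_FLAG = 1, OPT_FOO_VALUE = 2 };

/* Same name twice: once without an argument, once with. */
static const struct opt_spec example_specs[] = {
	{ "foo", false, OPT_FOO_FLAG },
	{ "foo", true,  OPT_FOO_VALUE },
	{ NULL,  false, 0 }
};

/* Pick the spec whose name and argument form both match; -1 if none. */
static int opt_lookup(const struct opt_spec *specs, const char *name,
		      const char *value)
{
	bool has_arg = value != NULL;

	for (; specs->name; specs++)
		if (strcmp(specs->name, name) == 0 &&
		    specs->takes_arg == has_arg)
			return specs->opt;
	return -1;
}
```

The caller receives a distinct result for "-o foo" and "-o foo=bar", so it does not have to inspect the value afterwards to work out which form was used.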

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Al Viro d7167b1499 fs_parse: fold fs_parameter_desc/fs_parameter_spec
The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Eric Sandeen 96cafb9ccb fs_parser: remove fs_parameter_description name field
Unused now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:36 -05:00
Al Viro cc3c0b533a add prefix to fs_context->log
... turning it into struct p_log embedded into fs_context.  Initialize
the prefix with fs_type->name, turning fs_parse() into a trivial
inline wrapper for __fs_parse().

This makes fs_parameter_description->name completely unused.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:35 -05:00
Al Viro c80c98f0dc ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
... and now errorf() et.al. are never called with NULL fs_context,
so we can get rid of conditional in those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:34 -05:00
Al Viro 7f5d38141e new primitive: __fs_parse()
fs_parse() analogue taking p_log instead of fs_context.
fs_parse() turned into a wrapper, callers in ceph_common and rbd
switched to __fs_parse().

As a result, fs_parse() never gets a NULL fs_context and neither
do the fs_context-based logging primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:34 -05:00
Al Viro 9f09f649ca teach logfc() to handle prefices, give it saner calling conventions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:32 -05:00
Al Viro aa1918f949 get rid of fs_value_is_filename_empty
Its behaviour is identical to that of fs_value_is_filename.
It makes no sense, anyway - LOOKUP_EMPTY affects nothing
whatsoever once the pathname has been imported from userland.
And both fs_value_is_filename and fs_value_is_filename_empty
carry an already imported pathname.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:30 -05:00
Al Viro 34264ae3fa don't bother with explicit length argument for __lookup_constant()
Have the arrays of constant_table self-terminated (by NULL ->name
in the final entry).  Simplifies lookup_constant() and allows us to
reuse the search for enum params as well.
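
A self-terminated table of this kind can be searched without a length argument, as in the userspace sketch below. struct constant_entry, bool_names and lookup_entry() are hypothetical names chosen for illustration and are modelled only on the description above, not copied from the kernel source.

```c
#include <stddef.h>
#include <string.h>

/* Userspace sketch; layout modelled on the description above. */
struct constant_entry {
	const char *name;
	int value;
};

/* The final entry has a NULL name, so no length argument is needed. */
static const struct constant_entry bool_names[] = {
	{ "no",  0 },
	{ "yes", 1 },
	{ NULL,  0 }
};

/* Walk the table until the NULL-name terminator; fall back to not_found. */
static int lookup_entry(const struct constant_entry *tbl, const char *name,
			int not_found)
{
	for (; tbl->name; tbl++)
		if (strcmp(name, tbl->name) == 0)
			return tbl->value;
	return not_found;
}
```

Because the terminator travels with the table, any code that holds a pointer to the array can search it, which is what makes the same loop reusable for other name-to-value tables.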

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:47:52 -05:00
Chen Zhou 50d0def966 nfsd: make nfsd_filecache_wq variable static
Fix sparse warning:

fs/nfsd/filecache.c:55:25: warning:
	symbol 'nfsd_filecache_wq' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-07 13:30:41 -05:00
Damien Le Moal 8dcc1a9d90 fs: New zonefs file system
zonefs is a very simple file system exposing each zone of a zoned block
device as a file. Unlike a regular file system with zoned block device
support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices from the user. Files representing
sequential write zones of the device must be written sequentially
starting from the end of the file (append only writes).

As such, zonefs is in essence closer to a raw block device access
interface than to a full featured POSIX file system. The goal of zonefs
is to simplify the implementation of zoned block device support in
applications by replacing raw block device file accesses with a richer
file API, avoiding relying on direct block device file ioctls which may
be more obscure to developers. One example of this approach is the
implementation of LSM (log-structured merge) tree structures (such as
used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
to be stored in a zone file similarly to a regular file system rather
than as a range of sectors of a zoned device. The introduction of the
higher level construct "one file is one zone" can help reduce the
amount of changes needed in the application as well as introducing
support for different application programming languages.

Zonefs on-disk metadata is reduced to an immutable super block to
persistently store a magic number and optional feature flags and
values. On mount, zonefs uses blkdev_report_zones() to obtain the device
zone configuration and populates the mount point with a static file tree
solely based on this information. E.g. file sizes come from the device
zone type and write pointer offset managed by the device itself.

The zone files created on mount have the following characteristics.
1) Files representing zones of the same type are grouped together
   under a common sub-directory:
     * For conventional zones, the sub-directory "cnv" is used.
     * For sequential write zones, the sub-directory "seq" is used.
   These two directories are the only directories that exist in zonefs.
   Users cannot create other directories and cannot rename or delete
   the "cnv" and "seq" sub-directories.
2) The name of zone files is the number of the file within the zone
   type sub-directory, in order of increasing zone start sector.
3) The size of conventional zone files is fixed to the device zone size.
   Conventional zone files cannot be truncated.
4) The size of sequential zone files represents the file's zone write
   pointer position relative to the zone start sector. Truncating these
   files is allowed only down to 0, in which case, the zone is reset to
   rewind the zone write pointer position to the start of the zone, or
   up to the zone size, in which case the file's zone is transitioned
   to the FULL state (finish zone operation).
5) Read and write operations beyond the file zone size are not allowed.
   Any access exceeding the zone size fails with the -EFBIG error.
6) Creating, deleting, renaming or modifying any attribute of files and
   sub-directories is not allowed.
7) There are no restrictions on the type of read and write operations
   that can be issued to conventional zone files. Buffered, direct and
   mmap read & write operations are accepted. For sequential zone files,
   there are no restrictions on read operations, but all write
   operations must be direct IO append writes. mmap write of sequential
   files is not allowed.
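
The truncation and size rules in items 4) and 5) can be modelled with a
short sketch. This is an illustrative Python model, not the zonefs code:
the names check_seq_truncate and check_access are hypothetical, the 256 MiB
zone size is an assumed value, and the errno chosen for a rejected truncate
size is an assumption (the text above only names -EFBIG for accesses beyond
the zone size).

```python
import errno

ZONE_SIZE = 256 * 1024 * 1024  # assumed zone size in bytes (256 MiB)


def check_seq_truncate(new_size, zone_size=ZONE_SIZE):
    """Model of rule 4: a sequential zone file may only be truncated
    to 0 (zone reset) or to the zone size (zone finish)."""
    if new_size == 0:
        return "reset"
    if new_size == zone_size:
        return "finish"
    # The errno used here is an assumption made for this sketch.
    raise OSError(errno.EINVAL, "unsupported truncate size")


def check_access(offset, length, zone_size=ZONE_SIZE):
    """Model of rule 5: accesses beyond the zone size fail with EFBIG."""
    if offset + length > zone_size:
        raise OSError(errno.EFBIG, "access exceeds zone size")
    return length
```

In this model, truncating to any size other than 0 or the zone size is
rejected, and a write ending exactly at the zone size is still accepted.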

Several optional features of zonefs can be enabled at format time.
* Conventional zone aggregation: ranges of contiguous conventional
  zones can be aggregated into a single larger file instead of the
  default one file per zone.
* File ownership: The owner UID and GID of zone files are by default 0
  (root) but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be
  changed.

The mkzonefs tool is used to format zoned block devices for use with
zonefs. This tool is available on GitHub at:

git@github.com:damien-lemoal/zonefs-tools.git.

zonefs-tools also includes a test suite which can be run against any
zoned block device, including a null_blk block device created in zoned
mode.

Example: the following formats a 15TB host-managed SMR HDD with 256 MB
zones with the conventional zones aggregation feature enabled.

$ sudo mkzonefs -o aggr_cnv /dev/sdX
$ sudo mount -t zonefs /dev/sdX /mnt
$ ls -l /mnt/
total 0
dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of each zone file sub-directory indicates the number of files
existing for each type of zone. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a
single file).

$ ls -l /mnt/cnv
total 137101312
-rw-r----- 1 root root 140391743488 Nov 25 13:23 0

This aggregated conventional zone file can be used as a regular file.

$ sudo mkfs.ext4 /mnt/cnv/0
$ sudo mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory grouping files for sequential write zones has
in this example 55356 zones.

$ ls -lv /mnt/seq
total 14511243264
-rw-r----- 1 root root 0 Nov 25 13:23 0
-rw-r----- 1 root root 0 Nov 25 13:23 1
-rw-r----- 1 root root 0 Nov 25 13:23 2
...
-rw-r----- 1 root root 0 Nov 25 13:23 55354
-rw-r----- 1 root root 0 Nov 25 13:23 55355

For sequential write zone files, the file size changes as data is
appended at the end of the file, similarly to any regular file system.

$ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s

$ ls -l /mnt/seq/0
-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

The written file can be truncated to the zone size, preventing any
further write operation.

$ truncate -s 268435456 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size frees the file zone storage space and allows
restarting append-writes to the file.

$ truncate -s 0 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0

Since files are statically mapped to zones on the disk, the number of
blocks of a file as reported by stat() and fstat() indicates the size
of the file zone.

$ stat /mnt/seq/0
  File: /mnt/seq/0
  Size: 0       Blocks: 524288     IO Block: 4096   regular empty file
Device: 870h/2160d      Inode: 50431       Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/  root)
Access: 2019-11-25 13:23:57.048971997 +0900
Modify: 2019-11-25 13:52:25.553805765 +0900
Change: 2019-11-25 13:52:25.553805765 +0900
 Birth: -

The number of blocks of the file ("Blocks") in units of 512B blocks
gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
to the device zone size in this example. Of note is that the "IO block"
field always indicates the minimum IO size for writes and corresponds
to the device physical sector size.
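
The arithmetic above can be checked with a few lines of Python. This is
only a sketch; the helper name zone_size_from_blocks is hypothetical, and
the only fact it relies on is that stat() reports "Blocks" in 512-byte
units.

```python
SECTOR_SIZE = 512  # stat() reports st_blocks in 512-byte units


def zone_size_from_blocks(st_blocks):
    """Return the zone size in bytes implied by a stat() block count."""
    return st_blocks * SECTOR_SIZE


size = zone_size_from_blocks(524288)
print(size)                    # 268435456
print(size // (1024 * 1024))   # 256
```

The result matches the 268435456 byte size used with truncate earlier in
this example. On a mounted zonefs, the same helper could be fed
os.stat(path).st_blocks for a zone file.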

This code contains contributions from:
* Johannes Thumshirn <jthumshirn@suse.de>,
* Darrick J. Wong <darrick.wong@oracle.com>,
* Christoph Hellwig <hch@lst.de>,
* Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and
* Ting Yao <tingyao@hust.edu.cn>.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-02-07 14:39:38 +09:00
Al Viro 5eede62529 fold struct fs_parameter_enum into struct constant_table
no real difference now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 00:12:50 -05:00
Al Viro 2710c957a8 fs_parse: get rid of ->enums
Don't do a single array; attach them to fsparam_enum() entry
instead.  And don't bother trying to embed the names into those -
it actually loses memory, with no real speedup worth mentioning.

Simplifies validation as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 00:12:50 -05:00
Al Viro 0f89589a8c Pass consistent param->type to fs_parse()
As it is, vfs_parse_fs_string() makes "foo" and "foo=" indistinguishable;
both get fs_value_is_string for ->type and NULL for ->string.  To make
it even more unpleasant, that combination is impossible to produce with
fsconfig().

Much saner rules would be
        "foo"           => fs_value_is_flag, NULL
        "foo="          => fs_value_is_string, ""
        "foo=bar"       => fs_value_is_string, "bar"
All cases are distinguishable, all results are expressible by fsconfig(),
->has_value checks are much simpler that way (to the point of the field
being useless) and quite a few regressions go away (gfs2 has no business
accepting -o nodebug=, for example).
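
The three rules listed above can be illustrated with a small Python model.
It is a sketch of the mapping only, not of vfs_parse_fs_string(); the
function name classify_option is hypothetical, and the strings "flag" and
"string" stand in for fs_value_is_flag and fs_value_is_string.

```python
def classify_option(opt):
    """Split a mount option string into (key, type, value).

    "foo"     -> ("foo", "flag", None)
    "foo="    -> ("foo", "string", "")
    "foo=bar" -> ("foo", "string", "bar")
    """
    key, sep, value = opt.partition("=")
    if not sep:
        # No '=' at all: a bare flag with no value.
        return (key, "flag", None)
    # An '=' was present: always a string, possibly an empty one.
    return (key, "string", value)
```

With this mapping, "foo" and "foo=" no longer collapse into the same
result, which is the distinction the commit message asks for.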

Partially based upon patches from Miklos.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 00:10:29 -05:00
Steve French 51d92d69f7 smb3: Add defines for new information level, FileIdInformation
See MS-FSCC 2.4.43.  Valid to be queried from most
Windows servers (among others).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-02-06 17:32:24 -06:00
Steve French ab3459d8f0 smb3: print warning once if posix context returned on open
SMB3.1.1 POSIX Context processing is not complete yet - so print warning
(once) if server returns it on open.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-02-06 17:31:56 -06:00
Steve French 2391ca41b4 smb3: add one more dynamic tracepoint missing from strict fsync path
We didn't have a dynamic trace point for catching errors in
file_write_and_wait_range error cases in cifs_strict_fsync.

Since not all apps check for write behind errors, it can be
important for debugging to be able to trace these error
paths.

Suggested-and-reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-02-06 17:21:23 -06:00
Aurelien Aptel e3e056c351 cifs: fix mode bits from dir listing when mounted with modefromsid
When mounting with -o modefromsid, the mode bits are stored in an
ACE. Directory enumeration (e.g. ls -l /mnt) triggers an SMB Query Dir
which does not include ACEs in its response. The mode bits in this
case are silently set to a default value of 755 instead.

This patch marks the dentry created during the directory enumeration
as needing re-evaluation (i.e. additional Query Info with ACEs) so
that the mode bits can be properly extracted.

Quick repro:

$ mount.cifs //win19.test/data /mnt -o ...,modefromsid
$ touch /mnt/foo && chmod 751 /mnt/foo
$ stat /mnt/foo
  # reports 751 (OK)
$ sleep 2
  # dentries older than 1s get invalidated by default
$ ls -l /mnt
  # since dentry invalid, ls does a Query Dir
  # and reports foo as 755 (WRONG)

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-02-06 17:19:38 -06:00
Pavel Begunkov 1e95081cb5 io_uring: fix deferred req iovec leak
After a deferral, a request is prepared, which includes allocating an
iovec if needed, and is then submitted through io_wq_submit_work() rather
than a custom handler (e.g. io_rw_async()/io_sendrecv_async()). However,
the iovec is then leaked, because the request runs in io-wq and the code
goes as follows:

io_read() {
	if (!io_wq_current_is_worker())
		kfree(iovec);
}

Put all deallocation logic in io_{read,write,send,recv}(), which will
leave the memory allocated when going async with -EAGAIN.

It also fixes a leak after failed io_alloc_async_ctx() in
io_{recv,send}_msg().

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-06 13:58:57 -07:00
Randy Dunlap e1d85334d6 io_uring: fix 1-bit bitfields to be unsigned
Make bitfields of size 1 bit be unsigned (since there is no room
for the sign bit).
This clears up the sparse warnings:

  CHECK   ../fs/io_uring.c
../fs/io_uring.c:207:50: error: dubious one-bit signed bitfield
../fs/io_uring.c:208:55: error: dubious one-bit signed bitfield
../fs/io_uring.c:209:63: error: dubious one-bit signed bitfield
../fs/io_uring.c:210:54: error: dubious one-bit signed bitfield
../fs/io_uring.c:211:57: error: dubious one-bit signed bitfield

Found by sight and then verified with sparse.
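
Why a signed one-bit field is "dubious" can be shown with a small Python
model of two's-complement bitfields. This is a sketch: the helper names are
hypothetical, and it assumes the compiler treats the field as a
two's-complement signed value.

```python
def as_signed_bitfield(raw, width):
    """Interpret the low `width` bits of raw as a two's-complement value."""
    raw &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    # If the top bit of the field is set, the value is negative.
    return raw - (1 << width) if raw & sign_bit else raw


def as_unsigned_bitfield(raw, width):
    """Interpret the low `width` bits of raw as an unsigned value."""
    return raw & ((1 << width) - 1)
```

Under this model a signed 1-bit field can only hold 0 and -1, so a stored 1
reads back as -1 and a comparison against 1 fails; the unsigned variant
holds 0 and 1 as intended.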

Fixes: 69b3e54613 ("io_uring: change io_ring_ctx bool fields into bit fields")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-06 13:41:00 -07:00
Pavel Begunkov 1cb1edb2f5 io_uring: get rid of delayed mm check
Fail fast if we can't grab the mm, so that past that point requests always
have an mm when required. This allows us to remove req->user altogether.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-06 12:53:10 -07:00
Aurelien Aptel cc95b67727 cifs: fix channel signing
The server var was accidentally used as an iterator over the global
list of connections, thus overwriting the passed argument. This
resulted in the wrong signing key being returned for extra channels.

Fix this by using a separate var to iterate.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-02-06 12:42:36 -06:00
Andreas Gruenbacher 6e5e41e2dc gfs2: fix O_SYNC write handling
In gfs2_file_write_iter, for direct writes, the error checking in the buffered
write fallback case is incomplete.  This can cause inode write errors to go
undetected.  Fix and clean up gfs2_file_write_iter along the way.

Based on a proposed fix by Christoph Hellwig <hch@lst.de>.

Fixes: 967bcc91b0 ("gfs2: iomap direct I/O support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-06 18:49:41 +01:00
Christoph Hellwig 4c0e8dda60 gfs2: move setting current->backing_dev_info
Set current->backing_dev_info just around the buffered write calls to
prepare for the next fix.

Fixes: 967bcc91b0 ("gfs2: iomap direct I/O support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-06 17:35:23 +01:00
Abhi Das 7582026f6f gfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0
When the first log header in a journal happens to have a sequence
number of 0, a bug in gfs2_find_jhead() causes it to prematurely exit,
and return an uninitialized jhead with seq 0. This can cause failures
in the caller. For instance, a mount fails in one test case.

The correct behavior is for it to continue searching through the journal
to find the correct journal head with the highest sequence number.
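
The intended behaviour can be sketched in Python. This does not reproduce
the kernel code or the exact bug; it is a simplified model in which the
journal is a plain list of sequence numbers, and the name
find_journal_head is hypothetical. The point it illustrates is that 0 must
be treated as a valid sequence number rather than as "nothing found".

```python
def find_journal_head(seqs):
    """Return (index, seq) of the entry with the highest sequence number.

    `seqs` is a list of sequence numbers in journal order. None (not 0)
    marks "no head found yet", so a header with sequence 0 is still valid.
    Returns None for an empty journal.
    """
    best = None
    for i, seq in enumerate(seqs):
        if best is None or seq > best[1]:
            best = (i, seq)
    return best
```

A journal whose first header has sequence 0 is scanned to the end in this
model, and the entry with the highest sequence number is returned.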

Fixes: f4686c26ec ("gfs2: read journal in large chunks")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-06 17:35:23 +01:00
Dan Carpenter 91fd3c3edc nfsd4: fix double free in nfsd4_do_async_copy()
This frees "copy->nf_src" before and again after the goto.

Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-06 11:22:55 -05:00
Trond Myklebust 689827cd5b nfsd: convert file cache to use over/underflow safe refcount
Use the 'refcount_t' type instead of 'atomic_t' for improved
refcounting safety.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-06 11:22:55 -05:00
Trond Myklebust c19285596d nfsd: Define the file access mode enum for tracing
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-06 11:22:55 -05:00
Trond Myklebust a9ceb060b3 nfsd: Fix a perf warning
perf does not know how to deal with a __builtin_bswap32() call, and
complains. All other functions just store the xid etc in host endian
form, so let's do that in the tracepoint for nfsd_file_acquire too.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-06 11:22:54 -05:00
zhengbin cabdb4fa2f fuse: use true,false for bool variable
Fixes coccicheck warning:

fs/fuse/readdir.c:335:1-19: WARNING: Assignment of 0/1 to bool variable
fs/fuse/file.c:1398:2-19: WARNING: Assignment of 0/1 to bool variable
fs/fuse/file.c:1400:2-20: WARNING: Assignment of 0/1 to bool variable
fs/fuse/cuse.c:454:1-20: WARNING: Assignment of 0/1 to bool variable
fs/fuse/cuse.c:455:1-19: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:497:2-17: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:504:2-23: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:511:2-22: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:518:2-23: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:522:2-26: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:526:2-18: WARNING: Assignment of 0/1 to bool variable
fs/fuse/inode.c:1000:1-20: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-06 16:39:28 +01:00
Vivek Goyal 519525fa47 fuse: Support RENAME_WHITEOUT flag
Allow fuse to pass RENAME_WHITEOUT to fuse server.  Overlayfs on top of
virtiofs uses RENAME_WHITEOUT.

Without this patch renaming a directory in overlayfs (dir is on lower)
fails with -EINVAL. With this patch it works.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-06 16:39:28 +01:00
Miklos Szeredi 2f1398291b fuse: don't overflow LLONG_MAX with end offset
Handle the special case of fuse_readpages() wanting to read the last page
of the largest possible file and overflowing the end offset in the process.

This is basically to unbreak xfstests:generic/525 and prevent filesystems
from doing bad things with an overflowing offset.

Reported-by: Xiao Yang <ice_yangxiao@163.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-06 16:39:28 +01:00
Miklos Szeredi f658adeea4 fix up iter on short count in fuse_direct_io()
fuse_direct_io() can end up advancing the iterator by more than the amount
of data read or written.  This case is handled by the generic code if going
through ->direct_IO(), but not in the FOPEN_DIRECT_IO case.

Fix by reverting the extra bytes from the iterator in case of error or a
short count.

To test: install lxcfs, then the following testcase
  int fd = open("/var/lib/lxcfs/proc/uptime", O_RDONLY);
  sendfile(1, fd, NULL, 16777216);
  sendfile(1, fd, NULL, 16777216);
will spew WARN_ON() in iov_iter_pipe().

Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 3c3db095b6 ("fuse: use iov_iter based generic splice helpers")
Cc: <stable@vger.kernel.org> # v5.1
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-06 16:39:28 +01:00
Steve French d26c2ddd33 cifs: add SMB3 change notification support
A commonly used SMB3 feature is change notification, allowing an
app to be notified about changes to a directory. The SMB3
Notify request blocks until the server detects a change to that
directory or its contents that matches the completion flags
that were passed in and the "watch_tree" flag (which indicates
whether subdirectories under this directory should be also
included).  See MS-SMB2 2.2.35 for additional detail.

To use this simply pass in the following structure to ioctl:

 struct __attribute__((__packed__)) smb3_notify {
        uint32_t completion_filter;
        bool    watch_tree;
 };

 using CIFS_IOC_NOTIFY  0x4005cf09
 or equivalently _IOW(CIFS_IOCTL_MAGIC, 9, struct smb3_notify)

SMB3 change notification is supported by all major servers.
The ioctl will block until the server detects a change to that
directory or its subdirectories (if watch_tree is set).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
2020-02-06 09:14:28 -06:00
Aurelien Aptel 343a1b777a cifs: make multichannel warning more visible
When no interfaces are returned by the server we cannot open multiple
channels. Make it more obvious by reporting that to the user at the
VFS log level.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-02-06 09:12:16 -06:00
Ronnie Sahlberg 09c40b1535 cifs: fix soft mounts hanging in the reconnect code
RHBZ: 1795423

This is the SMB1 version of a patch we already have for SMB2

In recent DFS updates we have a new variable controlling how many times we will
retry to reconnect the share.
If DFS is not used, then this variable is initialized to 0 in:

static inline int
dfs_cache_get_nr_tgts(const struct dfs_cache_tgt_list *tl)
{
        return tl ? tl->tl_numtgts : 0;
}

This means that in the reconnect loop in smb2_reconnect() we will immediately wrap retries to -1
and never actually get to pass this conditional:

                if (--retries)
                        continue;

The effect is that we no longer reach the point where we fail the commands with -EHOSTDOWN
and basically the kernel threads are virtually hung and unkillable.
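
The wrap-around described above can be simulated in Python. This is a model
of the "if (--retries) continue;" pattern only, not of the reconnect code;
the name reconnect_attempts and the `limit` cap are hypothetical and exist
so that the simulation terminates.

```python
def reconnect_attempts(retries, limit=10):
    """Count iterations of a loop using `if (--retries) continue;`.

    Returns the number of attempts made before giving up, or None if
    the counter never reached 0 within `limit` iterations.
    """
    attempts = 0
    while attempts < limit:
        attempts += 1
        retries -= 1
        if retries:       # non-zero (including -1) means "try again"
            continue
        return attempts   # counter reached 0: fail the command
    return None           # counter never reached 0 within the cap
```

Starting from 3, the loop gives up after three attempts. Starting from 0,
the first decrement yields -1, which is non-zero, so the exit path is not
reached and the model returns None.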

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
2020-02-06 09:12:00 -06:00
Linus Torvalds 4c46bef2e9 We have:
- a set of patches that fixes various corner cases in mount and umount
   code (Xiubo Li).  This has to do with choosing an MDS, distinguishing
   between laggy and down MDSes and parsing the server path.
 
 - inode initialization fixes (Jeff Layton).  The one included here
   mostly concerns things like open_by_handle() and there is another
   one that will come through Al.
 
 - copy_file_range() now uses the new copy-from2 op (Luis Henriques).
   The existing copy-from op turned out to be infeasible for generic
   filesystem use; we disable the copy offload if OSDs don't support
   copy-from2.
 
 - a patch to link "rbd" and "block" devices together in sysfs (Hannes
   Reinecke)
 
 And a smattering of cleanups from Xiubo, Jeff and Chengguang.

Merge tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:

 - a set of patches that fixes various corner cases in mount and umount
   code (Xiubo Li). This has to do with choosing an MDS, distinguishing
   between laggy and down MDSes and parsing the server path.

 - inode initialization fixes (Jeff Layton). The one included here
   mostly concerns things like open_by_handle() and there is another one
   that will come through Al.

 - copy_file_range() now uses the new copy-from2 op (Luis Henriques).
   The existing copy-from op turned out to be infeasible for generic
   filesystem use; we disable the copy offload if OSDs don't support
   copy-from2.

 - a patch to link "rbd" and "block" devices together in sysfs (Hannes
   Reinecke)

... and a smattering of cleanups from Xiubo, Jeff and Chengguang.

* tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits)
  rbd: set the 'device' link in sysfs
  ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
  ceph: print name of xattr in __ceph_{get,set}xattr() douts
  ceph: print r_direct_hash in hex in __choose_mds() dout
  ceph: use copy-from2 op in copy_file_range
  ceph: close holes in structs ceph_mds_session and ceph_mds_request
  rbd: work around -Wuninitialized warning
  ceph: allocate the correct amount of extra bytes for the session features
  ceph: rename get_session and switch to use ceph_get_mds_session
  ceph: remove the extra slashes in the server path
  ceph: add possible_max_rank and make the code more readable
  ceph: print dentry offset in hex and fix xattr_version type
  ceph: only touch the caps which have the subset mask requested
  ceph: don't clear I_NEW until inode metadata is fully populated
  ceph: retry the same mds later after the new session is opened
  ceph: check availability of mds cluster on mount after wait timeout
  ceph: keep the session state until it is released
  ceph: add __send_request helper
  ceph: ensure we have a new cap before continuing in fill_inode
  ceph: drop unused ttl_from parameter from fill_inode
  ...
2020-02-06 12:21:01 +00:00
Linus Torvalds 99be3f6098 (More) new code for 5.6:
- Refactor the metadata buffer functions to return the usual int error
 value instead of the open coded error checking mess we have now.

Merge tag 'xfs-5.6-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull moar xfs updates from Darrick Wong:
 "This contains the buffer error code refactoring I mentioned last week,
  now that it has had extra time to complete the full xfs fuzz testing
  suite to make sure there aren't any obvious new bugs"

* tag 'xfs-5.6-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix xfs_buf_ioerror_alert location reporting
  xfs: remove unnecessary null pointer checks from _read_agf callers
  xfs: make xfs_*read_agf return EAGAIN to ALLOC_FLAG_TRYLOCK callers
  xfs: remove the xfs_btree_get_buf[ls] functions
  xfs: make xfs_trans_get_buf return an error code
  xfs: make xfs_trans_get_buf_map return an error code
  xfs: make xfs_buf_read return an error code
  xfs: make xfs_buf_get_uncached return an error code
  xfs: make xfs_buf_get return an error code
  xfs: make xfs_buf_read_map return an error code
  xfs: make xfs_buf_get_map return an error code
  xfs: make xfs_buf_alloc return an error code
2020-02-06 07:58:38 +00:00
Linus Torvalds e310396bb8 Tracing updates:
- Added new "bootconfig".
    Looks for a file appended to initrd to add boot config options.
    This has been discussed thoroughly at Linux Plumbers.
    Very useful for adding kprobes at bootup.
    Only enabled if "bootconfig" is on the real kernel command line.
 
  - Created dynamic event creation.
    Merges common code between creating synthetic events and
      kprobe events.
 
  - Rename perf "ring_buffer" structure to "perf_buffer"
 
  - Rename ftrace "ring_buffer" structure to "trace_buffer"
    Had to rename existing "trace_buffer" to "array_buffer"
 
  - Allow trace_printk() to work within (some) tracing code.
 
  - Sort tracing configs to be a little better organized
 
  - Fixed bug where ftrace_graph hash was not being protected properly
 
  - Various other small fixes and clean ups

Merge tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Added new "bootconfig".

   This looks for a file appended to initrd to add boot config options,
   and has been discussed thoroughly at Linux Plumbers.

   Very useful for adding kprobes at bootup.

   Only enabled if "bootconfig" is on the real kernel command line.

 - Created dynamic event creation.

   Merges common code between creating synthetic events and kprobe
   events.

 - Rename perf "ring_buffer" structure to "perf_buffer"

 - Rename ftrace "ring_buffer" structure to "trace_buffer"

   Had to rename existing "trace_buffer" to "array_buffer"

 - Allow trace_printk() to work within (some) tracing code.

 - Sort tracing configs to be a little better organized

 - Fixed bug where ftrace_graph hash was not being protected properly

 - Various other small fixes and clean ups

* tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits)
  bootconfig: Show the number of nodes on boot message
  tools/bootconfig: Show the number of bootconfig nodes
  bootconfig: Add more parse error messages
  bootconfig: Use bootconfig instead of boot config
  ftrace: Protect ftrace_graph_hash with ftrace_sync
  ftrace: Add comment to why rcu_dereference_sched() is open coded
  tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
  tracing: Annotate ftrace_graph_hash pointer with __rcu
  bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline
  tracing: Use seq_buf for building dynevent_cmd string
  tracing: Remove useless code in dynevent_arg_pair_add()
  tracing: Remove check_arg() callbacks from dynevent args
  tracing: Consolidate some synth_event_trace code
  tracing: Fix now invalid var_ref_vals assumption in trace action
  tracing: Change trace_boot to use synth_event interface
  tracing: Move tracing selftests to bottom of menu
  tracing: Move mmio tracer config up with the other tracers
  tracing: Move tracing test module configs together
  tracing: Move all function tracing configs together
  tracing: Documentation for in-kernel synthetic event API
  ...
2020-02-06 07:12:11 +00:00
Linus Torvalds c1ef57a3a3 io_uring-5.6-2020-02-05

Merge tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Some later fixes for io_uring:

   - Small cleanup series from Pavel

   - Belt and suspenders build time check of sqe size and layout
     (Stefan)

   - Addition of ->show_fdinfo() on request of Jann Horn, to aid in
     understanding mapped personalities

   - eventfd recursion/deadlock fix, for both io_uring and aio

   - Fixup for send/recv handling

   - Fixup for double deferral of read/write request

   - Fix for potential double completion event for close request

   - Adjust fadvise advice async/inline behavior

   - Fix for shutdown hang with SQPOLL thread

   - Fix for potential use-after-free of fixed file table"

* tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-block:
  io_uring: cleanup fixed file data table references
  io_uring: spin for sq thread to idle on shutdown
  aio: prevent potential eventfd recursion on poll
  io_uring: put the flag changing code in the same spot
  io_uring: iterate req cache backwards
  io_uring: punt even fadvise() WILLNEED to async context
  io_uring: fix sporadic double CQE entry for close
  io_uring: remove extra ->file check
  io_uring: don't map read/write iovec potentially twice
  io_uring: use the proper helpers for io_send/recv
  io_uring: prevent potential eventfd recursion on poll
  eventfd: track eventfd_signal() recursion depth
  io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe
  io_uring: add ->show_fdinfo() for the io_uring file descriptor
2020-02-06 06:33:17 +00:00
Jeff Moyer 96222d5384 dax: pass NOWAIT flag to iomap_apply
fstests generic/471 reports a failure when run with MOUNT_OPTIONS="-o
dax".  The reason is that the initial pwrite to an empty file with the
RWF_NOWAIT flag set does not return -EAGAIN.  It turns out that
dax_iomap_rw doesn't pass that flag through to iomap_apply.

With this patch applied, generic/471 passes for me.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/x49r1z86e1d.fsf@segfault.boston.devel.redhat.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-02-05 20:34:32 -08:00
Steve French f2bf09e97b cifs: Add tracepoints for errors on flush or fsync
Makes it easier to debug errors on writeback that happen later,
and are being returned on flush or fsync

For example:
  writetest-17829 [002] .... 13583.407859: cifs_flush_err: ino=90 rc=-28

Signed-off-by: Steve French <stfrench@microsoft.com>
2020-02-05 18:24:19 -06:00
Steve French d6fd41905e cifs: log warning message (once) if out of disk space
We ran into a confusing problem where an application wasn't checking
the return code on close, so the user didn't realize that the application
ran out of disk space.  Log a warning message (once) in these
cases. For example:

  [ 8407.391909] Out of space writing to \\oleg-server\small-share

Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Oleg Kravtsov <oleg@tuxera.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-02-05 17:58:52 -06:00
Ronnie Sahlberg b0dd940e58 cifs: fail i/o on soft mounts if sessionsetup errors out
RHBZ: 1579050

If we have a soft mount we should fail commands for session-setup
failures (such as the password having changed, the account being deleted, ...)
and return an error back to the application.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2020-02-05 06:32:41 -06:00