Create a parallel iwalk implementation and switch quotacheck to use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
kmemdup is introduced to duplicate a region of memory in a neat way.
Rather than kmalloc/kzalloc + memcpy, which requires the programmer to
write the size twice (sometimes leading to mistakes), kmemdup improves
readability, leads to smaller code and also reduces the chance of mistakes.
Use kmemdup rather than open-coding kmalloc/kzalloc + memcpy.
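A minimal before/after sketch of the conversion (variable and size names
are illustrative, not taken from the actual patch):

    /* before: allocate, then copy, spelling out the size twice */
    p = kmalloc(len, GFP_KERNEL);
    if (!p)
            return -ENOMEM;
    memcpy(p, src, len);

    /* after: one call, the size is written only once */
    p = kmemdup(src, len, GFP_KERNEL);
    if (!p)
            return -ENOMEM;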
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Link: https://lore.kernel.org/r/20190703131727.25735-1-huangfq.daxian@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Rewrite gfs2_allocate_page_backing to call gfs2_iomap_get_alloc and operate on
struct iomap directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
No need to indirect through get_blocks and buffer_heads when we can just use
the iomap version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
There is no need to keep these two functions separate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The only difference between the two is that gfs2_ordered_aops sets the
set_page_dirty method to __set_page_dirty_buffers, but given that
__set_page_dirty_buffers is the default, if no method is set, there is no need
to do that. Merge the two sets of operations into one.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Merge tag '5.2-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fix from Steve French:
"SMB3 fix (for stable as well) for crash mishandling one of the Windows
reparse point symlink tags"
* tag '5.2-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix crash querying symlinks stored as reparse-points
This function was overlooked when the write_begin and write_end address space
operations were removed as part of gfs2's iomap conversion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Without casting page->index to a guaranteed 64-bit type, the value might be
treated as 32-bit on 32-bit platforms and thus get truncated.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch allows fallocate to allocate physical blocks for pinned file.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The discard thread should issue up to dpolicy->max_requests at once
and wait for all those discard requests only once it reaches
dpolicy->max_requests. It should then sleep for the dpolicy->min_interval
timeout before issuing the next batch of discard requests. But in the
current code, is_idle() checks dcc_info->queued_discard and
aborts issuing the discard batch of max_requests. This
dcc_info->queued_discard will always be true once one discard command
is issued.
It is thus resulting into this type of discard request pattern -
- Issue discard request#1
- is_idle() returns false, discard thread waits for request#1 and then
sleeps for min_interval 50ms.
- Issue discard request#2
- is_idle() returns false, discard thread waits for request#2 and then
sleeps for min_interval 50ms.
- and so on for all other discard requests, assuming f2fs is idle w.r.t
other conditions.
With this fix, the pattern will look like this -
- Issue discard request#1
- Issue discard request#2
and so on up to max_requests of 8
- Issue discard request#8
- wait for min_interval 50ms.
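A loose sketch of the intended batching loop; issue_one_discard(),
wait_for_queued_discards() and have_pending_discard() are placeholders,
not the real f2fs helpers:

    unsigned int issued = 0;

    while (issued < dpolicy->max_requests && have_pending_discard(dcc)) {
            issue_one_discard(dcc);         /* queue it, don't wait yet */
            issued++;
    }
    wait_for_queued_discards(dcc);          /* wait for the whole batch */
    msleep(dpolicy->min_interval);          /* e.g. 50ms before next batch */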
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Two paths to update quota and f2fs_lock_op:
1.
- lock_op
| - quota_update
`- unlock_op
2.
- quota_update
- lock_op
`- unlock_op
But, we need to make a transaction covering quota_update + lock_op in the #2 case.
So, this patch introduces:
1. lock_op
2. down_write
3. check __need_flush
4. up_write
5. if there are dirty quota entries, flush them
6. otherwise, good to go
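A loose sketch mirroring the steps above; __need_flush() and the
quota_sem name are illustrative, and the unlock_op/error paths are
omitted:

    f2fs_lock_op(sbi);                      /* 1 */
    down_write(&sbi->quota_sem);            /* 2 */
    need_flush = __need_flush(sbi);         /* 3: dirty quota entries? */
    up_write(&sbi->quota_sem);              /* 4 */
    if (need_flush)
            f2fs_quota_sync(sbi->sb, -1);   /* 5: flush them */
    /* 6: otherwise, good to go */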
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If xattr is corrupted, let's print kernel message and set SBI_NEED_FSCK
for further repair.
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs has always used EFAULT as the error number to indicate that the
filesystem is corrupted, but generic filesystems use EUCLEAN for such a
condition, so we need to change to follow the others.
This patch adds two new macros as below to wrap more generic error
code macros, and spread them in code.
EFSBADCRC EBADMSG /* Bad CRC detected */
EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
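In C form the wrappers amount to the following, with callers then
returning -EFSCORRUPTED (or -EFSBADCRC) instead of -EFAULT:

    #define EFSBADCRC       EBADMSG         /* Bad CRC detected */
    #define EFSCORRUPTED    EUCLEAN         /* Filesystem is corrupted */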
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Replace the open-coded divisions with round-up by calls to the
DIV_ROUND_UP() helper macro.
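An illustrative conversion (the operands are hypothetical, not the actual
f2fs call sites):

    /* before: open-coded round-up division */
    nsegs = (total_blocks + blks_per_seg - 1) / blks_per_seg;

    /* after */
    nsegs = DIV_ROUND_UP(total_blocks, blks_per_seg);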
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Pavel reported, once we detect filesystem inconsistency in
f2fs_inplace_write_data(), it will be better to print kernel message as
we did in other places.
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- Add and use f2fs_<level> macros
- Convert f2fs_msg to f2fs_printk
- Remove level from f2fs_printk and embed the level in the format
- Coalesce formats and align multi-line arguments
- Remove the unnecessary duplicate extern declaration of f2fs_msg from f2fs.h
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This ioctl shrinks a given length (aligned to sections) from end of the
main area. Any cursegs and valid blocks will be moved out before
invalidating the range.
This feature can be used for adjusting partition sizes online.
History of the patch:
Sahitya Tummala:
- Add this ioctl for f2fs_compat_ioctl() as well.
- Fix debugfs status to reflect the online resize changes.
- Fix potential race between online resize path and allocate new data
block path or gc path.
Others:
- Rename some identifiers.
- Add some error handling branches.
- Clear sbi->next_victim_seg[BG_GC/FG_GC] in shrinking range.
- Implement this interface like ext4's, and change the parameter from the
number of bytes to shrink to the new block count of F2FS.
- During resizing, force to empty sit_journal and forbid adding new
entries to it, in order to avoid invalid segno in journal after resize.
- Reduce sbi->user_block_count before resize starts.
- Commit the updated superblock first, and then update in-memory metadata
only when the former succeeds.
- Target block count must align to sections.
- Write checkpoint before and after committing the new superblock, w/o
CP_FSCK_FLAG respectively, so that the FS can be fixed by fsck even if
resize fails after the new superblock is committed.
- In free_segment_range(), reduce granularity of gc_mutex.
- Add protection on curseg migration.
- Add freeze_bdev() and thaw_bdev() for resize fs.
- Remove CUR_MAIN_SECS and use MAIN_SECS directly for allocation.
- Recover super_block and FS metadata when resize fails.
- No need to clear CP_FSCK_FLAG in update_ckpt_flags().
- Clean up the sb and fs metadata update functions for resize_fs.
Geert Uytterhoeven:
- Use div_u64*() for 64-bit divisions
Arnd Bergmann:
- Not all architectures support get_user() with a 64-bit argument:
ERROR: "__get_user_bad" [fs/f2fs/f2fs.ko] undefined!
Use copy_from_user() here, this will always work.
Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The kobj_type default_attrs field is being replaced by the
default_groups field. Replace the default_attrs field in ext4_sb_ktype
and ext4_feat_ktype with default_groups. Use the ATTRIBUTE_GROUPS macro
to create ext4_groups and ext4_feat_groups.
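A sketch of the resulting pattern; the members other than default_groups
are shown only for context:

    static struct attribute *ext4_attrs[] = {
            /* ... existing attribute pointers ... */
            NULL,
    };
    ATTRIBUTE_GROUPS(ext4);                 /* generates ext4_groups */

    static struct kobj_type ext4_sb_ktype = {
            .default_groups = ext4_groups,
            .sysfs_ops      = &ext4_attr_ops,
            .release        = ext4_sb_release,
    };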
Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Change return type of ecryptfs_process_flags from int to void as it
never fails.
Fixes the below issue reported by coccicheck:
fs/ecryptfs/crypto.c:870:5-7: Unneeded variable: "rc". Return "0" on line
883
Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
[tyhicks: Remove the return value line from the function documentation]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
The code hasn't been used since it was added to the tree, and doesn't
appear to actually be usable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that we have generic functions to walk inode records, refactor the
INUMBERS implementation to use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Refactor xfs_iwalk_ag_start and xfs_iwalk_ag so that the bits that are
particular to bulkstat (trimming the start irec, starting inode
readahead, and skipping empty groups) can be controlled via flags in the
iwag structure.
This enables us to add a new function to walk all inobt records which
will be used for the new INUMBERS implementation in the next patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
In preparation for reusing the iwalk code for the inogrp walking code
(aka INUMBERS), move the initial inobt lookup and retrieval code out of
xfs_iwalk_grab_ichunk so that we call the masking code only when we need
to trim out the inodes that came before the cursor in the inobt record
(aka BULKSTAT).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Refactor xfs_iwalk_ichunk_ra to avoid long conditionals.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Now that the inode chunk grabbing function is a static function in the
iwalk code, change its behavior so that @agino is the inode where we
want to /start/ the iteration. This reduces cognitive friction with the
callers and simplifies the code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Now that we've reworked the bulkstat code to use iwalk, we can move the
old bulkstat ichunk helpers to xfs_iwalk.c. No functional changes here.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The existing inode walk prefetch is based on the old bulkstat code,
which simply allocated 4 pages worth of memory and prefetched that many
inobt records, regardless of however many inodes the caller requested.
65536 inodes is a lot to prefetch (~32M on x64, ~512M on arm64) so let's
scale things down a little more intelligently based on the number of
inodes requested, etc.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create a new ibulk structure incore to help us deal with bulk inode stat
state tracking and then convert the bulkstat code to use the new iwalk
iterator. This disentangles inode walking from bulk stat control for
simpler code and enables us to isolate the formatter functions to the
ioctl handling code.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When userspace passes in a @lastip pointer we should copy the results
back, even if the @ocount pointer is NULL.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Convert quotacheck to use the new iwalk iterator to dig through the
inodes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create a new iterator function to simplify walking inodes in an XFS
filesystem. This new iterator will replace the existing open-coded
walking that goes on in various places.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Currently, xfs doesn't have generic error codes defined for "stop
iterating"; we just reuse the XFS_BTREE_QUERY_* return values. This
looks a little weird if we're not actually iterating a btree index.
Before we start adding more iterators, we should create general
XFS_ITER_{CONTINUE,ABORT} return values and define the XFS_BTREE_QUERY_*
ones from that.
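A sketch of the shape of that change (the exact values and header
placement are assumptions):

    /* generic iteration return values */
    #define XFS_ITER_CONTINUE               (0)
    #define XFS_ITER_ABORT                  (1)

    /* the btree query codes become aliases of the generic ones */
    #define XFS_BTREE_QUERY_RANGE_CONTINUE  XFS_ITER_CONTINUE
    #define XFS_BTREE_QUERY_RANGE_ABORT     XFS_ITER_ABORT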
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
This moves everything out of extent-tree.c to block-rsv.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
block_rsv_release_bytes() is internal to the block_rsv code, and
shouldn't be called directly by anything else. Switch all users to the
exported helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This works for all callers already, but if we wanted to use the helper
for the global_block_rsv it would end up trying to refill itself. Fix
the logic to be able to be used no matter which block rsv is passed in
to this helper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The delalloc reserve stuff calls this directly because it cares about
the qgroup accounting stuff, so export it so we can move it around. Fix
btrfs_block_rsv_release() to just be a static inline since it just calls
__btrfs_block_rsv_release() with NULL for the qgroup stuff.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used in a few places, we need to make sure it's exported so we
can move it around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prep work for separating out all of the block_rsv related code into its
own file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need an if-else-if chain where we can use a simple OR since
both conditions are performing the same action. The short-circuit for OR
will ensure that if the first condition is true, can_overcommit() is not
called.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've moved all of the users to space-info.c, unexport it and
name it back to can_overcommit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This moves all of the metadata reservation code into space-info.c.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We'll need this exported so we can use it in all the various ways we need
to use it. This is prep work to move reserve_metadata_bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to need this to move the metadata reservation stuff to
space_info.c.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've moved all the pre-requisite stuff, move these two
functions.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Also rename it to btrfs_space_info_update_* so it's clear what we're
updating.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the first piece of moving the space reservation code to
space-info.c
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are the basic init and lookup functions and some helper functions,
fairly straightforward before the bad stuff starts.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prep work for consolidating all of the space_info code into one file.
We need to export these so multiple files can use them.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Really we just need the enum, but as we break more things up it'll help
to have this external to extent-tree.c.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Migrate the struct definition and the one helper that's in ctree.h into
space-info.h
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The block device is passed around for the sole purpose of setting it in
new bios. Move the assignment one level up. This is a preparatory patch for
further bdev cleanups.
Signed-off-by: David Sterba <dsterba@suse.com>
Minimum stripe count matches the minimum devices required for a given
profile. The open coded assignments match the raid_attr table.
What's changed here is the meaning for RAID5/6. Previously their
min_stripes would be 1, while now it's devs_min. This however should be
the same as before because it's not possible to create filesystem on
fewer devices than the raid_attr table allows.
There's no adjustment regarding the parity stripes (like
calc_data_stripes does), because we're interested in overall space that
would fit on the devices.
Missing devices make no difference for the whole calculation, we have
the size stored in the structures.
Signed-off-by: David Sterba <dsterba@suse.com>
Special case for DUP can be replaced by lookup to the attribute table,
where the dev_stripes is the right coefficient.
Signed-off-by: David Sterba <dsterba@suse.com>
A few more instances where we don't need to specify the values as long as
they are the same ones the enum assigns automatically. All of the enums are
in-memory only and nothing relies on the exact values.
Signed-off-by: David Sterba <dsterba@suse.com>
We have been seeing issues in production where a cleaner script will end
up unlinking a bunch of files that have pending iputs. This means they
will get their final iput's run at btrfs-cleaner time and thus are not
throttled, which impacts the workload.
Since we are unlinking these files we can just drop the delayed iput at
unlink time. We are already holding a reference to the inode so this
will not be the final iput and thus is completely safe to do at this
point. Doing this means we are more likely to be doing the final iput
at unlink time, and thus will get the IO charged to the caller and get
throttled appropriately without affecting the main workload.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the range for which we are punching a hole covers only part of a page,
we end up updating the inode item but we skip the update of the inode's
iversion, mtime and ctime. Fix that by ensuring we update those properties
of the inode.
A patch for fstests test case generic/059 that tests this has been sent
along with this fix.
Fixes: 2aaa665581 ("Btrfs: add hole punching")
Fixes: e8c1c76e80 ("Btrfs: add missing inode update when punching hole")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to avoid searches on a log tree when unlinking an inode, we check
if the inode being unlinked was logged in the current transaction, as well
as the inode of its parent directory. When any of the inodes are logged,
we proceed to delete directory items and inode reference items from the
log, to ensure that if a subsequent fsync of only the inode being unlinked
or only of the parent directory when the other is not fsync'ed as well,
does not result in the entry still existing after a power failure.
That check however is not reliable when one of the inodes involved (the
one being unlinked or its parent directory's inode) is evicted, since the
logged_trans field is transient, that is, it is not stored on disk, so it
is lost when the inode is evicted and loaded into memory again (which is
set to zero on load). As a consequence the checks currently being done by
btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always
return true if the inode was evicted before, regardless of the inode
having been logged or not before (and in the current transaction). This
results in the unlinked dentry still existing after a log replay if, after
the unlink operation, only one of the inodes involved is fsync'ed.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ touch /mnt/dir/foo
$ xfs_io -c fsync /mnt/dir/foo
# Keep an open file descriptor on our directory while we evict inodes.
# We just want to evict the file's inode, the directory's inode must not
# be evicted.
$ ( cd /mnt/dir; while true; do :; done ) &
$ pid=$!
# Wait a bit to give time to background process to chdir to our test
# directory.
$ sleep 0.5
# Trigger eviction of the file's inode.
$ echo 2 > /proc/sys/vm/drop_caches
# Unlink our file and fsync the parent directory. After a power failure
# we don't expect to see the file anymore, since we fsync'ed the parent
# directory.
$ rm -f /mnt/dir/foo
$ xfs_io -c fsync /mnt/dir
<power failure>
$ mount /dev/sdb /mnt
$ ls /mnt/dir
foo
$
--> file still there, unlink not persisted despite explicit fsync on dir
Fix this by checking if the inode has the full_sync bit set in its runtime
flags as well, since that bit is set every time an inode is loaded from
disk, or for other less common cases such as after a shrinking truncate
or failure to allocate extent maps for holes, and gets cleared after the
first fsync. Also consider the inode as possibly logged only if it was
last modified in the current transaction (besides having the full_sync
flag set).
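A hedged sketch of such a check (simplified; the actual helper in
tree-log.c may differ in detail):

    static bool inode_logged(struct btrfs_trans_handle *trans,
                             struct btrfs_inode *inode)
    {
            if (inode->logged_trans == trans->transid)
                    return true;

            /*
             * After eviction logged_trans is zeroed, so only treat the
             * inode as possibly logged if it was modified in this
             * transaction and still carries the full_sync bit that is
             * set when it is loaded from disk.
             */
            if (inode->last_trans == trans->transid &&
                test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags))
                    return true;

            return false;
    }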
Fixes: 3a5f1d458a ("Btrfs: Optimize btree walking while logging inodes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Presently btrfs_map_block is used not only to do everything necessary to
map a bio to the underlying allocation profile but it's also used to
identify how much data could be written based on btrfs' stripe logic
without actually submitting anything. This is achieved by passing NULL
for 'bbio_ret' parameter.
This patch refactors all callers that require just the mapping length
by switching them to using btrfs_io_geometry instead of calling
btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a structure that holds various parameters for IO calculations and a
helper that fills the values. This will help further refactoring and
reduction of functions that in some way open-coded the calculations.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the messages printed after setting an incompat feature are
cryptic, we can easily make them better as the textual description is
passed to the helpers. Old:
setting 128 feature flag
updated:
setting incompat feature flag for RAID56 (0x80)
Signed-off-by: David Sterba <dsterba@suse.com>
gcc sometimes can't determine whether a variable has been initialized
when both the initialization and the use are conditional:
fs/btrfs/props.c: In function 'inherit_props':
fs/btrfs/props.c:389:4: error: 'num_bytes' may be used uninitialized in this function [-Werror=maybe-uninitialized]
btrfs_block_rsv_release(fs_info, trans->block_rsv,
This code is fine. Unfortunately, I cannot think of a good way to
rephrase it in a way that makes gcc understand this, so I add a bogus
initialization the way one should not.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ gcc 8 and 9 don't emit the warning ]
Signed-off-by: David Sterba <dsterba@suse.com>
Send always operates on read-only trees and always expects that while it
is in progress, nothing changes in those trees. Due to that expectation
and the fact that send is a read-only operation, it operates on commit
roots and does not hold transaction handles. However relocation can COW
nodes and leafs from read-only trees, which can cause unexpected failures
and crashes (hitting BUG_ONs). While send is using a node/leaf, it can get
COWed, the transaction used to COW it is committed, a new transaction
starts, the extent previously used for that node/leaf gets allocated,
possibly for another tree, and the respective extent buffer's content
changes while send is still using it. When this happens send normally
fails with EIO being returned to user space and messages like the
following are found in dmesg/syslog:
[ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
[ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
Other times, less often, we hit a BUG_ON() because an extent buffer that
send is using used to be a node, and while send is still using it, it
got COWed and got reused as a leaf while send is still using it, producing
the following trace:
[ 3478.466280] ------------[ cut here ]------------
[ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
[ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
[ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
[ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
[ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
[ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
[ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
[ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
[ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
[ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3478.479003] Call Trace:
[ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
[ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
[ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
[ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
[ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[ 3478.483581] ? rq_clock_task+0x2e/0x60
[ 3478.484086] ? wake_up_new_task+0x1f3/0x370
[ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
[ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
[ 3478.486016] ? __fget+0x113/0x200
[ 3478.486467] ksys_ioctl+0x70/0x80
[ 3478.486911] __x64_sys_ioctl+0x16/0x20
[ 3478.487337] do_syscall_64+0x60/0x1b0
[ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
(...)
[ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
[ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
[ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
[ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
[ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
(...)
[ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
Another possibility, much less likely to happen, is that send will not
fail but the contents of the stream it produces may not be correct.
To avoid this, do not allow send and relocation (balance) to run in
parallel. In the long term the goal is to allow for both to be able to
run concurrently without any problems, but that will take a significant
effort in development and testing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory patch for additional RAID1 profiles with more copies. The
mask will contain 3-copy and 4-copy, most of the checks for plain RAID1
work the same for the other profiles.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Lockdep will report the following circular locking dependency:
WARNING: possible circular locking dependency detected
5.2.0-rc2-custom #24 Tainted: G O
------------------------------------------------------
btrfs/8631 is trying to acquire lock:
000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs]
but task is already holding lock:
000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&fs_info->tree_log_mutex){+.+.}:
__mutex_lock+0x76/0x940
mutex_lock_nested+0x1b/0x20
btrfs_commit_transaction+0x475/0xa00 [btrfs]
btrfs_commit_super+0x71/0x80 [btrfs]
close_ctree+0x2bd/0x320 [btrfs]
btrfs_put_super+0x15/0x20 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x16/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x80
deactivate_super+0x51/0x60
cleanup_mnt+0x3f/0x80
__cleanup_mnt+0x12/0x20
task_work_run+0x94/0xb0
exit_to_usermode_loop+0xd8/0xe0
do_syscall_64+0x210/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #1 (&fs_info->reloc_mutex){+.+.}:
__mutex_lock+0x76/0x940
mutex_lock_nested+0x1b/0x20
btrfs_commit_transaction+0x40d/0xa00 [btrfs]
btrfs_quota_enable+0x2da/0x730 [btrfs]
btrfs_ioctl+0x2691/0x2b40 [btrfs]
do_vfs_ioctl+0xa9/0x6d0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x65/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}:
lock_acquire+0xa7/0x190
__mutex_lock+0x76/0x940
mutex_lock_nested+0x1b/0x20
btrfs_qgroup_inherit+0x40/0x620 [btrfs]
create_pending_snapshot+0x9d7/0xe60 [btrfs]
create_pending_snapshots+0x94/0xb0 [btrfs]
btrfs_commit_transaction+0x415/0xa00 [btrfs]
btrfs_mksubvol+0x496/0x4e0 [btrfs]
btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
btrfs_ioctl+0xa90/0x2b40 [btrfs]
do_vfs_ioctl+0xa9/0x6d0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x65/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Chain exists of:
&fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->tree_log_mutex);
lock(&fs_info->reloc_mutex);
lock(&fs_info->tree_log_mutex);
lock(&fs_info->qgroup_ioctl_lock#2);
*** DEADLOCK ***
6 locks held by btrfs/8631:
#0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60
#1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs]
#2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs]
#3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs]
#4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs]
#5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
[CAUSE]
Due to the delayed subvolume creation, we need to call
btrfs_qgroup_inherit() inside commit transaction code, with a lot of
other mutexes held.
This hellish lock chain can lead to the above problem.
[FIX]
On the other hand, we don't really need to hold qgroup_ioctl_lock if
we're in the context of create_pending_snapshot(), as in that context
we're the only one able to modify the qgroups.
All other qgroup functions which need qgroup_ioctl_lock are either
holding a transaction handle, or will start a new transaction:
Functions which start a new transaction:
* btrfs_quota_enable()
* btrfs_quota_disable()
Functions which hold a transaction handle:
* btrfs_add_qgroup_relation()
* btrfs_del_qgroup_relation()
* btrfs_create_qgroup()
* btrfs_remove_qgroup()
* btrfs_limit_qgroup()
* btrfs_qgroup_inherit() call inside create_subvol()
So we have a higher level protection provided by transaction, thus we
don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit().
Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold
qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in
create_pending_snapshot() is already protected by transaction.
So the fix is to detect the context by checking
trans->transaction->state.
If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction
context and no need to get the mutex.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay reported the following KASAN splat when running btrfs/048:
[ 1843.470920] ==================================================================
[ 1843.471971] BUG: KASAN: slab-out-of-bounds in strncmp+0x66/0xb0
[ 1843.472775] Read of size 1 at addr ffff888111e369e2 by task btrfs/3979
[ 1843.473904] CPU: 3 PID: 3979 Comm: btrfs Not tainted 5.2.0-rc3-default #536
[ 1843.475009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 1843.476322] Call Trace:
[ 1843.476674] dump_stack+0x7c/0xbb
[ 1843.477132] ? strncmp+0x66/0xb0
[ 1843.477587] print_address_description+0x114/0x320
[ 1843.478256] ? strncmp+0x66/0xb0
[ 1843.478740] ? strncmp+0x66/0xb0
[ 1843.479185] __kasan_report+0x14e/0x192
[ 1843.479759] ? strncmp+0x66/0xb0
[ 1843.480209] kasan_report+0xe/0x20
[ 1843.480679] strncmp+0x66/0xb0
[ 1843.481105] prop_compression_validate+0x24/0x70
[ 1843.481798] btrfs_xattr_handler_set_prop+0x65/0x160
[ 1843.482509] __vfs_setxattr+0x71/0x90
[ 1843.483012] __vfs_setxattr_noperm+0x84/0x130
[ 1843.483606] vfs_setxattr+0xac/0xb0
[ 1843.484085] setxattr+0x18c/0x230
[ 1843.484546] ? vfs_setxattr+0xb0/0xb0
[ 1843.485048] ? __mod_node_page_state+0x1f/0xa0
[ 1843.485672] ? _raw_spin_unlock+0x24/0x40
[ 1843.486233] ? __handle_mm_fault+0x988/0x1290
[ 1843.486823] ? lock_acquire+0xb4/0x1e0
[ 1843.487330] ? lock_acquire+0xb4/0x1e0
[ 1843.487842] ? mnt_want_write_file+0x3c/0x80
[ 1843.488442] ? debug_lockdep_rcu_enabled+0x22/0x40
[ 1843.489089] ? rcu_sync_lockdep_assert+0xe/0x70
[ 1843.489707] ? __sb_start_write+0x158/0x200
[ 1843.490278] ? mnt_want_write_file+0x3c/0x80
[ 1843.490855] ? __mnt_want_write+0x98/0xe0
[ 1843.491397] __x64_sys_fsetxattr+0xba/0xe0
[ 1843.492201] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1843.493201] do_syscall_64+0x6c/0x230
[ 1843.493988] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1843.495041] RIP: 0033:0x7fa7a8a7707a
[ 1843.495819] Code: 48 8b 0d 21 de 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 be 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee dd 2b 00 f7 d8 64 89 01 48
[ 1843.499203] RSP: 002b:00007ffcb73bca38 EFLAGS: 00000202 ORIG_RAX: 00000000000000be
[ 1843.500210] RAX: ffffffffffffffda RBX: 00007ffcb73bda9d RCX: 00007fa7a8a7707a
[ 1843.501170] RDX: 00007ffcb73bda9d RSI: 00000000006dc050 RDI: 0000000000000003
[ 1843.502152] RBP: 00000000006dc050 R08: 0000000000000000 R09: 0000000000000000
[ 1843.503109] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffcb73bda91
[ 1843.504055] R13: 0000000000000003 R14: 00007ffcb73bda82 R15: ffffffffffffffff
[ 1843.505268] Allocated by task 3979:
[ 1843.505771] save_stack+0x19/0x80
[ 1843.506211] __kasan_kmalloc.constprop.5+0xa0/0xd0
[ 1843.506836] setxattr+0xeb/0x230
[ 1843.507264] __x64_sys_fsetxattr+0xba/0xe0
[ 1843.507886] do_syscall_64+0x6c/0x230
[ 1843.508429] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1843.509558] Freed by task 0:
[ 1843.510188] (stack is not available)
[ 1843.511309] The buggy address belongs to the object at ffff888111e369e0
which belongs to the cache kmalloc-8 of size 8
[ 1843.514095] The buggy address is located 2 bytes inside of
8-byte region [ffff888111e369e0, ffff888111e369e8)
[ 1843.516524] The buggy address belongs to the page:
[ 1843.517561] page:ffff88813f478d80 refcount:1 mapcount:0 mapping:ffff88811940c300 index:0xffff888111e373b8 compound_mapcount: 0
[ 1843.519993] flags: 0x4404000010200(slab|head)
[ 1843.520951] raw: 0004404000010200 ffff88813f48b008 ffff888119403d50 ffff88811940c300
[ 1843.522616] raw: ffff888111e373b8 000000000016000f 00000001ffffffff 0000000000000000
[ 1843.524281] page dumped because: kasan: bad access detected
[ 1843.525936] Memory state around the buggy address:
[ 1843.526975] ffff888111e36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.528479] ffff888111e36900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.530138] >ffff888111e36980: fc fc fc fc fc fc fc fc fc fc fc fc 02 fc fc fc
[ 1843.531877] ^
[ 1843.533287] ffff888111e36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.534874] ffff888111e36a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.536468] ==================================================================
This is caused by supplying a too short compression value ('lz') in the
test-case and comparing it to 'lzo' with strncmp() and a length of 3.
strncmp() read past the 'lz' when looking for the 'o' and thus caused an
out-of-bounds read.
Introduce a new check 'btrfs_compress_is_valid_type()' which not only
checks the user-supplied value against known compression types, but also
employs checks for too short values.
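A hedged sketch of such a helper; the table of type names is an
illustrative assumption:

    bool btrfs_compress_is_valid_type(const char *str, size_t len)
    {
            static const char * const types[] = { "zlib", "lzo", "zstd" };
            int i;

            for (i = 0; i < ARRAY_SIZE(types); i++) {
                    size_t comp_len = strlen(types[i]);

                    /* a value shorter than the type name can never match */
                    if (len < comp_len)
                            continue;

                    if (!strncmp(str, types[i], comp_len))
                            return true;
            }
            return false;
    }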
Reported-by: Nikolay Borisov <nborisov@suse.com>
Fixes: 272e5326c7 ("btrfs: prop: fix vanished compression property after failed set")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we log an inode, regardless of logging it completely or only that it
exists, we always update it as logged (logged_trans and last_log_commit
fields of the inode are updated). This is generally fine and avoids future
attempts to log it from having to do repeated work that brings no value.
However, if we write data to a file, then evict its inode after all the
delalloc was flushed (and ordered extents completed), rename the file and
fsync it, we end up not logging the new extents, since the rename may
result in logging that the inode exists in case the parent directory was
logged before. The following reproducer shows and explains how this can
happen:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ touch /mnt/dir/foo
$ touch /mnt/dir/bar
# Do a direct IO write instead of a buffered write because with a
# buffered write we would need to make sure delalloc gets flushed and
# completes before we do the inode eviction later, and we can not do that
# from user space with call to things such as sync(2) since that results
# in a transaction commit as well.
$ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
# Keep the directory dir in use while we evict inodes. We want our file
# bar's inode to be evicted but we don't want our directory's inode to
# be evicted (if it were evicted too, we would not be able to reproduce
# the issue since the first fsync below, of file foo, would result in a
# transaction commit).
$ ( cd /mnt/dir; while true; do :; done ) &
$ pid=$!
# Wait a bit to give time for the background process to chdir.
$ sleep 0.1
# Evict all inodes, except the inode for the directory dir because it is
# currently in use by our background process.
$ echo 2 > /proc/sys/vm/drop_caches
# fsync file foo, which ends up persisting information about the parent
# directory because it is a new inode.
$ xfs_io -c fsync /mnt/dir/foo
# Rename bar, this results in logging that this inode exists (inode item,
# names, xattrs) because the parent directory is in the log.
$ mv /mnt/dir/bar /mnt/dir/baz
# Now fsync baz, which ends up doing absolutely nothing because of the
# rename operation which logged that the inode exists only.
$ xfs_io -c fsync /mnt/dir/baz
<power failure>
$ mount /dev/sdb /mnt
$ od -t x1 -A d /mnt/dir/baz
0000000
--> Empty file, data we wrote is missing.
Fix this by not updating last_sub_trans of an inode when we are logging
only that it exists and the inode was not yet logged since it was loaded
from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
return false for this scenario and make us log the inode. The logged_trans
of the inode is still always set, since that alone is used to track if names
need to be deleted as part of unlink operations.
Fixes: 257c62e1bc ("Btrfs: avoid tree log commit when there are no changes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The incompat bit for RAID56 is set either at mount time or automatically
when the profile is used by balance. The part where the bit is removed
is missing and can be unexpected or undesired when an older kernel is
needed.
This patch will drop the incompat bit after this command, assuming
that RAID5 profile is not used by system or metadata:
$ btrfs balance start -dconvert=raid5 /mnt
$ btrfs balance start -dconvert=raid1 /mnt
This will print "clearing 128 feature flag" to the system log.
The patch is safe for backporting to older kernels.
Reported-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: David Sterba <dsterba@suse.com>
The write_locks is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The spinning_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.
Signed-off-by: David Sterba <dsterba@suse.com>
The blocking_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Turn the comment about required lock into an assertion.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The DAX flag only applies to files and directories, so don't let it get
set for other types of files.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Move the extent size hint checks that aren't xfs-specific to the vfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Create a generic checking function for the incoming FS_IOC_FSSETXATTR
fsxattr values so that we can standardize some of the implementation
behaviors.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Create a generic function to check incoming FS_IOC_SETFLAGS flag values
and later prepare the inode for updates so that we can standardize the
implementations that follow ext4's flag values.
Note that the efivarfs implementation no longer fails a no-op SETFLAGS
without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
sys_move_mount() crashes by dereferencing the pointer MNT_NS_INTERNAL,
a.k.a. ERR_PTR(-EINVAL), if the old mount is specified by fd for a
kernel object with an internal mount, such as a pipe or memfd.
Fix it by checking for this case and returning -EINVAL.
[AV: what we want is is_mounted(); use that instead of making the
condition even more convoluted]
Reproducer:
#include <unistd.h>
#define __NR_move_mount 429
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
int main()
{
int fds[2];
pipe(fds);
syscall(__NR_move_mount, fds[0], "", -1, "/", MOVE_MOUNT_F_EMPTY_PATH);
}
Reported-by: syzbot+6004acbaa1893ad013f0@syzkaller.appspotmail.com
Fixes: 2db154b3ea ("vfs: syscall: Add move_mount(2) to move mounts around")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
'bio->bi_iter.bi_size' is 'unsigned int', which at most hold 4G - 1
bytes.
Before 07173c3ec2 ("block: enable multipage bvecs"), one bio could
include only a very limited number of pages, usually at most 256, so the
fs bio size wouldn't be bigger than 1M bytes most of the time.
Since we support multi-page bvecs, in theory more than 1M pages really
can be added to one fs bio, especially in case of hugepages, or big
writeback with too many dirty pages. Then there is a chance that .bi_size
overflows.
Fix this issue by using bio_full() to check if the added segment may
overflow .bi_size.
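A hedged sketch of what such a bio_full(bio, len) check amounts to (not
necessarily the exact upstream implementation):

    static inline bool bio_full(struct bio *bio, unsigned int len)
    {
            if (bio->bi_vcnt >= bio->bi_max_vecs)
                    return true;            /* no room for another bvec */
            if (bio->bi_iter.bi_size > UINT_MAX - len)
                    return true;            /* bi_size would overflow */
            return false;
    }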
Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 07173c3ec2 ("block: enable multipage bvecs")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'v5.2-rc6' into for-5.3/block
Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
fix. Otherwise we end up having conflicts with future patches between
for-5.3/block and master that touch this area. In particular, it makes
the bio_full() fix hard to backport to stable.
* tag 'v5.2-rc6': (482 commits)
Linux 5.2-rc6
Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
Bluetooth: Fix regression with minimum encryption key size alignment
tcp: refine memory limit test in tcp_fragment()
x86/vdso: Prevent segfaults due to hoisted vclock reads
SUNRPC: Fix a credential refcount leak
Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
net :sunrpc :clnt :Fix xps refcount imbalance on the error path
NFS4: Only set creation opendata if O_CREAT
ARM: 8867/1: vdso: pass --be8 to linker if necessary
KVM: nVMX: reorganize initial steps of vmx_set_nested_state
KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
habanalabs: use u64_to_user_ptr() for reading user pointers
nfsd: replace Jeff by Chuck as nfsd co-maintainer
inet: clear num_timeout reqsk_alloc()
PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
net: mvpp2: debugfs: Add pmap to fs dump
ipv6: Default fib6_type to RTN_UNICAST when not set
net: hns3: Fix inconsistent indenting
net/af_iucv: always register net_device notifier
...
There are no concerns about locking during the selftests so the locks
are not necessary, but following patches will add lockdep assertions to
add_extent_mapping so this is needed in tests too.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function has a lot of return values and specific conventions making
it cumbersome to understand what's returned. Have a go at documenting
its parameters and return values.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently find_first_clear_extent_bit always returns a range whose
starting value is >= passed 'start'. This implicit trimming behavior is
somewhat subtle and an implementation detail.
Instead, this patch modifies the function such that now it always
returns the range which contains passed 'start' and has the given bits
unset. This range could either be due to presence of existing records
which contains 'start' but have the bits unset or because there are no
records that contain the given starting offset.
This patch also adds test cases which cover find_first_clear_extent_bit
since they were missing up until now.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first megabyte on a device housing a btrfs filesystem is exempt from
allocation and trimming. Currently this is not a problem since 'start' is
set to 1M at the beginning of btrfs_trim_free_extents and
find_first_clear_extent_bit always returns a range that is >= start.
However, in a follow up patch find_first_clear_extent_bit will be
changed such that it will return a range containing 'start', and this
range may very well be 0...>=1M, so 'start' needs to be adjusted accordingly.
Future proof the sole user of find_first_clear_extent_bit by setting
'start' after the function is called. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The filename:line format is commonly understood by editors and can be
copy&pasted more easily than the current format.
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_print_data_csum_error() still assumed checksums to be 32 bit in
size. Make it size agnostic.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_csum_data() relies on the crc32c() wrapper around the
crypto framework for calculating the CRCs.
As we have our own crypto_shash structure in the fs_info now, we can
directly call into the crypto framework without going through the wrapper.
This way we can even remove the btrfs_csum_data() and btrfs_csum_final()
wrappers.
The module dependency on crc32c is preserved via MODULE_SOFTDEP("pre:
crc32c"), which was previously provided by LIBCRC32C config option doing
the same.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add boilerplate code for directly including the crypto framework. This
helps us flip the switch for new algorithms.
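A minimal sketch of the kind of boilerplate meant here, assuming a
csum_shash member on fs_info and CRC32C as the algorithm:

    struct crypto_shash *csum_shash;

    csum_shash = crypto_alloc_shash("crc32c", 0, 0);
    if (IS_ERR(csum_shash))
            return PTR_ERR(csum_shash);
    fs_info->csum_shash = csum_shash;

    /* ... and on teardown ... */
    crypto_free_shash(fs_info->csum_shash);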
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have already checked for a valid checksum type before
calling btrfs_check_super_csum(), it can be simplified even further.
While at it get rid of the implicit size assumption of the resulting
checksum as well.
This is a preparation for changing all checksum functionality to use the
crypto layer later.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have factored out the superblock checksum type validation,
we can check for supported superblock checksum types before doing the
actual validation of the superblock read from disk.
This paves the way for further simplifications of
btrfs_check_super_csum() later on.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs only supports CRC32C as the checksumming algorithm. As
this is about to change provide a function to validate the checksum type
in the superblock against all possible algorithms.
This makes adding new algorithms easier as there are fewer places to
adjust when adding new algorithms.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a small helper for btrfs_print_data_csum_error() which formats the
checksum according to its type for pretty printing.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ shorten macro name ]
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS has the implicit assumption that a checksum in compressed_bio is 4
bytes. While this is true for CRC32C, it is not for any other checksum.
Change the data type to be a byte array and adjust loop index calculation
accordingly.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS has the implicit assumption that a checksum in btrfs_ordered_sums
is 4 bytes. While this is true for CRC32C, it is not for any other
checksum.
Change the data type to be a byte array and adjust loop index
calculation accordingly.
This includes moving the adjustment of 'index' by 'ins_size' in
btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum
size, because before this patch the 'sums' member of 'struct
btrfs_ordered_sum' was 4 bytes in size and afterwards it is only one
byte.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The CRC checksum in the free space cache is not dependent on the super
block's csum_type field but always a CRC32C.
So use btrfs_crc32c() and btrfs_crc32c_final() instead of
btrfs_csum_data() and btrfs_csum_final() for computing these checksums.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9678c54388 ("btrfs: Remove custom crc32c init code") removed
the btrfs_crc32c() function, because it was a duplicate of the crc32c()
library function we already have in the kernel.
Resurrect it as a shim wrapper over crc32c() to make following
transformations of the checksumming code in btrfs easier.
Also provide a btrfs_crc32c_final() to ease following transformations.
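A sketch of what such shim wrappers can look like (placement and exact
types are assumptions):

    static inline u32 btrfs_crc32c(u32 crc, const void *address,
                                   unsigned int length)
    {
            return crc32c(crc, address, length);
    }

    static inline void btrfs_crc32c_final(u32 crc, u8 *result)
    {
            put_unaligned_le32(~crc, result);
    }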
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfsic_test_for_metadata() directly calls the crc32c() library function
for calculating the CRC32C checksum, but then uses btrfs_csum_final() to
invert the result.
To ease further refactoring and development around checksumming in BTRFS
convert to calling btrfs_csum_data(), which is a wrapper around
crc32c().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script can cause unexpected fsync failure:
#!/bin/bash
dev=/dev/test/test
mnt=/mnt/btrfs
mkfs.btrfs -f $dev -b 512M > /dev/null
mount $dev $mnt -o nospace_cache
# Prealloc one extent
xfs_io -f -c "falloc 8k 64m" $mnt/file1
# Fill the remaining data space
xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
sync
# Write into the prealloc extent
xfs_io -c "pwrite 1m 16m" $mnt/file1
# Reflink then fsync, fsync would fail due to ENOSPC
xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
umount $dev
The fsync fails with ENOSPC, and the last page of the buffered write is
lost.
[CAUSE]
This is caused by:
- Btrfs' back reference only has extent level granularity
So write into shared extent must be COWed even only part of the extent
is shared.
So for above script we have:
- fallocate
Create a preallocated extent where we can do NOCOW write.
- fill all the remaining data and unallocated space
- buffered write into preallocated space
As we don't have enough space available for data and the extent is not
shared (yet), we fall into NOCOW mode.
- reflink
Now part of the large preallocated extent is shared, later write
into that extent must be COWed.
- fsync triggers writeback
But now the extent is shared and therefore we must fallback into COW
mode, which fails with ENOSPC since there's not enough space to
allocate data extents.
[WORKAROUND]
The workaround is to ensure any buffered write in the related extents
(not just the reflink source range) gets flushed before reflink/dedupe,
so that NOCOW writes that happened before the reflink can still succeed.
The workaround is expensive, we could do it better by only flushing
NOCOW range, but that needs extra accounting for NOCOW range.
For now, fix the possible data loss first.
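The idea behind the workaround, as a minimal sketch (the real patch
flushes a wider range around the affected extents; the function name
here is purely illustrative):
    #include <linux/fs.h>

    /* Sketch: write out and wait for buffered data of an inode before reflinking. */
    static int example_flush_before_reflink(struct inode *inode)
    {
        return filemap_write_and_wait(inode->i_mapping);
    }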
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first thing the code does in check_can_nocow is try to block
concurrent snapshots. If this fails (due to a snapshot already being in
progress), the function returns ENOSPC, which makes no sense. Instead,
return EAGAIN. Despite this return value not being propagated to
callers, it's good practice to return the semantically closest error
code. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case no cached_state argument is passed to
btrfs_lock_and_flush_ordered_range use one locally in the function. This
optimises the case when an ordered extent is found since the unlock
function will be able to unlock that state directly without searching
for it again.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several functions which open code
btrfs_lock_and_flush_ordered_range, so just replace them with a call to
the function. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a certain idiom used in multiple places in btrfs' codebase,
dealing with flushing an ordered range. Factor this in a separate
function that can be reused. Future patches will replace the existing
code with that function.
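The idiom in question looks roughly like the sketch below; every helper
name is a hypothetical stand-in, it only illustrates the
lock/lookup/wait/retry pattern being factored out:
    #include <linux/fs.h>
    #include <linux/types.h>

    /* Hypothetical stand-ins for the real btrfs primitives. */
    struct example_ordered_extent;
    void example_lock_extent(struct inode *inode, u64 start, u64 end);
    void example_unlock_extent(struct inode *inode, u64 start, u64 end);
    struct example_ordered_extent *example_lookup_ordered_range(struct inode *inode,
                                                                u64 start, u64 len);
    void example_wait_ordered_extent(struct example_ordered_extent *ordered);

    /* Sketch of the lock-and-flush-ordered-range pattern described above. */
    static void example_lock_and_flush_ordered_range(struct inode *inode, u64 start, u64 end)
    {
        struct example_ordered_extent *ordered;

        for (;;) {
            example_lock_extent(inode, start, end);
            ordered = example_lookup_ordered_range(inode, start, end - start + 1);
            if (!ordered)
                return;     /* range is clean; leave it locked for the caller */
            example_unlock_extent(inode, start, end);
            example_wait_ordered_extent(ordered);
        }
    }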
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the context of btrfs_run_delalloc_range(), we haven't started/joined
a transaction, thus even something went wrong, we can't and won't abort
transaction, thus no way to make the fs RO.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just add a safety net for btrfs_space_info member updates.
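Such a safety net can be as small as a lockdep assertion in every update
helper; the sketch below uses illustrative field and helper names, not
the actual btrfs code:
    #include <linux/lockdep.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct example_space_info {
        spinlock_t lock;
        u64 bytes_may_use;
    };

    /* Sketch: complain (under lockdep) if counters are updated without the lock. */
    static inline void example_space_info_update(struct example_space_info *sinfo, s64 delta)
    {
        lockdep_assert_held(&sinfo->lock);
        sinfo->bytes_may_use += delta;
    }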
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper lacks the btrfs_ prefix and the parameter is the raw
blockgroup type, so none of the callers has to do the flags -> index
conversion.
Signed-off-by: David Sterba <dsterba@suse.com>
The raid_attr table is now 7 * 56 = 392 bytes long, consisting of just
small numbers so we don't have to use ints. New size is 7 * 32 = 224,
saving 3 cachelines.
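In other words, the entries only need to hold small values, so something
of this shape suffices (sketch; the real structure has more fields and
slightly different names):
    #include <linux/types.h>

    /* Sketch: narrow per-profile attributes from int to u8. */
    struct example_raid_attr {
        u8 sub_stripes;         /* sub-stripes per stripe */
        u8 dev_stripes;         /* stripes written per device */
        u8 devs_max;            /* maximum number of devices */
        u8 devs_min;            /* minimum number of devices */
        u8 tolerated_failures;  /* devices that may be missing */
        u8 devs_increment;      /* devices added per allocation step */
        u8 ncopies;             /* copies of the data */
    };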
Signed-off-by: David Sterba <dsterba@suse.com>
Factor the sequence of ifs into a helper; 'data stripes' here means the
number of stripes without redundancy and parity.
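One plausible shape of such a helper, shown only as a sketch under the
assumption that the parity stripe count and copy count are both known:
    /*
     * Sketch: stripes that carry unique data, i.e. total stripes minus
     * parity for the parity profiles, or total divided by the number of
     * copies for the mirrored ones.
     */
    static int example_nr_data_stripes(int num_stripes, int nr_parity_stripes, int ncopies)
    {
        if (nr_parity_stripes)
            return num_stripes - nr_parity_stripes;
        return num_stripes / ncopies;
    }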
Signed-off-by: David Sterba <dsterba@suse.com>
Replace the open coded list of profiles by selecting them from the
raid_attr table. The criteria are now more explicit: we need profiles
that have more than 1 copy of the data or can reconstruct the data with
a missing device.
Signed-off-by: David Sterba <dsterba@suse.com>
Iterate over the table and gather all allowed profiles for a given
number of devices, instead of open coding.
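Schematically, the table-driven selection looks like this (sketch with
illustrative struct and field names):
    #include <linux/types.h>

    struct example_profile {
        u64 bg_flag;    /* block group flag bit for this profile */
        int devs_min;   /* minimum devices required */
    };

    /* Sketch: gather every profile whose minimum device count is satisfied. */
    static u64 example_allowed_profiles(const struct example_profile *table, int nr,
                                        int num_devices)
    {
        u64 allowed = 0;
        int i;

        for (i = 0; i < nr; i++) {
            if (num_devices >= table[i].devs_min)
                allowed |= table[i].bg_flag;
        }
        return allowed;
    }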
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info::mapping_tree is the physical<->logical mapping tree and uses
the same underlying structure as extents, but is embedded in another
structure. There are no other members and this indirection is useless.
No functional change.
Signed-off-by: David Sterba <dsterba@suse.com>
The minimum number of devices for RAID5 is 2, though this is only a
slightly more expensive RAID1, and for RAID6 it's 3, which is
effectively a triple copy that works with only 3 devices.
mkfs.btrfs allows that and mounting such filesystem also works, so the
conversion via balance filters is inconsistent with the others and we
should not prevent it.
Signed-off-by: David Sterba <dsterba@suse.com>
The list of profiles in btrfs_chunk_max_errors lists DUP as a profile
able to tolerate 1 missing device. Though this profile is special with
2 copies, it still needs the device, unlike the others.
Looking at the history of changes, there's no clear reason why DUP is
there, functions were refactored and blocks of code merged to one
helper.
d20983b40e Btrfs: fix writing data into the seed filesystem
- factor code to a helper
de11cc12df Btrfs: don't pre-allocate btrfs bio
- unrelated change, DUP still in the list with max errors 1
a236aed14c Btrfs: Deal with failed writes in mirrored configurations
- introduced the max errors, leaves DUP and RAID1 in the same group
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This code was first introduced in 5f39d397df ("Btrfs: Create
extent_buffer interface for large blocksizes") and the function was
named btrfs_unlink_trans. It later got renamed to __btrfs_unlink_inode
and finally commit 16cdcec736 ("btrfs: implement delayed inode items
operation") changed the way inodes are deleted and obviated the need for
those two members.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ replace changelog by Nikolay's version ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is a leftover from 312c89fbca ("btrfs: cleanup btrfs_mount()
using btrfs_mount_root()"); the mode was used for opening devices,
which is not done here anymore.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In do_trimming(), block_group->lock should be unlocked first, as locks
should be released in the reverse order they were taken. This does not
cause problems but we should follow best practice.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Under certain conditions, we could have a strange file extent item in
the log tree like:
item 18 key (69599 108 397312) itemoff 15208 itemsize 53
extent data disk bytenr 0 nr 0
extent data offset 0 nr 18446744073709547520 ram 18446744073709547520
The num_bytes and ram_bytes values overflow the 64-bit type.
For the num_bytes part, we can detect such overflow along with the file
offset (key->offset), as file_offset + num_bytes should never go beyond
u64.
For the ram_bytes part, it describes the decompressed size of the
extent and is not directly related to the on-disk size.
In theory it is OK to have a large value, and putting an extra
limitation on ram_bytes may cause unexpected false alerts.
So in tree-checker, we only check whether the file offset and num bytes
overflow.
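The overflow condition itself is a one-liner (sketch):
    #include <linux/types.h>

    /* Sketch: true if file_offset + num_bytes wraps around the u64 range. */
    static bool example_file_extent_end_overflows(u64 file_offset, u64 num_bytes)
    {
        return file_offset + num_bytes < file_offset;
    }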
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is already done in btrfs_init_dev_replace_tgtdev which is the first
phase of device replace, called before doing scrub. During that time
exclusive lock is held. Additionally btrfs_fs_device::commit_total_bytes
is always set based on the size of the underlying block device which
shouldn't change once set. This makes the 2nd assignment of the variable
in the finishing phase redundant.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Part of device replace involves writing an item to the device root
containing information about pending replace operations. Currently space
for this item is not being explicitly reserved, so this works thanks to
the presence of the global reserve. While not fatal, it's not good practice.
Let's be explicit about space requirement of device replace and reserve
space when starting the transaction.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are only 2 branches which goto the leave label with need_unlock
set to true. Essentially need_unlock is used as a substitute for
directly calling up_write. Since only 2 branches need this and their
context is not that big, it's clearer to just call up_write where
required. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_init_dev_replace_tgtdev reads certain values from the source
device (such as commit_total_bytes) which are updated during transaction
commit. Currently this function is called before committing any pending
transaction, leading to possibly reading outdated values.
Fix this by moving the function below the transaction commit; at this
point the EXCL_OP bit is set, hence once the transaction is complete
the total size of the device cannot be changed (it's usually changed by
resize/remove ops, which are blocked).
Fixes: 9e271ae27e ("Btrfs: kernel operation should come after user input has been verified")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This WARN_ON can never trigger because src_device cannot be null.
btrfs_find_device_by_devspec always returns either an error or a valid
pointer to the device. Just remove it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no point in holding btrfs_fs_devices::device_list_mutex
while initialising fields of the not-yet-published device. Instead,
hold the mutex only when the newly initialised device is being
published. I think holding device_list_mutex here is redundant
altogether, because at this point BTRFS_FS_EXCL_OP is set which
prevents device removal/addition/balance/resize from occurring.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using sync_blockdev makes it plain obvious what's happening. No
functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_check_shared looks up parents of a given extent and uses ulists
for that. These are allocated and freed repeatedly. Preallocation in the
caller will avoid the overhead and also allow us to use GFP_KERNEL, as
the allocation happens before the extent locks are taken.
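A sketch of the preallocation in the caller, assuming the existing
btrfs ulist_alloc()/ulist_free() helpers (the wrapper name is
illustrative and error handling is trimmed):
    #include <linux/errno.h>
    #include <linux/slab.h>
    /* also needs fs/btrfs/ulist.h for struct ulist */

    /* Sketch: allocate both ulists with GFP_KERNEL before any extent locks are taken. */
    static int example_prealloc_ulists(struct ulist **roots, struct ulist **tmp)
    {
        *roots = ulist_alloc(GFP_KERNEL);
        *tmp = ulist_alloc(GFP_KERNEL);
        if (!*roots || !*tmp) {
            ulist_free(*roots);     /* ulist_free() tolerates NULL */
            ulist_free(*tmp);
            return -ENOMEM;
        }
        return 0;
    }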
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, there's only a check for a fast crc32c implementation on
X86, based on the CPU flags. This is used to decide if checksumming
should be offloaded to worker threads or can be calculated by the
caller.
There are more architectures that implement a faster version of crc32c
(ARM, SPARC, s390, MIPS, PowerPC), and there are also specialized
hardware cards.
The detection is based on the driver name: all generic C
implementations contain 'generic', while the specialized versions do
not. Alternatively the priority could be used, but this is not
currently provided by the crypto API.
The flag is set per-filesystem at mount time and used for the offloading
decisions.
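The check can be as simple as the sketch below (illustrative; it assumes
the checksum transform is a crypto_shash and keys off the driver name as
described above):
    #include <crypto/hash.h>
    #include <linux/string.h>
    #include <linux/types.h>

    /* Sketch: treat any non-"generic" driver as an accelerated implementation. */
    static bool example_csum_is_accelerated(struct crypto_shash *tfm)
    {
        const char *driver = crypto_tfm_alg_driver_name(crypto_shash_tfm(tfm));

        return !strstr(driver, "generic");
    }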
Signed-off-by: David Sterba <dsterba@suse.com>
Stop using @sign to determine whether we're adding or subtracting.
Even though it only has 3 callers, it's still confusing to use (and has
in fact already caused problems in the past).
Refactor add_pinned_bytes() into add_pinned_bytes() and
sub_pinned_bytes() to explicitly show what we're doing.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of a magic flag for xfs_trans_alloc, just ensure all callers
that can't reclaim through the file system use memalloc_nofs_save to
set the per-task nofs flag.
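For reference, the per-task nofs pattern uses the standard memalloc_nofs
API; the sketch only shows the shape of its use:
    #include <linux/sched/mm.h>

    /* Sketch: everything between save and restore is implicitly GFP_NOFS. */
    static void example_do_nofs_work(void)
    {
        unsigned int nofs_flag;

        nofs_flag = memalloc_nofs_save();
        /* allocate memory / start transactions that must not recurse into the fs */
        memalloc_nofs_restore(nofs_flag);
    }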
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Compare the block layer status directly instead of converting it to
an errno first.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no real problem merging ioends that go beyond i_size into an
ioend that doesn't. We just need to move the append transaction to the
base ioend. Also use the opportunity to use a real error code instead
of the magic 1 to cancel the transactions, and write a comment
explaining the scheme.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The fail argument is long gone, update the comment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we pass pages through an iov_iter we always already have a reference
in the caller. Thus remove ITER_BVEC_FLAG_NO_REF and don't take a
reference to pages by default for bvec backed iov_iters.
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use bio_release_pages instead of duplicating it.
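The duplication being removed is essentially the open-coded loop below
(a sketch using the bvec iteration style of this kernel series, not the
exact caller code):
    #include <linux/bio.h>
    #include <linux/mm.h>

    /* Sketch: the open-coded page release that bio_release_pages() centralizes. */
    static void example_open_coded_release(struct bio *bio)
    {
        struct bio_vec *bvec;
        struct bvec_iter_all iter_all;

        bio_for_each_segment_all(bvec, bio, iter_all)
            put_page(bvec->bv_page);
        /* after the conversion this whole body becomes one bio_release_pages() call */
    }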
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use bio_release_pages instead of duplicating it.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use bio_release_pages instead of duplicating it.
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use bio_release_pages instead of duplicating it.
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Account XArray nodes for the page cache to the appropriate cgroup
(Johannes Weiner)
Fix idr_get_next() when called under the RCU lock (Matthew Wilcox)
Add a test for xa_insert() (Matthew Wilcox)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAl0WuKsACgkQDpNsjXcp
gj73zgf9Eb477PuwYZpFBA9ZxI5v/6WyqbaWXKdqEhotARgIUuv1CfVnkt1IJE6P
Z3QCRABZ3pIKHgIErJN53B7AdvdONUO4Xf9VFBqmxeWE7F9L3sROOpXc8IrR26kV
hITQn8mwgacNQ8mLtQmcSFaCVC2E7yVNBhVd5zmcA6jNIAFsOJcP06KLJTe94OXe
AB9TJvswxpzAEX8emHQ/a1SFBNZWJ7b53hBcu8CJn8CuGDxmo1/+qqoRyNY+WrDO
OohFk2u1j6Esfc6j0k+Akt8mEFyfU2oxFfv5MjL0KYEyrHoU84eZljFGgf7rQqGj
fqH9RO8J8eoj4D/3XaLL5QYRLIxRaw==
=AXZy
-----END PGP SIGNATURE-----
Merge tag 'xarray-5.2-rc6' of git://git.infradead.org/users/willy/linux-dax
Pull XArray fixes from Matthew Wilcox:
- Account XArray nodes for the page cache to the appropriate cgroup
(Johannes Weiner)
- Fix idr_get_next() when called under the RCU lock (Matthew Wilcox)
- Add a test for xa_insert() (Matthew Wilcox)
* tag 'xarray-5.2-rc6' of git://git.infradead.org/users/willy/linux-dax:
XArray tests: Add check_insert
idr: Fix idr_get_next race with idr_remove
mm: fix page cache convergence regression
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
linux/kernel.h: fix overflow for DIV_ROUND_UP_ULL
mm, swap: fix THP swap out
fork,memcg: alloc_thread_stack_node needs to set tsk->stack
MAINTAINERS: add CLANG/LLVM BUILD SUPPORT info
mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning
mm/page_idle.c: fix oops because end_pfn is larger than max_pfn
initramfs: fix populate_initrd_image() section mismatch
mm/oom_kill.c: fix uninitialized oc->constraint
mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge
mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails
signal: remove the wrong signal_pending() check in restore_user_sigmask()
fs/binfmt_flat.c: make load_flat_shared_library() work
mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask
fs/proc/array.c: allow reporting eip/esp for all coredumping threads
mm/dev_pfn: exclude MEMORY_DEVICE_PRIVATE while computing virtual address
Stable bugfixes:
- SUNRPC: Fix up calculation of client message length # 5.1+
- NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O # 4.8+
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl0Wf3EACgkQ18tUv7Cl
QOs2ORAA5/CXFa471jUldOsHejxfFoddFBkuqf8qZ1AF3TZdFuITAsq+xydxfO5U
hYzUUlOTKedEi+ISYLFs1tjU/nYRQJv7fFZxVwq6uDZ53Z/doiMLAIR67Eq7EcTY
KBWA9zdldnBzb0S87+hkbmaNPR5pjqxBzLEfMmOQEAAh5pSGf5YSeUNTXLGj4wBd
iXf25o1VSjUmNpSHaA3KsrqTJ4mJ7+i/17Iny1c4xRgZbJtoTm44DpceHCheJpbl
DymRSgjSr0vFjJbufcKkbF2OPp1ZsnkDiKyJmZzgPOa3+TMGzisU5yiASoac6D+j
gs426yEz9rvR/TMZtFS05nfu2clKuS8foLGwZelJ7XjQSXJgObCb4xf97jLIOWNb
J+BWwsTmUIQS+fMUQDA+rlbyepJ+skVZpbjmUy+/Uy52oqtnYK6uTD469NdmxBwr
7z2pnCUjJFTqo6BHeCQgR5XlSt1MGDByamcVAWONS+9zJttRhfUOjq0PIOLSrsBK
5zRzJxtBoYLwP5py3zKAeV9RcvDNSgh5U6P0hhFRtHfqMUmtGeA58nNND2S6Qm3/
vAB7WZL0aVSvc3zpz7qdctitMESQNspCkMooAp/EoIime3YkqKCS+AgED9jKLhJR
/5eqtr6tehh6A4dshzSlDF7cFrKyUd+ulS0IN8vt1V2TYgOQmbY=
=FATk
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.2-4' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull two more NFS client fixes from Anna Schumaker:
"These are both stable fixes.
One to calculate the correct client message length in the case of
partial transmissions. And the other to set the proper TCP timeout for
flexfiles"
* tag 'nfs-for-5.2-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O
SUNRPC: Fix up calculation of client message length
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl0WLe4THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzizPrB/4tNUS8J9mW9Zd3xLAzZmwjq+WAfCV8
wp3IjBHCgvn9SmTYOJtozjTLJVlmeGNVyrCaWbtzQ2YLKvyBTCUF4kg9EG7FMX9a
ixzlHb2+Wu46LYWiA7jhUnoKNMMl1swm01BOvfmGprSwV70BAEF0i2/D7WHikolX
rgcwGb58vUMmXQ1VGfIO9Pox2a8jaZNj82BZnDniMDxetZ5sRsZXGy43s14zC6Lt
YnwDT70Y7+Pr9SwHMA5bnZ8kCtQpr0qAHmDVhEd965Io1XZ+2/EHF5IwqK0xGg+e
KUQdRyhMWjIGG34SWMt5tbT+9Lzeju4CAka9NPSJ1tRtFnk1AvpILbnB
=TpFR
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.2-rc7' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A small fix for a potential -rc1 regression from Jeff"
* tag 'ceph-for-5.2-rc7' of git://github.com/ceph/ceph-client:
ceph: fix ceph_mdsc_build_path to not stop on first component
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0WM9YQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmkyEADVPjXlIZETBpAl/oK/StNc1NMdfgBiWaX7
kQHbFu3V4soDpvR8iQvMVyFc7dUpwo9lmgxIOcZSfdCf/ciJ/G4trhH4UljXfRsj
2vdKV3rZXragrclN0zGtW90sBBYxSilaezzRQbnnXjEgGaHFkeJJR3xW00UMoGrm
GDO2gSQdhDKqhJtKjiCASkyN9uWMkcLFdsGErPgA6e4S3NTbaLKaY/xFUCcMF7aX
N1aYkIfdyl38QUU/N+5WLgiJYHkiZNqcrJ+a5aECioqqiNh9ST+UR1jCgo7tlt4h
b3Gb5mxP0CPUuTh3VQD8GHCaPzDsxUIxThJkz5aih3M9NEQmm5Du0GDChaDuMoUR
zyFT/Yl4JfeO93mlpxGUyC5WyFCQdj0QOBuyxInCchvJC5kbpRflMuKt+xRYlSqg
331njdykyKkgutagLzzTME38RPUbttZVmbc6K422PXKkYW+FOlS352FZpl5qxDOu
5+ihOXOLvO09VXu6kcC5UH4Yi6nuGYDS95oIZhJ0OODx10xnKSE4ZozlPXAEreAR
NVJN7vbHVqLnphuplRK9Kh0VngdIhLkeTsUxaTnX6UQSioHPDJPqPP5nfSu9Xkyo
e+2UAXkfVjnw45jAu8Mrsu0KhabCB5Pde8Jk+kmqPcuWXQEN5OHqeA09vtvKj81J
lIagz1NZxw==
=WzXj
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190628' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Just two small fixes.
One from Paolo, fixing a silly mistake in BFQ. The other one is from
me, ensuring that we have ->file cleared in the io_uring request a bit
earlier. That avoids a use-before-free, if we encounter an error
before ->file is assigned"
* tag 'for-linus-20190628' of git://git.kernel.dk/linux-block:
block, bfq: fix operator in BFQQ_TOTALLY_SEEKY
io_uring: ensure req->file is cleared on allocation
This is the minimal fix for stable, I'll send cleanups later.
Commit 854a6ed568 ("signal: Add restore_user_sigmask()") introduced
the visible change which breaks user-space: a signal temporarily
unblocked by set_user_sigmask() can be delivered even if the caller
returns success or timeout.
Change restore_user_sigmask() to accept the additional "interrupted"
argument which should be used instead of signal_pending() check, and
update the callers.
Eric said:
: For clarity. I don't think this is required by posix, or fundamentally to
: remove the races in select. It is what linux has always done and we have
: applications who care so I agree this fix is needed.
:
: Further in any case where the semantic change that this patch rolls back
: (aka where allowing a signal to be delivered and the select like call to
: complete) would be advantage we can do as well if not better by using
: signalfd.
:
: Michael is there any chance we can get this guarantee of the linux
: implementation of pselect and friends clearly documented. The guarantee
: that if the system call completes successfully we are guaranteed that no
: signal that is unblocked by using sigmask will be delivered?
Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
Fixes: 854a6ed568 ("signal: Add restore_user_sigmask()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Eric Wong <e@80x24.org>
Tested-by: Eric Wong <e@80x24.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
load_flat_shared_library() is broken: It only calls load_flat_file() if
prepare_binprm() returns zero, but prepare_binprm() returns the number of
bytes read - so this only happens if the file is empty.
Instead, call into load_flat_file() if the number of bytes read is
non-negative. (Even if the number of bytes is zero - in that case,
load_flat_file() will see nullbytes and return a nice -ENOEXEC.)
In addition, remove the code related to bprm creds and stop using
prepare_binprm() - this code is loading a library, not a main executable,
and it only actually uses the members "buf", "file" and "filename" of the
linux_binprm struct. Instead, call kernel_read() directly.
Link: http://lkml.kernel.org/r/20190524201817.16509-1-jannh@google.com
Fixes: 287980e49f ("remove lots of IS_ERR_VALUE abuses")
Signed-off-by: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0a1eb2d474 ("fs/proc: Stop reporting eip and esp in /proc/PID/stat")
stopped reporting eip/esp and fd7d56270b ("fs/proc: Report eip/esp in
/prod/PID/stat for coredumping") reintroduced the feature to fix a
regression with userspace core dump handlers (such as minicoredumper).
Because PF_DUMPCORE is only set for the primary thread, this didn't fix
the original problem for secondary threads. Allow reporting the eip/esp
for all threads by checking for PF_EXITING as well. This is set for all
the other threads when they are killed. coredump_wait() waits for all the
tasks to become inactive before proceeding to invoke a core dumper.
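The resulting check is small (sketch; PF_EXITING and PF_DUMPCORE are the
existing task flags mentioned above, the helper name is illustrative):
    #include <linux/sched.h>
    #include <linux/types.h>

    /* Sketch: report eip/esp only for coredumping or already-exiting threads. */
    static bool example_may_report_eip_esp(struct task_struct *task, bool permitted)
    {
        return permitted && (task->flags & (PF_EXITING | PF_DUMPCORE));
    }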
Link: http://lkml.kernel.org/r/87y32p7i7a.fsf@linutronix.de
Link: http://lkml.kernel.org/r/20190522161614.628-1-jlu@pengutronix.de
Fixes: fd7d56270b ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reported-by: Jan Luebbe <jlu@pengutronix.de>
Tested-by: Jan Luebbe <jlu@pengutronix.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Properly allocate the space for the bio_vecs instead of just one byte
per bio_vec.
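That is, size the array by the vector count rather than by bytes;
schematically (a sketch, not the exact xfs allocation call used):
    #include <linux/bio.h>
    #include <linux/slab.h>

    /* Sketch: allocate nr_bvecs bio_vec entries, not nr_bvecs bytes. */
    static struct bio_vec *example_alloc_bvecs(unsigned int nr_bvecs)
    {
        return kmalloc_array(nr_bvecs, sizeof(struct bio_vec), GFP_NOFS);
    }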
Fixes: 79b54d9bfc ("xfs: use bios directly to write log buffers")
Reported-by: syzbot+b75afdbe271a0d7ac4f6@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There are many, many xfs header files which are included but
unneeded (or included twice) in the xfs code, so remove them.
nb: xfs_linux.h includes about 9 headers for everyone, so those
explicit includes get removed by this. I'm not sure what the
preference is, but if we wanted explicit includes everywhere,
a followup patch could remove those xfs_*.h includes from
xfs_linux.h and move them into the files that need them.
Or it could be left as-is.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Link every newly allocated writeback bio to cgroup pointed to by the
writeback control structure, and charge every byte written back to it.
Tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move setting up operation and write hint to xfs_alloc_ioend, and
then just copy over all needed information from the previous bio
in xfs_chain_bio and stop passing various parameters to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we're writing out a fresh new AG, make sure that we don't list an
internal log as free and that we create the rmap for the region. growfs
never does this, but we will need it when we hook up mkfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Refactor the code that populates the free space btrees of a new AG so
that we can avoid code duplication once things start getting
complicated.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_alloc_ag_vextent_small() doesn't update the output parameters in
the event of an AGFL allocation. Instead, it updates the
xfs_alloc_arg structure directly to complete the allocation.
Update both args and the output params to provide consistent
behavior for future callers.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The small allocation helper is implemented in a way that is fairly
tightly integrated to the existing allocation algorithms. It expects
a cntbt cursor beyond the end of the tree, attempts to locate the
last record in the tree and only attempts an AGFL allocation if the
cntbt is empty.
The upcoming generic algorithm doesn't rely on the cntbt processing
of this function. It will only call this function when the cntbt
doesn't have a big enough extent or is empty and thus AGFL
allocation is the only remaining option. Tweak
xfs_alloc_ag_vextent_small() to handle a NULL cntbt cursor and skip
the cntbt logic. This facilitates use by the existing allocation
code and new code that only requires an AGFL allocation attempt.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the small allocation helper further up in the file to avoid the
need for a function declaration. The remaining declarations will be
removed by followup patches. No functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_alloc_ag_vextent_small() is kind of a mess. Clean it up in
preparation for future changes. No functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Keep all bmap item related code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Keep all rmap item related code together in one file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Keep all the refcount item related code together in one file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Keep all the extent free item related code together in one file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no good reason to keep these two functions separate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no good reason to keep these two functions separate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no good reason to keep these two functions separate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no good reason to keep these two functions separate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Replace the hand grown linked list handling and cil context attachment
with the standard list_head structure.
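For reference, the standard pattern this moves to (a generic list_head
sketch, not the xfs code itself):
    #include <linux/list.h>

    struct example_ctx {
        struct list_head committing;    /* INIT_LIST_HEAD() at setup time */
    };

    struct example_item {
        struct list_head list;
    };

    /* Sketch: attach an item and walk the list with the stock helpers. */
    static void example_attach(struct example_ctx *ctx, struct example_item *item)
    {
        list_add_tail(&item->list, &ctx->committing);
    }

    static void example_walk(struct example_ctx *ctx)
    {
        struct example_item *item;

        list_for_each_entry(item, &ctx->committing, list) {
            /* process item */
        }
    }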
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The cast is not type safe, and we can just dereference the first
member instead to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We have various items that are released from ->iop_committing. Add a
flag to just call ->iop_release from the commit path to avoid tons
of boilerplate code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The iop_unlock method is called when committing or cancelling a
transaction. In the latter case, the transaction may or may not be
aborted. While there is no known problem with the current code in
practice, this implementation is limited in that any log item
implementation that might want to differentiate between a commit and a
cancellation must rely on the aborted state. The aborted bit is only
set when the cancelled transaction is dirty, however. This means that
there is no way to distinguish between a commit and a clean transaction
cancellation.
For example, intent log items currently rely on this distinction. The
log item is either transferred to the CIL on commit or released on
transaction cancel. There is currently no possibility for a clean intent
log item in a transaction, but if that state is ever introduced a cancel
of such a transaction will immediately result in memory leaks of the
associated log item(s). This is an interface deficiency and landmine.
To clean this up, replace the iop_unlock method with an iop_release
method that is specific to transaction cancel. The existing
iop_committing method occurs at the same time as iop_unlock in the
commit path and there is no need for two separate callbacks here.
Overload the iop_committing method with the current commit time
iop_unlock implementations to eliminate the need for the latter and
further simplify the interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While committing items looks very similar to freeing them on error, it is
a different operation, and they will diverge a bit soon.
Split out the commit case from xfs_trans_free_items, inline it into
xfs_log_commit_cil and give it a separate trace point.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This method should never be called, so don't waste code on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Just check if they are present first.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Just pass a straight bool aborted instead of abusing XFS_LI_ABORTED as a
flag in function parameters.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We need to derive the mount pointer from a buffer in a lot of places.
Add a direct pointer to short cut the pointer chasing.
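So instead of chasing bp->b_target->bt_mount everywhere, a cached
pointer is used; a sketch of the shortcut (requires the xfs internal
headers, the helper name is illustrative):
    /* Sketch: return the mount a buffer belongs to via the new direct pointer. */
    static inline struct xfs_mount *example_buf_mount(struct xfs_buf *bp)
    {
        return bp->b_mount;     /* previously derived as bp->b_target->bt_mount */
    }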
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This field is now always identical to b_length.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that the log code doesn't abuse this field any more we can
declare it as a struct xfs_buf_log_item pointer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that the log code uses bios directly we can drop various special
cases in the buffer cache code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we don't use struct xfs_buf to hold log recovery buffers,
rename the related functions and variables to just talk about a buffer
instead of using the bp name that we usually use for xfs_buf related
functionality.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xfs_buf structure is basically used as a glorified container for
a memory allocation in the log recovery code. Replace it with a
call to kmem_alloc_large and a simple abstraction to read into or
write from it synchronously using chained bios.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This simplifies both the helper and the callers. We lost a bit of
size sanity checking, but that is already covered by KASAN if needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the workqueue used for log I/O completions from struct xfs_mount
to struct xlog to keep it self contained in the log code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: destroy the log workqueue after ensuring log ios are done]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently the XFS logging code uses the xfs_buf structure and
associated APIs to write the log buffers to disk. This requires
various special cases in the log code and is generally not very
optimal.
Instead of using a buffer, just allocate a kmem_alloc_large region for
each log buffer, and use a bio and bio_vec array embedded in the iclog
structure to write the buffer to disk. This also allows for using
the bio split and chaining case to deal with the case of a log
buffer wrapping around the end of the log.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: don't split if/else with an #endif]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the slightly shorter way to get at the buftarg for the log device
wherever we can in the log and log recovery code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The only caller unconditionally passes true here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Just a small bit of code tidying up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Split out another self-contained bit of code from xlog_sync.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Split out a self-contained chunk of code from xlog_sync that calculates
the split offset for an iclog that wraps the log end and bumps the
cycles for the second half.
Use the chance to bring some sanity to the variables used to track the
split in xlog_sync: don't change the count variable, instead use split
as the offset of the split, and use those to calculate the sizes and
offsets for the two write buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Replace the not very useful xlog_bdstrat wrapper with a new version
that takes care of all the common logic for writing log buffers. Use
the opportunity to avoid overloading the buffer address with the log
relative address, and to shed the unused return value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we have to split a log write because it wraps the end of the log we
can't just use REQ_PREFLUSH to flush before the first log write,
as the writes might get reordered somewhere in the I/O stack. Issue
a manual flush in that case so that the ordering of the two log I/Os
doesn't matter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This value is the only flag in ic_state, which we otherwise use as
a state. Switch it to a new debug-only field and also report an
actual error in the buffer in the I/O completion path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reformat xlog_get_lowest_lsn to our usual style.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We don't really need all the messy branches in the function, as it
really does three things, out of which 2 are common for all branches:
1) set up mount point log buffer size and count values if not already
done from mount options
2) calculate the number of log headers
3) set up all the values in struct xlog based on the above
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This field is never used, so we can simply kill it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rename the function to kmem_to_page and move it to kmem.h together
with our kmem_alloc_large allocator that may return either kmalloced or
vmalloced pages.
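Conceptually the helper just distinguishes vmalloc from kmalloc memory;
a sketch of the idea:
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    /* Sketch: map either a kmalloc'ed or vmalloc'ed address to its backing page. */
    static struct page *example_kmem_to_page(void *addr)
    {
        if (is_vmalloc_addr(addr))
            return vmalloc_to_page(addr);
        return virt_to_page(addr);
    }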
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Assigning a numerical value that is not close to the flags
defined nearby is just asking for conflicts later on.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inode geometry structure isn't related to ondisk format; it's
support for the mount structure. Move it to xfs_shared.h.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fix a typo where we're confusing the default TCP retrans value
(NFS_DEF_TCP_RETRANS) for the default TCP timeout value.
Fixes: 15d03055cf ("pNFS/flexfiles: Set reasonable default ...")
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We never parsed/returned any data from .get_link() when the object is a Windows reparse point
containing a symlink. This results in the VFS layer oopsing while accessing an uninitialized buffer:
...
[ 171.407172] Call Trace:
[ 171.408039] readlink_copy+0x29/0x70
[ 171.408872] vfs_readlink+0xc1/0x1f0
[ 171.409709] ? readlink_copy+0x70/0x70
[ 171.410565] ? simple_attr_release+0x30/0x30
[ 171.411446] ? getname_flags+0x105/0x2a0
[ 171.412231] do_readlinkat+0x1b7/0x1e0
[ 171.412938] ? __ia32_compat_sys_newfstat+0x30/0x30
...
Fix this by adding code to handle these buffers and make sure we do return a valid buffer
to .get_link().
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The new route handling in ip_mc_finish_output() from 'net' overlapped
with the new support for returning congestion notifications from BPF
programs.
In order to handle this I had to take the dev_loopback_xmit() calls
out of the switch statement.
The aquantia driver conflicts were simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE7btrcuORLb1XUhEwjrBW1T7ssS0FAl0UnRoACgkQjrBW1T7s
sS1T0w/+PFooDZNaKJkhJCGm0XyRDYmmuivEX9ydUR1x9/doRbDZTqfjsQBJLoVK
PulxDiuFbQWXzhBJFEMuU6YBR2fjFqUGsXz5qAXPB0zaahWcSY/0Y8VCU/PKq7A6
3oJPl/lYwYkLTYUKsnN08hByosUA7WeRQRAxbSFWdCTlUfIw72mDhprMGjJIVAlu
snLA5lUoy7hyoFdXR5qNhYAcX8sASmi01hXhdnsKMOv4z2Vb5NoQsgqL1W8tAnsf
BdJKL82Qd7vWQahlbOtur46aeJAL2ukGSTskuA2jOQqsKxmpos+hWq36gToq7usa
XgPii0Rz7/2s6ZvhmxV5kmzqHylT9giU1DxWybSVo9IZBsU2i1o9DV+yBY50tr45
s0bmpSA/u4DP2uT8oRvh47LbDqiQFA8dyVWQKE25smSdjekuZHTO0tgXf8mwC2CW
hDci4z+ONOyqIQyFrhP7UaKuSK6tAAUbYKtXUIN6rnuq1FjuTA2+wtIlOPZuDZQ2
yrsSUefh4/sFMBSAgoGTg9f+PiCejBMKcxoqhU2/27mvkiInAyDPfoc4oGQcinOy
OVX3B0A8B88l26sDkWdv15d92E1GKzZLj8h66TlYwDpN+seevftKtpblZ9fJWsSf
0NejoMV/GcA/KsAp1sxqWwouRob8H6pbGXWb97DYRA2IVyjK3q4=
=APjA
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190627' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux
Pull pidfd fixes from Christian Brauner:
"Userspace tools and libraries such as strace or glibc need a cheap and
reliable way to tell whether CLONE_PIDFD is supported. The easiest way
is to pass an invalid fd value in the return argument, perform the
syscall and verify the value in the return argument has been changed
to a valid fd.
However, if CLONE_PIDFD is specified we currently check if pidfd == 0
and return EINVAL if not.
The check for pidfd == 0 was originally added to enable us to abuse
the return argument for passing additional flags along with
CLONE_PIDFD in the future.
However, extending legacy clone this way would be a terrible idea and
with clone3 on the horizon and the ability to reuse CLONE_DETACHED
with CLONE_PIDFD there's no real need for this clutch. So remove the
pidfd == 0 check and help userspace out.
Also, according to Al, anon_inode_getfd() should only be used past the
point of no failure and ksys_close() should not be used at all since
it is far too easy to get wrong. Al's motto being "basically, once
it's in descriptor table, it's out of your control". So Al's patch
switches back to what we already had in v1 of the original patchset
and uses an anon_inode_getfile() + put_user() + fd_install() sequence
in the success path and a fput() + put_unused_fd() in the failure
path.
The other two changes should be trivial"
* tag 'for-linus-20190627' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
proc: remove useless d_is_dir() check
copy_process(): don't use ksys_close() on cleanups
samples: make pidfd-metadata fail gracefully on older kernels
fork: don't check parent_tidptr with CLONE_PIDFD
-----BEGIN PGP SIGNATURE-----
iQIVAwUAXRMn5vu3V2unywtrAQICpA/+IIINk6MJVQDzGhOnvWrbGdPnOdJEUyLN
B9U4bLZJRg/j+Sqodn+fXIfsEO4FQflkSJD+xoBi4pzBZcr0xkLUVOog/1S7dv4J
bPVT9p2f3ITNiatmisOrUe1InuHa6Wb/cUnQaLLRhd7NqbawKGRQG4tv4CGwKn67
dJIOOm/iTCs1ACES4C5QOpU7/DWK38Pn3BbnN21bFzDgfbtbdDTaFFkhFtXy78oB
Gcj5g+ULpkKBcuJThFuJUPZ9E4qICNZR4kJXEULSvykDDRzluhJmQ+v8btm6NJsq
hMqTrT9M2y114V1OqXj3me7tA6wOEAfTQ0WzpzF2SmyFQKnSly/EkWc4HZXFD/8O
BczCcABUbuKNE/pJSELx6k1M0+00QfeLcjHPc6joZFCni3lMdYWOncn/syyHw5P+
rc9JQsy3+dLcFsaVQ5eGmX6NDc70dCrAlS6MllIzSBcwAVCctTKwm0meaSW6B2y6
VymPy+cqi1RxMKyiQ0hAeU7Xe6yqFcl6rtonfCQqRLxkfzrCXkDp6/ELOXBzDft1
ey6+N3WsmWW7YSPuM/SIZKV66rshlflj0w+FRluZEEAF1NYeYqXUDvK/S8KC9kPG
AXUDvhI+tBpxg1AVz94JN714VmkbY23xV0g44eQsdqSQm2YvsxiFCSWZZ6L/KEWe
kWQc6BGDCB0=
=YTdG
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20190620' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"The in-kernel AFS client has been undergoing testing on opendev.org on
one of their mirror machines. They are using AFS to hold data that is
then served via apache, and Ian Wienand had reported seeing oopses,
spontaneous machine reboots and updates to volumes going missing. This
patch series appears to have fixed the problem, very probably due to
patch (2), but it's not 100% certain.
(1) Fix the printing of the "vnode modified" warning to exclude checks
on files for which we don't have a callback promise from the
server (and so don't expect the server to tell us when it
changes).
Without this, for every file or directory for which we still have
an in-core inode that gets changed on the server, we may get a
message logged when we next look at it. This can happen in bulk
if, for instance, someone does "vos release" to update a R/O
volume from a R/W volume and a whole set of files are all changed
together.
We only really want to log a message if the file changed and the
server didn't tell us about it or we failed to track the state
internally.
(2) Fix accidental corruption of either afs_vlserver struct objects or
the following memory locations (which could hold anything).
The issue is caused by a union that points to two different
structs in struct afs_call (to save space in the struct). The call
cleanup code assumes that it can simply call the cleanup for one
of those structs if not NULL - when it might be actually pointing
to the other struct.
This means that every Volume Location RPC op is going to corrupt
something.
(3) Fix an uninitialised spinlock. This isn't too bad, it just causes
a one-off warning if lockdep is enabled when "vos release" is
called, but the spinlock still behaves correctly.
(4) Fix the setting of i_block in the inode. This causes du, for
example, to produce incorrect results, but otherwise should not be
dangerous to the kernel"
* tag 'afs-fixes-20190620' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix setting of i_blocks
afs: Fix uninitialised spinlock afs_volume::cb_break_lock
afs: Fix vlserver record corruption
afs: Fix over zealous "vnode modified" warnings
When we truncate a short write to have it retried, pass the truncated
length to the page_done callback instead of the full length.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This effectively reverts a6d639da63 ("fs: factor out a
__generic_write_end helper") as we now open code what is left of that
helper in iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Marking the inode dirty for each page copied into the page cache can be
very inefficient for file systems that use the VFS dirty inode tracking,
and is completely pointless for those that don't use the VFS dirty inode
tracking. So instead, only set an iomap flag when changing the in-core
inode size, and open code the rest of __generic_write_end.
Partially based on code from Christoph Hellwig.
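The heart of the change is conceptually tiny; a sketch under the
assumption that the caller translates the return value into an iomap
"size changed" flag (names here are illustrative):
    #include <linux/fs.h>

    /* Sketch: the inode only needs to be marked when the in-core i_size grows. */
    static bool example_write_end_grew_isize(struct inode *inode, loff_t pos,
                                             unsigned int copied)
    {
        if (pos + copied <= i_size_read(inode))
            return false;
        i_size_write(inode, pos + copied);
        return true;    /* caller sets the size-changed flag / dirties the inode */
    }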
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Replace the uid/gid/perm permissions checking on a key with an ACL to allow
the SETATTR and SEARCH permissions to be split. This will also allow a
greater range of subjects to be represented.
============
WHY DO THIS?
============
The problem is that SETATTR and SEARCH cover a slew of actions, not all of
which should be grouped together.
For SETATTR, this includes actions that are about controlling access to a
key:
(1) Changing a key's ownership.
(2) Changing a key's security information.
(3) Setting a keyring's restriction.
And actions that are about managing a key's lifetime:
(4) Setting an expiry time.
(5) Revoking a key.
and (proposed) managing a key as part of a cache:
(6) Invalidating a key.
Managing a key's lifetime doesn't really have anything to do with
controlling access to that key.
Expiry time is awkward since it's more about the lifetime of the content
and so, in some ways goes better with WRITE permission. It can, however,
be set unconditionally by a process with an appropriate authorisation token
for instantiating a key, and can also be set by the key type driver when a
key is instantiated, so lumping it with the access-controlling actions is
probably okay.
As for SEARCH permission, that currently covers:
(1) Finding keys in a keyring tree during a search.
(2) Permitting keyrings to be joined.
(3) Invalidation.
But these don't really belong together either, since these actions really
need to be controlled separately.
Finally, there are a number of special cases to do with granting the
administrator special rights to invalidate or clear keys that I would like
to handle with the ACL rather than key flags and special checks.
===============
WHAT IS CHANGED
===============
The SETATTR permission is split to create two new permissions:
(1) SET_SECURITY - which allows the key's owner, group and ACL to be
changed and a restriction to be placed on a keyring.
(2) REVOKE - which allows a key to be revoked.
The SEARCH permission is split to create:
(1) SEARCH - which allows a keyring to be searched and a key to be found.
(2) JOIN - which allows a keyring to be joined as a session keyring.
(3) INVAL - which allows a key to be invalidated.
The WRITE permission is also split to create:
(1) WRITE - which allows a key's content to be altered and links to be
added, removed and replaced in a keyring.
(2) CLEAR - which allows a keyring to be cleared completely. This is
split out to make it possible to give just this to an administrator.
(3) REVOKE - see above.
Keys acquire ACLs which consist of a series of ACEs, and all that apply are
unioned together. An ACE specifies a subject, such as:
(*) Possessor - permitted to anyone who 'possesses' a key
(*) Owner - permitted to the key owner
(*) Group - permitted to the key group
(*) Everyone - permitted to everyone
Note that 'Other' has been replaced with 'Everyone' on the assumption that
you wouldn't grant a permit to 'Other' that you wouldn't also grant to
everyone else.
Further subjects may be made available by later patches.
The ACE also specifies a permissions mask. The set of permissions is now:
VIEW Can view the key metadata
READ Can read the key content
WRITE Can update/modify the key content
SEARCH Can find the key by searching/requesting
LINK Can make a link to the key
SET_SECURITY Can change owner, ACL, expiry
INVAL Can invalidate
REVOKE Can revoke
JOIN Can join this keyring
CLEAR Can clear this keyring
The KEYCTL_SETPERM function is then deprecated.
The KEYCTL_SET_TIMEOUT function is then permitted if SET_SECURITY is set,
or if the caller has a valid instantiation auth token.
The KEYCTL_INVALIDATE function then requires INVAL.
The KEYCTL_REVOKE function then requires REVOKE.
The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an
existing keyring.
The JOIN permission is enabled by default for session keyrings and manually
created keyrings only.
======================
BACKWARD COMPATIBILITY
======================
To maintain backward compatibility, KEYCTL_SETPERM will translate the
permissions mask it is given into a new ACL for a key - unless
KEYCTL_SET_ACL has been called on that key, in which case an error will be
returned.
It will convert possessor, owner, group and other permissions into separate
ACEs, if each portion of the mask is non-zero.
SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY. WRITE
permission turns on WRITE, REVOKE and, if a keyring, CLEAR. JOIN is turned
on if a keyring is being altered.
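Expressed as a hedged sketch of just that mapping (all identifiers below are
illustrative stand-ins, not the kernel's constants, and a real translation
would emit one such ACE per non-zero possessor/owner/group/other portion of
the mask):

#include <stdbool.h>

/* Old-style permission bits for one subject (illustrative values). */
enum { EX_VIEW = 0x01, EX_READ = 0x02, EX_WRITE = 0x04,
       EX_SEARCH = 0x08, EX_LINK = 0x10, EX_SETATTR = 0x20 };

/* New ACE permission bits (illustrative values). */
enum { ACE_VIEW = 0x001, ACE_READ = 0x002, ACE_WRITE = 0x004,
       ACE_SEARCH = 0x008, ACE_LINK = 0x010, ACE_SET_SECURITY = 0x020,
       ACE_INVAL = 0x040, ACE_REVOKE = 0x080, ACE_JOIN = 0x100,
       ACE_CLEAR = 0x200 };

static unsigned int old_mask_to_ace_perms(unsigned int old, bool is_keyring)
{
        unsigned int perm = 0;

        if (old & EX_VIEW)      perm |= ACE_VIEW;
        if (old & EX_READ)      perm |= ACE_READ;
        if (old & EX_SEARCH)    perm |= ACE_SEARCH;
        if (old & EX_LINK)      perm |= ACE_LINK;
        /* SETATTR fans out into the new access-control permissions. */
        if (old & EX_SETATTR)
                perm |= ACE_INVAL | ACE_REVOKE | ACE_SET_SECURITY;
        /* WRITE also implies REVOKE, plus CLEAR for keyrings. */
        if (old & EX_WRITE) {
                perm |= ACE_WRITE | ACE_REVOKE;
                if (is_keyring)
                        perm |= ACE_CLEAR;
        }
        /* JOIN is turned on when a keyring is being altered. */
        if (is_keyring)
                perm |= ACE_JOIN;
        return perm;
}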
The KEYCTL_DESCRIBE function translates the ACL back into a permissions
mask to return depending on possessor, owner, group and everyone ACEs.
It will make the following mappings:
(1) INVAL, JOIN -> SEARCH
(2) SET_SECURITY -> SETATTR
(3) REVOKE -> WRITE if SETATTR isn't already set
(4) CLEAR -> WRITE
Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
the value set with KEYCTL_SETPERM.
=======
TESTING
=======
This passes the keyutils testsuite for all but a couple of tests:
(1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now
returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed
if the type doesn't have ->read(). You still can't actually read the
key.
(2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't
work as Other has been replaced with Everyone in the ACL.
Signed-off-by: David Howells <dhowells@redhat.com>
This patch replaces a few leftover printk errors with calls to
fs_info and similar, so that the file system having the error is
properly logged.
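For example, a hedged before/after (the message text is made up; fs_err and
fs_info are gfs2's existing helpers that take the super-block descriptor so
the output carries the filesystem id, and the gfs2-internal headers are
assumed to be available):

#include "incore.h"     /* struct gfs2_sbd, within fs/gfs2 */
#include "util.h"       /* fs_err(), fs_info() */

static void example_report(struct gfs2_sbd *sdp, int error)
{
        /* Before: plain printk, no hint of which mount hit the problem. */
        printk(KERN_ERR "GFS2: error making file system RO: %d\n", error);

        /* After: the message is prefixed with this file system's id. */
        fs_err(sdp, "error making file system RO: %d\n", error);
}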
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, if a glock error was encountered, the glock with
the problem was dumped. But sometimes you may have lots of file systems
mounted, and that doesn't tell you which file system it was for.
This patch adds a new boolean parameter fsid to the dump_glock family
of functions. For non-error cases, such as dumping the glocks debugfs
file, the fsid is not dumped in order to keep lock dumps and glocktop
as clean as possible. For all error cases, such as GLOCK_BUG_ON, the
file system id is now printed. This will make it easier to debug.
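A hedged sketch of the resulting interface (prototype assumed; the
gfs2-internal glock type is taken as given):

#include <linux/seq_file.h>
#include "incore.h"     /* struct gfs2_glock, within fs/gfs2 */

/* Assumed shape of the updated dumper. */
void gfs2_dump_glock(struct seq_file *seq, struct gfs2_glock *gl, bool fsid);

static void example_glock_error(struct gfs2_glock *gl)
{
        /* Error case: include the file system id in the dump. */
        gfs2_dump_glock(NULL, gl, true);
}

static void example_debugfs_dump(struct seq_file *seq, struct gfs2_glock *gl)
{
        /* Normal debugfs/glocktop output: keep it clean, no fsid prefix. */
        gfs2_dump_glock(seq, gl, false);
}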
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_freeze had a case statement that simply checked the
error code, but the break statements just made the logic hard to
read. This patch simplifies the logic in favor of a simple if.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, the superblock flag indicating when a file system
is withdrawn was called SDF_SHUTDOWN. This patch simply renames it to
the more obvious SDF_WITHDRAWN.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch adds some instrumentation in gfs2's journal replay that
indicates when we're about to overwrite a rgrp for which we already
have a valid buffer_head.
When this problem occurs, it's a situation in which this node has
been granted a rgrp glock and subsequently read in buffer_heads for
it, and possibly even made changes to the rgrp bits and/or
allocation values. But now another node has failed and forced us to
replay its journal, and that journal contains a copy of the same
rgrp, without a revoke, which means we're about to overwrite a
rgrp that we now rightfully own, with an obsolete copy. That is
always a problem. It means the other node (which failed and left
its journal to be replayed) failed to flush out its rgrp buffers,
write out the revoke, and invalidate its copy before it released
the glock to our possession.
No node should ever release a glock until its metadata has been
written to the journal and revoked and invalidated.
We also kludge around the problem and refuse to replace our good
copy with the journal's bad copy by not marking the buffer dirty,
but never do it silently. That's wallpapering over a larger problem
that still exists. IOW, if this situation can happen to this node,
it can also happen to a different node and we wouldn't even know it
or be able to circumvent it: Suppose we have a 3-node cluster:
Node 1 fails, leaving an obsolete rgrp block in its journal without
a revoke. Node 2 grabs the rgrp as soon as the rgrp glock is
released and starts making changes, allocating and freeing blocks
from the rgrp, etc. Node 3 replays the journal from node 1,
oblivious and unaware that it's about to overwrite node 2's changes.
So we still need to be vocal and log the error to make it apparent
that a corruption path still exists in gfs2.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a journal is replayed, gfs2 logs a message similar to:
jid=X: Replaying journal...
This patch adds the tail and block number so that the range of the
replayed block is also printed. These values will match the values
shown if the journal is dumped with gfs2_edit -p journalX. The
resulting output looks something like this:
jid=1: Replaying journal...0x28b7 to 0x2beb
This will allow us to better debug file system corruption problems.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
For its journal processing, gfs2 kept track of the number of buffers
added and removed on a per-transaction basis. These values are used
to calculate space needed in the journal. But while these calculations
make sense for the number of buffers, they make no sense for revokes.
Revokes are managed in their own list, linked from the superblock.
So it's entirely unnecessary to keep separate per-transaction counts
for revokes added and removed. A single count will do the same job.
Therefore, this patch combines the transaction revokes into a single
count.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, gfs2 saved the pointers to the two daemon threads
(logd and quotad) in the superblock, but they were never cleared,
even if the threads were stopped (e.g. on remount -o ro). That meant
that certain error conditions (like a withdrawn file system) could
race. For example, xfstests generic/361 caused an IO error during
remount -o ro, which caused the kthreads to be stopped, then the
error flagged. Later, when the test unmounted the file system, it
would try to stop the threads a second time with kthread_stop.
This patch does two things: First, every time it stops a daemon
thread, it zeroes out the corresponding thread pointer, and it checks
whether the pointer is NULL before trying to stop the thread. Second,
gfs2_remount_fs used to return early if gfs2_make_fs_ro or
gfs2_make_fs_rw reported an error, which caused it to bypass the
online uevent at the bottom of the function. This patch removes that
early return in favor of running the whole function and then
returning the error.
That way, unmounts and remounts won't hang forever.
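A minimal sketch of the first change (field names assumed from gfs2's
super-block descriptor):

#include <linux/kthread.h>
#include "incore.h"     /* struct gfs2_sbd, within fs/gfs2 */

static void example_stop_daemons(struct gfs2_sbd *sdp)
{
        /* Clear each pointer when the thread is stopped so that a second
         * stop attempt (e.g. withdraw followed by unmount) is a no-op. */
        if (sdp->sd_quotad_process) {
                kthread_stop(sdp->sd_quotad_process);
                sdp->sd_quotad_process = NULL;
        }
        if (sdp->sd_logd_process) {
                kthread_stop(sdp->sd_logd_process);
                sdp->sd_logd_process = NULL;
        }
}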
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Use IS_ERR_OR_NULL where appropriate.
(Several more places converted by Andreas.)
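A minimal before/after of the pattern being cleaned up:

#include <linux/err.h>

static bool example_lookup_failed(void *ptr)
{
        /* Before: two separate checks. */
        /*   if (!ptr || IS_ERR(ptr)) return true; */

        /* After: one helper says the same thing. */
        return IS_ERR_OR_NULL(ptr);
}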
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add a free_sbd function for freeing a struct gfs2_sbd. Use that for
freeing a super-block descriptor, either directly or via kobject_put.
Free sd_lkstats inside the kobject release function: that way,
gfs2_put_super will no longer leak sd_lkstats.
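Roughly, the new helper looks like this (a sketch; the sd_lkstats field name
is assumed):

#include <linux/percpu.h>
#include <linux/slab.h>
#include "incore.h"     /* struct gfs2_sbd, within fs/gfs2 */

void free_sbd(struct gfs2_sbd *sdp)
{
        if (sdp->sd_lkstats)
                free_percpu(sdp->sd_lkstats);
        kfree(sdp);
}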
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
fscrypt only uses SHA-256 for AES-128-CBC-ESSIV, which isn't the default
and is only recommended on platforms that have hardware accelerated
AES-CBC but not AES-XTS. There's no link-time dependency, since SHA-256
is requested via the crypto API on first use.
To reduce bloat, we should limit FS_ENCRYPTION to selecting the default
algorithms only. SHA-256 by itself isn't that much bloat, but it's
being discussed to move ESSIV into a crypto API template, which would
incidentally bring in other things like "authenc" support, which would
all end up being built-in since FS_ENCRYPTION is now a bool.
For Adiantum encryption we already just document that users who want to
use it have to enable CONFIG_CRYPTO_ADIANTUM themselves. So, let's do
the same for AES-128-CBC-ESSIV and CONFIG_CRYPTO_SHA256.
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
When ceph_mdsc_build_path is handed a positive dentry, it will return a
zero-length path string with the base set to that dentry. This is not
what we want. Always include at least one path component in the string.
ceph_mdsc_build_path has behaved this way for a long time but it didn't
matter until the recent d_name handling rework.
Fixes: 964fff7491 ("ceph: use ceph_mdsc_build_path instead of clone_dentry_name")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Remove the d_is_dir() check from tgid_pidfd_to_pid().
It is pointless since you should never get &proc_tgid_base_operations
for f_op on a non-directory.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian@brauner.io>
The code in quota_getstate and quota_getstatev is strange; it
says the returned fs_quota_stat[v] structure has room for only
one type of time limits, so fills it in with the first enabled
quota, even though every quotactl command must have a type sent
in by the user.
Instead of just picking the first enabled quota, fill in the
reply with the timers for the quota type that was actually
requested.
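A hedged sketch of the intent (struct and field names assumed; not the exact
quota_getstate code):

#include <linux/quota.h>
#include <linux/dqblk_xfs.h>

/* Copy the grace timers for the quota type the caller asked for,
 * not for whichever type happened to be enabled first. */
static void example_fill_timers(struct fs_quota_stat *fqs,
                                const struct qc_state *state, int type)
{
        fqs->qs_btimelimit   = state->s_state[type].spc_timelimit;
        fqs->qs_itimelimit   = state->s_state[type].ino_timelimit;
        fqs->qs_rtbtimelimit = state->s_state[type].rt_spc_timelimit;
}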
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
With SQE links, we can create chains of dependent SQEs. One example
would be queueing an SQE that's a read from one file descriptor, with
the linked SQE being a write to another with the same set of buffers.
An SQE link will not stall the pipeline, it'll just ensure that
dependent SQEs aren't issued before the previous link has completed.
Any error at submission or completion time will break the chain of SQEs.
For completions, this also includes short reads or writes, as the next
SQE could depend on the previous one being fully completed.
Any SQE in a chain that gets canceled due to any of the above errors
will get a CQE filled with -ECANCELED as the error value.
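From userspace, with liburing, a chain might be queued like this (a hedged
sketch; the helpers are liburing's, and error checking such as a NULL return
from io_uring_get_sqe is omitted for brevity):

#include <liburing.h>
#include <sys/uio.h>

/* Queue a read from in_fd and a dependent write to out_fd using the same
 * buffer; the write is not issued until the read completes successfully
 * and in full, otherwise its CQE carries -ECANCELED. */
static int queue_linked_copy(struct io_uring *ring, int in_fd, int out_fd,
                             struct iovec *iov)
{
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_readv(sqe, in_fd, iov, 1, 0);
        sqe->flags |= IOSQE_IO_LINK;            /* link to the next SQE */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_writev(sqe, out_fd, iov, 1, 0);

        return io_uring_submit(ring);
}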
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ever since the initial commit of the binfmt_flat shared library
support back in the bitkeeper days we've offset the actual in-memory
.data start by one field per possible shared library, or 1 in case
shared library support isn't enabled. I can't find anything in the
loader that actually makes use of it, nor was it present before
shared library support was added.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
MAX_SHARED_LIBS is an implementation detail of the kernel loader,
and should be kept away from the file format definition.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
No need to carry the extra code around, given that systems using flat
binaries are generally very resource constrained.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
Allow architectures to opt into ARCH_HAS_BINFMT_FLAT support instead of
assuming that all nommu ports support the format.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
Most binfmt_flat on-disk fields are big endian. Use the proper __be32
type where applicable.
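For illustration, a hedged sketch of the pattern (the struct and field names
here are made up, not the real flat header):

#include <linux/types.h>
#include <asm/byteorder.h>

/* On-disk fields are stored big endian, so declare them as __be32 and
 * convert explicitly at the point of use; sparse can then flag any
 * access that forgets the byte swap. */
struct example_flat_hdr {
        __be32 entry;
        __be32 data_start;
};

static unsigned long example_entry(const struct example_flat_hdr *hdr)
{
        return be32_to_cpu(hdr->entry);
}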
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
The split between the two flat.h files is completely arbitrary, and the
uapi version even contains CONFIG_ ifdefs that can't work in userspace.
The only userspace program known to use the header is elf2flt, and it
ships with its own version of the combined header.
Use the chance to move the <asm/flat.h> inclusion out of this file, as it
is in no way needed for the format definition, but just for the binfmt
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
This will eventually allow us to kill the need for an <asm/flat.h> for
many cases.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
Instead add a Kconfig variable that only h8300 selects.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>