This allows us to eliminate duplicate code, and will eventually allow
us to fold ext4_sops and ext4_nojournal_sops together.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The check of whether a quota format is set even though there are no
quota files using journalled quota is pointless, and it actually
makes it impossible to turn off journalled quotas (as there's no way
to unset the journalled quota format). Just remove the check.
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
__jbd2_journal_clean_checkpoint_list() returns the number of buffers
it freed, but no one was using the value, so just stop doing that.
This also allows us to simplify the calling convention of
journal_clean_one_cp_list().
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Yuanhan has reported that when he runs an fsync(2)-heavy workload
creating new files over a ramdisk, a significant amount of time is
spent in __jbd2_journal_clean_checkpoint_list() trying to clean old
transactions (which cannot be cleaned up because the flusher hasn't
yet checkpointed those buffers). The workload can be generated by:
fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096
Reduce the amount of scanning by stopping the scan of the transaction
list once we find a transaction that cannot be checkpointed. Note that
this way of cleaning is still enough to keep freeing space in the
journal after fully checkpointed transactions.
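To illustrate the approach, here is a minimal sketch with made-up
names (not the actual jbd2 code): the scan simply bails out at the
first transaction whose buffers cannot all be released yet.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transaction node; the real jbd2 structures differ. */
struct cp_txn {
	struct cp_txn *next;
	bool all_buffers_written;	/* checkpointed by the flusher yet? */
};

/* Walk the checkpoint list oldest-first and stop at the first
 * transaction that cannot be freed yet; scanning further would not
 * release any additional journal space. */
static void clean_checkpoint_list(struct cp_txn *head)
{
	struct cp_txn *t;

	for (t = head; t; t = t->next) {
		if (!t->all_buffers_written)
			break;	/* early exit instead of scanning everything */
		/* ... release the transaction's checkpoint buffers ... */
	}
}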
Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
A production filesystem is likely compiled and mounted without jbd
debugging, so orphan list clearing will be silent.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If EIO happens after we have dropped j_state_lock, we won't notice
that the journal has been aborted. So it is reasonable to move this
check after we have grabbed the j_checkpoint_mutex and re-grabbed the
j_state_lock. This patch helps prevent false positive complaints
after EIO.
#DMESG:
__jbd2_log_wait_for_space: needed 8448 blocks and only had 8386 space available
__jbd2_log_wait_for_space: no way to get more journal space in ram1-8
------------[ cut here ]------------
WARNING: CPU: 15 PID: 6739 at fs/jbd2/checkpoint.c:168 __jbd2_log_wait_for_space+0x188/0x200()
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 15 PID: 6739 Comm: fsstress Tainted: G W 3.17.0-rc2-00429-g684de57 #139
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
00000000000000a8 ffff88077aaab878 ffffffff815c1a8c 00000000000000a8
0000000000000000 ffff88077aaab8b8 ffffffff8106ce8c ffff88077aaab898
ffff8807c57e6000 ffff8807c57e6028 0000000000002100 ffff8807c57e62f0
Call Trace:
[<ffffffff815c1a8c>] dump_stack+0x51/0x6d
[<ffffffff8106ce8c>] warn_slowpath_common+0x8c/0xc0
[<ffffffff8106ceda>] warn_slowpath_null+0x1a/0x20
[<ffffffff812419f8>] __jbd2_log_wait_for_space+0x188/0x200
[<ffffffff8123be9a>] start_this_handle+0x4da/0x7b0
[<ffffffff810990e5>] ? local_clock+0x25/0x30
[<ffffffff810aba87>] ? lockdep_init_map+0xe7/0x180
[<ffffffff8123c5bc>] jbd2__journal_start+0xdc/0x1d0
[<ffffffff811f2414>] ? __ext4_new_inode+0x7f4/0x1330
[<ffffffff81222a38>] __ext4_journal_start_sb+0xf8/0x110
[<ffffffff811f2414>] __ext4_new_inode+0x7f4/0x1330
[<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
[<ffffffff812025bb>] ext4_create+0x8b/0x150
[<ffffffff8117fe3b>] vfs_create+0x7b/0xb0
[<ffffffff8118097b>] do_last+0x7db/0xcf0
[<ffffffff8117e31d>] ? inode_permission+0x4d/0x50
[<ffffffff811845d2>] path_openat+0x242/0x590
[<ffffffff81191a76>] ? __alloc_fd+0x36/0x140
[<ffffffff81184a6a>] do_filp_open+0x4a/0xb0
[<ffffffff81191b61>] ? __alloc_fd+0x121/0x140
[<ffffffff81172f20>] do_sys_open+0x170/0x220
[<ffffffff8117300e>] SyS_open+0x1e/0x20
[<ffffffff811715d6>] SyS_creat+0x16/0x20
[<ffffffff815c7e12>] system_call_fastpath+0x16/0x1b
---[ end trace cd71c831f82059db ]---
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Free the buffer head if the journal descriptor block fails checksum
verification.
This is the jbd2 port of the e2fsprogs patch "e2fsck: free bh on csum
verify error in do_one_pass".
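The shape of the fix is roughly the following (a sketch only; the
function below is hypothetical and the real do_one_pass() error
handling is more involved):

#include <linux/buffer_head.h>
#include <linux/errno.h>

/* Sketch: if the descriptor block fails checksum verification,
 * release the buffer head before bailing out instead of leaking it. */
static int check_descr_block(struct buffer_head *bh, bool csum_ok)
{
	if (!csum_ok) {
		brelse(bh);	/* the release this patch adds */
		return -EIO;
	}
	return 0;
}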
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@vger.kernel.org
When loading extended attributes, check each entry's value offset to
make sure it doesn't collide with the entries.
Without this check it is easy to crash the kernel by mounting a
malicious FS containing a file with an EA wherein e_value_offs = 0 and
e_value_size > 0 and then deleting the EA, which corrupts the name
list.
(See the f_ea_value_crash test's FS image in e2fsprogs for an example.)
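Conceptually the check looks something like this (a simplified,
self-contained sketch; the real ext4 xattr validation works on the
on-disk entry layout and differs in detail):

#include <stddef.h>

/* In an xattr block, entry descriptors grow from the front and values
 * are stored towards the end.  Reject an entry whose value region
 * would overlap the entry descriptors (e.g. e_value_offs == 0). */
static int xattr_value_is_sane(size_t value_offs, size_t value_size,
			       size_t entries_end, size_t block_size)
{
	if (value_size == 0)
		return 1;
	if (value_offs < entries_end)
		return 0;	/* collides with the entry (name) list */
	if (value_offs > block_size || value_size > block_size - value_offs)
		return 0;	/* runs off the end of the block */
	return 1;
}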
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
If inline->extent conversion fails (most probably due to ENOSPC) and
we release the temporary page that we allocated to transfer the file
contents, don't keep using the page pointer after releasing the page.
This occasionally leads to complaints about evicting locked pages or
hangs when blocksize > pagesize, because it's possible for the page to
get reallocated elsewhere in the meantime.
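The pattern of the fix is roughly this (illustrative only; the helper
below is hypothetical and the real conversion code is more involved):

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Drop the temporary page and clear the caller's pointer so that
 * later error handling cannot touch a page that may already have been
 * reused elsewhere. */
static void drop_tmp_page(struct page **pagep)
{
	if (*pagep) {
		unlock_page(*pagep);
		put_page(*pagep);
		*pagep = NULL;
	}
}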
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tao Ma <tm@tao.ma>
If the external journal device has metadata_csum enabled, verify
that the superblock checksum matches the block before we try to
mount.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Clear all three journal checksum feature flags before turning on
whichever journal checksum options we want. Rearrange the error
checking so that newer flags get complained about first.
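A rough sketch of the intended logic (the surrounding ext4 code and
mount-option handling are simplified; 'want_csum_v3' is a made-up
stand-in for the real checks):

#include <linux/jbd2.h>

static void select_journal_csum(journal_t *journal, bool want_csum_v3)
{
	/* wipe all three checksum-related feature flags first ... */
	jbd2_journal_clear_features(journal,
			JBD2_FEATURE_COMPAT_CHECKSUM, 0,
			JBD2_FEATURE_INCOMPAT_CSUM_V2 |
			JBD2_FEATURE_INCOMPAT_CSUM_V3);
	/* ... then turn on only the variant that was requested */
	if (want_csum_v3)
		jbd2_journal_set_features(journal, 0, 0,
					  JBD2_FEATURE_INCOMPAT_CSUM_V3);
	else
		jbd2_journal_set_features(journal,
					  JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
}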
Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently the sysfs feature files use ext4_attr_ops as the file
operations to show/store data. However, the feature files are not
supposed to contain any data at all; the sole existence of the file
means that the module supports the feature. Moreover, none of the
sysfs feature attributes actually register show/store functions, so
this is not a problem today. However, if a sysfs feature attribute
ever registered a show or store function, we might be in trouble,
because the kobject in this case is _not_ embedded in the
ext4_sb_info structure as ext4_attr_show/store expect.
So, just to be safe, provide separate empty sysfs_ops to use in
ext4_feat_ktype. This might save us from potential problems in the
future. As a bonus we can "store" something more descriptive than
nothing in the files, so let the files contain "enabled" to make it
clear that the feature is really present in the module.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently there is no easy way to tell that a mounted file system
contains errors, other than checking for log messages or reading the
information directly from the superblock.
This patch adds new sysfs entries:
errors_count (number of fs errors we encounter)
first_error_time (unix timestamp for the first error we see)
last_error_time (unix timestamp for the last error we see)
If the file system is not marked as containing errors then each of
these files will return 0. Otherwise they will contain valid
information. More details about the errors can, as always, be found
in the logs.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The MAXQUOTAS value defines the maximum number of quota types the VFS
supports, which isn't necessarily the number of types ext4 supports.
Even though ext4 will eventually support project quotas, use an
ext4-private definition, for consistency with other filesystems.
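The private limit amounts to something like the following (the exact
value and naming are the patch's choice; this is only an illustration
of the idea, given that ext4 currently supports user and group
quotas):

/* ext4 only supports user and group quotas today, so size its quota
 * arrays by a private limit rather than the VFS-wide MAXQUOTAS. */
#define EXT4_MAXQUOTAS	2

struct quota_names {			/* illustrative container */
	char *qf_names[EXT4_MAXQUOTAS];	/* quota file names, one per type */
};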
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since the jbd/jbd2 superblock is not released until the file system
is unmounted, allocate its buffer cache entry from the non-movable
area to allow page migration and CMA allocations to more easily
succeed.
Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Since the ext4 superblock is not released until the file system is
unmounted, allocate the buffer cache entry for the ext4 superblock
out of the non-movable area to allow page migrations, and thus CMA
allocations, to more easily succeed if the CMA area is limited.
Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Buffer cache pages are normally allocated from the movable area
because they are referenced for a while and released soon afterwards.
But some filesystems hold on to buffer cache pages for a long time,
which can disturb page migration.
New APIs are introduced to allocate the buffer cache with a
caller-specified flag. The *_gfp APIs are for callers that want to
set the page allocation flags for the page cache allocation, and the
*_unmovable APIs are for callers that want to allocate the page cache
from the non-movable area.
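Assuming the series provides an sb_bread_unmovable() style wrapper
(the exact helper names may differ), a long-lived buffer such as a
superblock would then be read along these lines:

#include <linux/buffer_head.h>

/* Read a block that will be held for the lifetime of the mount; ask
 * for a non-movable page so it does not pin an otherwise movable
 * pageblock. */
static struct buffer_head *read_pinned_block(struct super_block *sb,
					     sector_t block)
{
	return sb_bread_unmovable(sb, block);
}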
Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
When we discover a written-out buffer in a transaction's checkpoint
list, we don't have to recheck the validity of the transaction.
Either this is the last buffer in the transaction, in which case we
are done, or it isn't, in which case we can just take another buffer
from the checkpoint list without dropping j_list_lock.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The __jbd2_journal_remove_checkpoint() doesn't require an elevated
b_count; indeed, until the jh structure gets released by the call to
jbd2_journal_put_journal_head(), the bh's b_count is elevated by
virtue of the existence of the jh structure.
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Having done a full regression test, we can now drop the
DELALLOC_RESERVED state flag.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
The EXT4_STATE_DELALLOC_RESERVED flag was originally implemented
because it was too hard to make sure the mballoc and get_block flags
could be reliably passed down through all of the codepaths that end up
calling ext4_mb_new_blocks().
Since then, we have mb_flags passed down through most of the code
paths, so getting rid of EXT4_STATE_DELALLOC_RESERVED isn't as tricky
as it used to be.
This commit plumbs in the last of what is required, and then adds a
WARN_ON check to make sure we haven't missed anything. If this passes
a full regression test run, we can then drop
EXT4_STATE_DELALLOC_RESERVED.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Instead of initializing the allocation_request structure in
ext4_alloc_branch(), set it up in ext4_ind_map_blocks(), and then pass
it to ext4_alloc_branch() and ext4_splice_branch().
This allows ext4_ind_map_blocks to pass flags in the allocation
request structure without having to add Yet Another argument to
ext4_alloc_branch().
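Schematically, the setup looks something like this (field names are
from struct ext4_allocation_request, but the fragment below is a
simplified illustration with made-up locals, not the patch itself):

	struct ext4_allocation_request ar;

	memset(&ar, 0, sizeof(ar));
	ar.inode   = inode;		/* file being mapped */
	ar.logical = map->m_lblk;	/* first logical block to map */
	ar.len     = count;		/* number of blocks wanted */
	ar.goal    = goal;		/* preferred physical block */
	ar.flags   = alloc_flags;	/* allocation hints ride along here */
	/* ar is then handed to ext4_alloc_branch() and
	 * ext4_splice_branch() instead of passing each value as a
	 * separate argument. */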
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
This commit adds some statistics to the extent status tree shrinker.
The purpose of adding these is to collect more details when we
encounter a stall caused by the extent status tree shrinker. Here we
count the following statistics:
stats:
the number of all objects on all extent status trees
the number of reclaimable objects on lru list
cache hits/misses
the last sorted interval
the number of inodes on lru list
average:
scan time for shrinking some objects
the number of shrunk objects
maximum:
the inode that has max nr. of objects on lru list
the maximum scan time for shrinking some objects
The output looks like below:
$ cat /proc/fs/ext4/sda1/es_shrinker_info
stats:
28228 objects
6341 reclaimable objects
5281/631 cache hits/misses
586 ms last sorted interval
250 inodes on lru list
average:
153 us scan time
128 shrunk objects
maximum:
255 inode (255 objects, 198 reclaimable)
125723 us max scan time
If the lru list has never been sorted, the following line will not be
printed:
586ms last sorted interval
If there is an empty lru list, the following lines also will not be
printed:
250 inodes on lru list
...
maximum:
255 inode (255 objects, 198 reclaimable)
0 us max scan time
Meanwhile, this commit defines a new trace point to print some
details in __ext4_es_shrink().
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This commit improves the trace points of the extent status tree. We
rename trace_ext4_es_shrink_enter in ext4_es_count() because it is
also used in ext4_es_scan() and we cannot tell the two apart from the
trace output. This commit also fixes a variable name in a trace point
in order to keep it consistent with the others.
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Enable by default the block_validity feature, which checks for
collisions between newly allocated blocks and critical system
metadata.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
__wait_cp_io() is only called by jbd2_log_do_checkpoint(). Fold it in
to make it a bit easier to understand.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
__process_buffer() is only called by jbd2_log_do_checkpoint(), and it
had a very complex locking protocol where it would be called with the
j_list_lock, and sometimes exit with the lock held (if the return code
was 0), or release the lock.
This was confusing both to humans and to smatch (which erroneously
complained that the lock was taken twice).
Folding __process_buffer() to the caller allows us to simplify the
control flow, making the resulting function easier to read and reason
about, and dropping the compiled size of fs/jbd2/checkpoint.c by 150
bytes (over 4% of the text size).
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reuse the path object in ext4_move_extents() so we don't unnecessarily
free and reallocate it.
Also clean up the get_ext_path() wrapper so that it has the same
semantics of freeing the path object on error as ext4_ext_find_extent().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that the semantics of ext4_ext_find_extent() are much cleaner,
it's safe and more efficient to reuse the path object across the
multiple calls to ext4_ext_find_extent() in ext4_ext_shift_extents().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This adds additional safety in case, for some reason, we end up
reusing a path structure which isn't big enough for the current depth
of the inode.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Teach ext4_ext_drop_refs() to accept a NULL argument, much like
kfree(). This allows us to drop a lot of checks to make sure path is
non-NULL before calling ext4_ext_drop_refs().
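The change boils down to an early return, roughly like this
simplified sketch of ext4_ext_drop_refs() (the real function lives in
fs/ext4/extents.c and struct ext4_ext_path comes from ext4's private
headers):

static void ext4_ext_drop_refs(struct ext4_ext_path *path)
{
	int depth, i;

	if (!path)		/* new: accept NULL, like kfree(NULL) */
		return;
	depth = path->p_depth;
	for (i = 0; i <= depth; i++, path++) {
		if (path->p_bh) {
			brelse(path->p_bh);
			path->p_bh = NULL;
		}
	}
}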
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In nearly all of the calls to ext4_ext_find_extent() where the caller
is trying to recycle the path object, ext4_ext_drop_refs() gets called
to release the buffer heads before the path object gets overwritten.
To simplify things for the callers, and to avoid the possibility of a
memory leak, make ext4_ext_find_extent() responsible for dropping the
buffers.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Drop EXT4_EX_NOFREE_ON_ERR from ext4_ext_create_new_leaf(),
ext4_split_extent(), ext4_convert_unwritten_extents_endio().
This requires fixing all of their callers to allow
ext4_ext_find_extent() to free the struct ext4_ext_path object in
case of an error, and there are interlocking dependencies all the way
up to ext4_ext_map_blocks(), ext4_swap_extents(), and
ext4_ext_remove_space().
Once this is done, we can drop the EXT4_EX_NOFREE_ON_ERR flag since it
is no longer necessary.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The function ext4_convert_initialized_extents() is only called by a
single function --- ext4_ext_convert_initialized_extents(). Inline
the code and get rid of the unnecessary bits in order to simplify the
code.
Rename ext4_ext_convert_initialized_extents() to
convert_initialized_extents() since it's a static function that is
actually only used in a single caller, ext4_ext_map_blocks().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Right now, there are a number of places where it is all too easy to
leak memory on an error path, via a usage like this:
struct ext4_ext_path *path = NULL;
while (...) {
...
path = ext4_ext_find_extent(inode, block, path, 0);
if (IS_ERR(path)) {
/* oops, if path was non-NULL before the call to
ext4_ext_find_extent, we've leaked it! :-( */
...
return PTR_ERR(path);
}
...
}
Unfortunately, there are some code paths where we are doing the
following instead:
path = ext4_ext_find_extent(inode, block, orig_path, 0);
and where it's important that we _not_ free orig_path in the case
where ext4_ext_find_extent() returns an error.
So change the function signature of ext4_ext_find_extent() so that it
takes a struct ext4_ext_path ** for its third argument, and by
default, on an error, it will free the struct ext4_ext_path, and then
zero out the struct ext4_ext_path * pointer. In order to avoid
causing problems, we add a flag EXT4_EX_NOFREE_ON_ERR which causes
ext4_ext_find_extent() to use the original behavior of forcing the
caller to deal with freeing the original path pointer on the error
case.
The goal is to get rid of EXT4_EX_NOFREE_ON_ERR entirely, but this
allows for a gentle transition and makes the patches easier to verify.
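With the new calling convention, the leak-prone loop from the
beginning of this message can be written as follows (a hypothetical
caller sketch; real call sites differ in detail):

	struct ext4_ext_path *path = NULL;

	while (...) {
		...
		path = ext4_ext_find_extent(inode, block, &path, 0);
		if (IS_ERR(path)) {
			/* the old path object has already been freed
			 * and the pointer zeroed for us, so nothing
			 * leaks here */
			return PTR_ERR(path);
		}
		...
	}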
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit b8a8684502 introduced an accidental flag aliasing between
EXT4_EX_NOCACHE and EXT4_GET_BLOCKS_CONVERT_UNWRITTEN.
Fortunately, this didn't introduce any untoward side effects --- we
got lucky. Nevertheless, fix this and leave a warning to hopefully
keep this from happening in the future.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We accidentally aliased the EXT4_EX_NOCACHE and
EXT4_GET_BLOCKS_CONVERT_UNWRITTEN flags, which apparently was hiding
a bug that was unmasked when this flag aliasing issue was addressed
(see the subsequent commit). The reproduction case was:
fsx -N 10000 -l 500000 -r 4096 -t 4096 -w 4096 -Z -R -W /vdb/junk
... which would cause fsx to report corruption in the data file.
The fix we have is a bit of an overkill, but I'd much rather be
conservative for now, and we can optimize ZERO_RANGE_FL handling
later. The fact that we need to zap the extent_status cache for the
inode is unfortunate, but correctness is far more important than
performance.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
If ext4_ext_find_extent() returns an error, we have to clear path1 or
path2 or else we would end up trying to free an ERR_PTR, which would
be bad.
Also eliminate some redundant code and mark the error paths as
unlikely().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_move_extents() is too complex to review. It duplicates almost
every function available in the rest of the codebase, and it has a
useless artificial restriction that orig_offset == donor_offset. But
in fact the logic of ext4_move_extents() is very simple:
Iterate over the extents one by one (similar to ext4_fill_fiemap_extents)
-> Iterate over each page covered by the extent (similar to generic_perform_write)
-> Swap the extents covered by the page (can be shared with IOC_MOVE_DATA)
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This allows us to make mext_next_extent static and potentially get rid
of it.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_journal_get_write_access() has just been called in
ext4_append(); calling it again here is redundant.
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>