After making dirty a 100M file, the normal behavior is to start the writeback
for all data after 30s delays. But sometimes the following happens instead:
- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M
Some analyze shows that the internal io dispatch queues goes like this:
s_io s_more_io
-------------------------
1) 100M,1K 0
2) 1K 96M
3) 0 96M
1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes(BUG)
nr_to_write > 0 in (3) fools the upper layer to think that data have all been
written out. The big dirty file is actually still sitting in s_more_io. We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty. It is also not an option to draw inodes from both
s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
and might also starve other superblocks in sync time(well kupdate may still
starve some superblocks, that's another bug).
We have to return when a full scan of s_io completes. So nr_to_write > 0 does
not necessarily mean that "all data are written". This patch introduces a
flag writeback_control.more_io to indicate this situation. With it the big
dirty file no longer has to wait for the next kupdate invocation 5s later.
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NTFS's if-condition on dirty inodes is not complete. Fix it with
sb_has_dirty_inodes().
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Streamline the management of dirty inode lists and fix time ordering bugs.
The writeback logic used to move not-yet-expired dirty inodes from s_dirty to
s_io, *only to* move them back. The move-inodes-back-and-forth thing is a
mess, which is eliminated by this patch.
The new scheme is:
- s_dirty acts as a time ordered io delaying queue;
- s_io/s_more_io together acts as an io dispatching queue.
On kupdate writeback, we pull some inodes from s_dirty to s_io at the start of
every full scan of s_io. Otherwise (i.e. for sync/throttle/background
writeback), we always pull from s_dirty on each run (a partial scan).
Note that the line
list_splice_init(&sb->s_more_io, &sb->s_io);
is moved to queue_io() to leave s_io empty. Otherwise a big dirtied file will
sit in s_io for a long time, preventing new expired inodes to get in.
Cc: Ken Chen <kenchen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current -mm tree has bucketful of bug fixes in periodic writeback path.
However, we still hit a glitch where dirty pages on a given inode aren't
completely flushed to the disk, and system will accumulate large amount of
dirty pages beyond what dirty_expire_interval is designed for.
The problem is __sync_single_inode() will move an inode to sb->s_dirty list
even when there are more pending dirty pages on that inode. If there is
another inode with a small number of dirty pages, we hit a case where the loop
iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0.
Thus leaving the inode that has large amount of dirty pages behind and it has
to wait for another dirty_writeback_interval before we flush it again. We
effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval.
If the rate of dirtying is sufficiently high, the system will start
accumulate a large number of dirty pages.
So fix it by having another sb->s_more_io list on which to park the inode
while we iterate through sb->s_io and to allow each dirty inode which resides
on that sb to have an equal chance of flushing some amount of dirty pages.
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This one fixes four bugs.
There are a few situation in there where writeback decides it is going to skip
over a blockdev inode on the kernel-internal blockdev superblock. It
presently does this by moving the blockdev inode onto the tail of the blockdev
superblock's s_dirty. But
a) this screws up s_dirty's reverse-time-orderedness and
b) refiling the blockdev for writeback in another 30 second is rude. We
should try again sooner than that.
Fix all this up by using redirty_head(): move the blockdev inode onto the head
of the blockdev superblock's s_dirty list for prompt writeback.
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recycling the previous changelog:
When the writeback function is operating in writeback-for-flushing mode
(as opposed to writeback-for-integrity) and it encounters an I_LOCKed inode,
it will skip writing that inode. This is done for throughput and latency:
move on to another inode rather than blocking for this one.
Writeback skips this inode by moving it off s_io and onto s_dirty, so that
writeback can proceed with the other inodes on s_io.
However that inode movement can corrupt s_dirty's
reverse-time-orderedness. Fix that by using the new redirty_tail(), which
will update the refiled inode's dirtied_when field.
Note: the behaviour in here is a bit rude: if kupdate happens to come
across a locked inode then it will defer writeback of that inode for another
30 seconds. We'll address that in the next patch.
Address that here. What we do is to move the skipped inode to the _head_ of
s_dirty, immediately eligible for writeout again. Instead of deferring that
writeout for another 30 seconds.
One would think that this might cause a livelock: we keep on trying to write
the same locked inode. But it won't because:
a) if that was the case, it would _already_ be happening on the
balance_dirty_pages codepath. Because balance_dirty_pages() doesn't care
about inode timestamps.
b) if we skipped this inode then we won't have done any writeback. The
higher-level writeback paths will see that wbc.nr_to_write didn't change
and they'll then back off and take a nap.
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the writeback function is operating in writeback-for-flushing mode (as
opposed to writeback-for-integrity) and it encounters an I_LOCKed inode, it
will skip writing that inode. This is done for throughput and latency: move
on to another inode rather than blocking for this one.
Writeback skips this inode by moving it off s_io and onto s_dirty, so that
writeback can proceed with the other inodes on s_io.
However that inode movement can corrupt s_dirty's reverse-time-orderedness.
Fix that by using the new redirty_tail(), which will update the refiled
inode's dirtied_when field.
Note: the behaviour in here is a bit rude: if kupdate happens to come across a
locked inode then it will defer writeback of that inode for another 30
seconds. We'll address that in the next patch.
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a comment in there which claims that the inode is left on s_io
if nfs chickened out of writing some data.
But that's not been true for three years.
9290280ced13c85689adeffa587e9a53bd3a5873 fixed a livelock by moving these
inodes back onto s_dirty. Fix the comment.
In the second leg of the `if', use redirty_tail() rather than open-coding it.
Add weaselly comment indicating lack of confidence in the code and lack of the
fortitude which would be needed to fiddle with it.
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the kupdate function has tried to write back an expired inode it will
then check to see whether some of the inode's pages are still dirty.
This can happen when the filesystem decided to not write a page for some
reason. But it does _not_ occur due to redirtyings: a redirtying will set
I_DIRTY_PAGES.
What we need to do here is to set I_DIRTY_PAGES to reflect reality and to then
put the inode onto the _head_ of s_dirty for consideration on the next kupdate
pass, in five seconds time.
Problem is, the code failed to modify the inode's timestamp when pushing the
inode onto thehead of s_dirty.
The patch:
If there are no other inodes on s_dirty then we leave the inode's timestamp
alone: it is already expired.
If there _are_ other inodes on s_dirty then we arrange for this inode to get
the same timestamp as the inode which is at the head of s_dirty, thus
preserving the s_dirty ordering. But we only need to do this if this inode
purports to have been dirtied before the one at head-of-list.
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While writeback is working against a dirty inode it does a check after trying
to write some of the inode's pages:
"did the lower layers skip some of the inode's dirty pages because they were
locked (or under writeback, or whatever)"
If this turns out to be true, we must move the inode back onto s_dirty and
redirty it. The reason for doing this is that fsync() and friends only check
the s_dirty list, and those functions want to know about those pages which
were locked, so they can be waited upon and, if necessary, rewritten.
Problem is, that redirtying was putting the inode onto the tail of s_dirty
without updating its timestamp. This causes a violation of s_dirty ordering.
Fix this by updating inode->dirtied_when when moving the inode onto s_dirty.
But the code is still a bit buggy? If the inode was _already_ dirty then we
don't need to move it at all. Oh well, hopefully it doesn't matter too much,
as that was a redirtying, which was very recent anwyay.
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For reasons which escape me, inodes which are dirty against a ram-backed
filesystem are managed in the same way as inodes which are backed by real
devices.
Probably we could optimise things here. But given that we skip the entire
supeblock as son as we hit the first dirty inode, there's not a lot to be
gained.
And the code does need to handle one particular non-backed superblock: the
kernel's fake internal superblock which holds all the blockdevs.
Still. At present when the code encounters an inode which is dirty against a
memory-backed filesystem it will skip that inode by refiling it back onto
s_dirty. But it fails to update the inode's timestamp when doing so which at
least makes the debugging code upset.
Fix.
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When writeback has finished writing back an inode it looks to see if that
inode is still dirty. If it is, that means that a process redirtied the inode
while its writeback was in progress.
What we need to do here is to refile the redirtied inode onto the s_dirty
list.
But we're doing that wrongly: it could be that this inode was redirtied
_before_ the last inode on s_dirty. We're blindly appending this inode to the
list, after an inode which might be less-recently-dirtied, thus violating the
list's ordering.
So we must either insertion-sort this inode into the correct place, or we must
update this inode's dirtied_when field when appending it to the reverse-sorted
s_dirty list, to preserve the reverse-time-ordering.
This patch does the latter: if this inode was dirtied less recently than the
tail inode then copy the tail inode's timestamp into this inode.
This means that in rare circumstances, some inodes will be writen back later
than they should have been. But the time slip will be small.
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hide everything in blkdev.h with CONFIG_BLOCK isn't set, and fixup
the (few) files that fail to build because they were relying on blkdev.h
pulling in extra includes for them.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In __writeback_single_inode(), when we find a locked inode and we're not
doing a data-integrity sync, we used to just skip writing entirely,
since we didn't want to wait for the inode to unlock.
However, there's really no reason to skip writing the data pages, which
are likely to be the the bulk of the dirty state anyway (and the main
reason why writeback was started for the non-data-integrity case, of
course!)
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@osdl.org>,
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move blockdev_superblock extern declaration from fs/fs-writeback.c to a
headerfile and remove the dependence on it by wrapping it in a macro.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Create a new header file, fs/internal.h, for common definitions local to the
sources in the fs/ directory.
Move extern definitions that should be in header files from fs/*.c to
fs/internal.h or other main header files where they span directories.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conversion of nr_unstable to a per zone counter
We need to do some special modifications to the nfs code since there are
multiple cases of disposition and we need to have a page ref for proper
accounting.
This converts the last critical page state of the VM and therefore we need to
remove several functions that were depending on GET_PAGE_STATE_LAST in order
to make the kernel compile again. We are only left with event type counters
in page state.
[akpm@osdl.org: bugfixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes nr_dirty a per zone counter. Looping over all processors is
avoided during writeback state determination.
The counter aggregation for nr_dirty had to be undone in the NFS layer since
we summed up the page counts from multiple zones. Someone more familiar with
NFS should probably review what I have done.
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A process flag to indicate whether we are doing sync io is incredibly
ugly. It also causes performance problems when one does a lot of async
io and then proceeds to sync it. Part of the io will go out as async,
and the other part as sync. This causes a disconnect between the
previously submitted io and the synced io. For io schedulers such as CFQ,
this will cause us lost merges and suboptimal behaviour in scheduling.
Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
the O_DIRECT path just directly indicate that the writes are sync
by using WRITE_SYNC instead.
Signed-off-by: Jens Axboe <axboe@suse.de>
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes range of writeback_control.
So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.
And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.
This patch does,
- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
-1 is usually ok for range_end (type is long long). But, if someone did,
range_end += val; range_end is "val - 1"
u64val = range_end >> bits; u64val is "~(0ULL)"
or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
things, and uses LLONG_MAX for range_end.
- All callers of ->writepages() sets range_start/end or range_cyclic.
- Fix updates of ->writeback_index. It seems already bit strange.
If it starts at 0 and ended by check of nr_to_write, this last
index may reduce chance to scan end of file. So, this updates
->writeback_index only if range_cyclic is true or whole-file is
scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In here, I think the following order is more cache-friendly.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Convert to proper kernel-doc format.
Some have extra blank lines (not allowed immed. after the function name)
or need blank lines (after all parameters). Function summary must be only
one line.
Colon (":") in a function description does weird things (causes kernel-doc
to think that it's a new section head sadly).
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the backing_dev_info doesn't have BDI_CAP_NO_WRITEBACK we're not supposed
to write back an inode's pages. But in this situation write_inode_now()
refuses to write the inode itself as well. Fix.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the inode count is zero in inode writeback, the
WARN_ON(!(inode->i_state & I_WILL_FREE));
is broken, and needs to test for either I_WILL_FREE|I_FREEING.
When the inode is in I_FREEING state, it's already out of the visibility
of the vm so it can't be freed so it doesn't require the __iget and the
generic_delete_inode path can call the sync internally to the lowlevel
fs callback during the last iput. So the inode being in I_FREEING is
also a valid condition for calling the sync with i_count == 0.
The specific stack trace is this:
0xc00000007b8fb6e0 0xc00000000010118c .__writeback_single_inode +0x5c
0xc00000007b8fb6e0 0xc0000000001014dc (lr) .sync_inode +0x3c
0xc00000007b8fb790 0xc0000000001014dc .sync_inode +0x3c
0xc00000007b8fb820 0xc0000000001a5020 .ext2_sync_inode +0x64
0xc00000007b8fb8f0 0xc0000000001a65b4 .ext2_truncate +0x3f8
0xc00000007b8fba40 0xc0000000001a6940 .ext2_delete_inode +0xdc
0xc00000007b8fbac0 0xc0000000000f7a5c .generic_delete_inode +0x124
0xc00000007b8fbb50 0xc0000000000f5fe0 .iput +0xb8
0xc00000007b8fbbe0 0xc0000000000e9fd4 .sys_unlink +0x2a8
0xc00000007b8fbd10 0xc00000000001048c .ret_from_syscall_1 +0x0
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
list_move(&inode->i_list, &inode_in_use);
} else {
list_move(&inode->i_list, &inode_unused);
+ inodes_stat.nr_unused++;
}
}
wake_up_inode(inode);
Are you sure the above diff is correct? It was added somewhere between
2.6.5 and 2.6.8. I think it's wrong.
The only way I can imagine the i_count to be zero in the above path, is
that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
increase nr_unused. So I believe the above change is buggy and it will
definitely overstate the number of unused inodes and it should be backed
out.
Note that __writeback_single_inode before calling __sync_single_inode, can
drop the spinlock and we can have both the dirty and locked bitflags clear
here:
spin_unlock(&inode_lock);
__wait_on_inode(inode);
iput(inode);
XXXXXXX
spin_lock(&inode_lock);
}
use inode again here
a construct like the above makes zero sense from a reference counting
standpoint.
Either we don't ever use the inode again after the iput, or the
inode_lock should be taken _before_ executing the iput (i.e. a __iput
would be required). Taking the inode_lock after iput means the iget was
useless if we keep using the inode after the iput.
So the only chance the 2.6 was safe to call __writeback_single_inode
with the i_count == 0, is that I_WILL_FREE is set (I_WILL_FREE will
prevent the VM to free the inode in XXXXX).
Potentially calling the above iput with I_WILL_FREE was also wrong
because it would recurse in iput_final (the second mainline bug).
The below (untested) patch fixes the nr_unused accounting, avoids recursing
in iput when I_WILL_FREE is set and makes sure (with the BUG_ON) that we
don't corrupt memory and that all holders that don't set I_WILL_FREE, keeps
a reference on the inode!
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes O(n^2) super block loops in sync_inodes(),
sync_filesystems() etc. in favour of using __put_super_and_need_restart()
which I introduced earlier. We faced a noticably long freezes on sb
syncing when there are thousands of super blocks in the system.
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some KernelDoc descriptions are updated to match the current code.
No code changes.
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!