Pull XFS update (part 2) from Ben Myers:
"Fixes for tracing of xfs_name strings, flag handling in
open_by_handle, a log space hang with freeze/unfreeze, fstrim offset
calculations, a section mismatch with xfs_qm_exit, an oops in
xlog_recover_process_iunlinks, and a deadlock in xfs_rtfree_extent.
There are also additional trace points for attributes, and the
addition of a workqueue for allocation to work around kernel stack
size limitations."
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: add lots of attribute trace points
xfs: Fix oops on IO error during xlog_recover_process_iunlinks()
xfs: fix fstrim offset calculations
xfs: Account log unmount transaction correctly
xfs: don't cache inodes read through bulkstat
xfs: trace xfs_name strings correctly
xfs: introduce an allocation workqueue
xfs: Fix open flag handling in open_by_handle code
xfs: fix deadlock in xfs_rtfree_extent
fs: xfs: fix section mismatch in linux-next
When we read inodes via bulkstat, we generally only read them once
and then throw them away - they never get used again. If we retain
them in cache, then it simply causes the working set of inodes and
other cached items to be reclaimed just so the inode cache can grow.
Avoid this problem by marking inodes read by bulkstat not to be
cached and check this flag in .drop_inode to determine whether the
inode should be added to the VFS LRU or not. If the inode lookup
hits an already cached inode, then don't set the flag. If the inode
lookup hits an inode marked with no cache flag, remove the flag and
allow it to be cached once the current reference goes away.
Inodes marked as not cached will get cleaned up by the background
inode reclaim or via memory pressure, so they will still generate
some short term cache pressure. They will, however, be reclaimed
much sooner and in preference to cache hot inodes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Pull XFS updates from Ben Myers:
"Scalability improvements for dquots, log grant code cleanups, plus
bugfixes and cleanups large and small"
Fix up various trivial conflicts that were due to some of the earlier
patches already having been integrated into v3.3 as bugfixes, and then
there were development patches on top of those. Easily merged by just
taking the newer version from the pulled branch.
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (45 commits)
xfs: fallback to vmalloc for large buffers in xfs_getbmap
xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get
xfs: remove remaining scraps of struct xfs_iomap
xfs: fix inode lookup race
xfs: clean up minor sparse warnings
xfs: remove the global xfs_Gqm structure
xfs: remove the per-filesystem list of dquots
xfs: use per-filesystem radix trees for dquot lookup
xfs: per-filesystem dquot LRU lists
xfs: use common code for quota statistics
xfs: reimplement fdatasync support
xfs: split in-core and on-disk inode log item fields
xfs: make xfs_inode_item_size idempotent
xfs: log timestamp updates
xfs: log file size updates at I/O completion time
xfs: log file size updates as part of unwritten extent conversion
xfs: do not require an ioend for new EOF calculation
xfs: use per-filesystem I/O completion workqueues
quota: make Q_XQUOTASYNC a noop
xfs: include reservations in quota reporting
...
We currently have significant issues with the amount of stack that
allocation in XFS uses, especially in the writeback path. We can
easily consume 4k of stack between mapping the page, manipulating
the bmap btree and allocating blocks from the free list. Not to
mention btree block readahead and other functionality that issues IO
in the allocation path.
As a result, we can no longer fit allocation in the writeback path
in the stack space provided on x86_64. To alleviate this problem,
introduce an allocation workqueue and move all allocations to a
seperate context. This can be easily added as an interposing layer
into xfs_alloc_vextent(), which takes a single argument structure
and does not return until the allocation is complete or has failed.
To do this, add a work structure and a completion to the allocation
args structure. This allows xfs_alloc_vextent to queue the args onto
the workqueue and wait for it to be completed by the worker. This
can be done completely transparently to the caller.
The worker function needs to ensure that it sets and clears the
PF_TRANS flag appropriately as it is being run in an active
transaction context. Work can also be queued in a memory reclaim
context, so a rescuer is needed for the workqueue.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
New field of struct super_block - ->s_max_links. Maximal allowed
value of ->i_nlink or 0; in the latter case all checks still need
to be done in ->link/->mkdir/->rename instances. Note that this
limit applies both to directoris and to non-directories.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If we initialize the slab caches for the quota code when XFS is loaded there
is no need for a global and reference counted quota manager structure. Drop
all this overhead and also fix the error handling during quota initialization.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add an in-memory only flag to say we logged timestamps only, and use it to
check if fdatasync can optimize away the log force.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Timestamps on regular files are the last metadata that XFS does not update
transactionally. Now that we use the delaylog mode exclusively and made
the log scode scale extremly well there is no need to bypass that code for
timestamp updates. Logging all updates allows to drop a lot of code, and
will allow for further performance improvements later on.
Note that this patch drops optimized handling of fdatasync - it will be
added back in a separate commit.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The new concurrency managed workqueues are cheap enough that we can create
per-filesystem instead of global workqueues. This allows us to remove the
trylock or defer scheme on the ilock, which is not helpful once we have
outstanding log reservations until finishing a size update.
Also allow the default concurrency on this workqueues so that I/O completions
blocking on the ilock for one inode do not block process for another inode.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Define new macro XFS_ALL_QUOTA_ACTIVE and simply some usage
of quota macros.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Replace i_pin_wait, which is only used during synchronous inode flushing
with a bit waitqueue. This trades off a much smaller inode against
slightly slower wakeup performance, and saves 12 (32-bit) or 20 (64-bit)
bytes in the XFS inode.
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
We almost never block on i_flock, the exception is synchronous inode
flushing. Instead of bloating the inode with a 16/24-byte completion
that we abuse as a semaphore just implement it as a bitlock that uses
a bit waitqueue for the rare sleeping path. This primarily is a
tradeoff between a much smaller inode and a faster non-blocking
path vs faster wakeups, and we are much better off with the former.
A small downside is that we will lose lockdep checking for i_flock, but
given that it's always taken inside the ilock that should be acceptable.
Note that for example the inode writeback locking is implemented in a
very similar way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (22 commits)
xfs: mark the xfssyncd workqueue as non-reentrant
xfs: simplify xfs_qm_detach_gdquots
xfs: fix acl count validation in xfs_acl_from_disk()
xfs: remove unused XBT_FORCE_SLEEP bit
xfs: remove XFS_QMOPT_DQSUSER
xfs: kill xfs_qm_idtodq
xfs: merge xfs_qm_dqinit_core into the only caller
xfs: add a xfs_dqhold helper
xfs: simplify xfs_qm_dqattach_grouphint
xfs: nest qm_dqfrlist_lock inside the dquot qlock
xfs: flatten the dquot lock ordering
xfs: implement lazy removal for the dquot freelist
xfs: remove XFS_DQ_INACTIVE
xfs: cleanup xfs_qm_dqlookup
xfs: cleanup dquot locking helpers
xfs: remove the sync_mode argument to xfs_qm_dqflush_all
xfs: remove xfs_qm_sync
xfs: make sure to really flush all dquots in xfs_qm_quotacheck
xfs: untangle SYNC_WAIT and SYNC_TRYLOCK meanings for xfs_qm_dqflush
xfs: remove the lid_size field in struct log_item_desc
...
Fix up trivial conflict in fs/xfs/xfs_sync.c
Since Linux 2.6.36 the writeback code has introduces various measures for
live lock prevention during sync(). Unfortunately some of these are
actively harmful for the XFS model, where the inode gets marked dirty for
metadata from the data I/O handler.
The older_than_this checks that are now more strictly enforced since
writeback: avoid livelocking WB_SYNC_ALL writeback
by only calling into __writeback_inodes_sb and thus only sampling the
current cut off time once. But on a slow enough devices the previous
asynchronous sync pass might not have fully completed yet, and thus XFS
might mark metadata dirty only after that sampling of the cut off time for
the blocking pass already happened. I have not myself reproduced this
myself on a real system, but by introducing artificial delay into the
XFS I/O completion workqueues it can be reproduced easily.
Fix this by iterating over all XFS inodes in ->sync_fs and log all that
are dirty. This might log inode that only got redirtied after the
previous pass, but given how cheap delayed logging of inodes is it
isn't a major concern for performance.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
If the writeback code writes back an inode because it has expired we currently
use the non-blockin ->write_inode path. This means any inode that is pinned
is skipped. With delayed logging and a workload that has very little log
traffic otherwise it is very likely that an inode that gets constantly
written to is always pinned, and thus we keep refusing to write it. The VM
writeback code at that point redirties it and doesn't try to write it again
for another 30 seconds. This means under certain scenarious time based
metadata writeback never happens.
Fix this by calling into xfs_log_inode for kupdate in addition to data
integrity syncs, and thus transfer the inode to the log ASAP.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
On a system with lots of memory pressure that is stuck on synchronous inode
reclaim the workqueue code will run one instance of the inode reclaim work
item on every CPU. which is not what we want. Make sure to mark the
xfssyncd workqueue as non-reentrant to make sure there only is one instace
of each running globally. Also stop using special paramater for the
workqueue; now that we guarantee each fs has only running one of each works
at a time there is no need to artificially lower max_active and compensate
for that by setting the WQ_CPU_INTENSIVE flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Now that we can't have any dirty dquots around that aren't in the AIL we
can get rid of the explicit dquot syncing from xfssyncd and xfs_fs_sync_fs
and instead rely on AIL pushing to write out any quota updates.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The delaylog mode has been the default for a long time, and the nodelaylog
option has been scheduled for removal in Linux 3.3. Remove it and code
only used by it now that we have opened the 3.3 window.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Resolved conflicts:
fs/xfs/xfs_trans_priv.h:
- deleted struct xfs_ail field xa_flags
- kept field xa_log_flush in struct xfs_ail
fs/xfs/xfs_trans_ail.c:
- in xfsaild_push(), in XFS_ITEM_PUSHBUF case, replaced
"flush_log = 1" with "ailp->xa_log_flush++"
Signed-off-by: Alex Elder <aelder@sgi.com>
There is no reason to keep a reference to the inode even if we unlock
it during transaction commit because we never drop a reference between
the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref
back into xfs_trans_ijoin - the third argument decides if an unlock
is needed now.
I'm actually starting to wonder if allowing inodes to be unlocked
at transaction commit really is worth the effort. The only real
benefit is that they can be unlocked earlier when commiting a
synchronous transactions, but that could be solved by doing the
log force manually after the unlock, too.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We now have an i_dio_count filed and surrounding infrastructure to wait
for direct I/O completion instead of i_icount, and we have never needed
to iocount waits for buffered I/O given that we only set the page uptodate
after finishing all required work. Thus remove i_iocount, and replace
the actually needed waits with calls to inode_dio_wait.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently we have a few issues with the way the workqueue code is used to
implement AIL pushing:
- it accidentally uses the same workqueue as the syncer action, and thus
can be prevented from running if there are enough sync actions active
in the system.
- it doesn't use the HIGHPRI flag to queue at the head of the queue of
work items
At this point I'm not confident enough in getting all the workqueue flags and
tweaks right to provide a perfectly reliable execution context for AIL
pushing, which is the most important piece in XFS to make forward progress
when the log fills.
Revert back to use a kthread per filesystem which fixes all the above issues
at the cost of having a task struct and stack around for each mounted
filesystem. In addition this also gives us much better ways to diagnose
any issues involving hung AIL pushing and removes a small amount of code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently we always redirty an inode that was attempted to be written out
synchronously but has been cleaned by an AIL pushed internall, which is
rather bogus. Fix that by doing the i_update_core check early on and
return 0 for it. Also include async calls for it, as doing any work for
those is just as pointless. While we're at it also fix the sign for the
EIO return in case of a filesystem shutdown, and fix the completely
non-sensical locking around xfs_log_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
(cherry picked from commit 297db93bb74cf687510313eb235a7aec14d67e97)
Signed-off-by: Alex Elder <aelder@sgi.com>
Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
annoying subdirectories in the XFS source code. Besides the large
amount of file rename the only changes are to the Makefile, a few
files including headers with the subdirectory prefix, and the binary
sysctl compat code that includes a header under fs/xfs/ from
kernel/.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>