This patch prevents the printing of a warning message in cases where
the fs is functioning normally by handing off responsibility for
unlinked, but still open inodes, to another node for eventual deallocation.
Also, there is now an improved system for ensuring that such requests
to other nodes do not get lost. The callback on the iopen lock is
only ever called when i_nlink == 0 and when a node is unable to deallocate
it due to it still being in use on another node. When a node receives
the callback therefore, it knows that i_nlink must be zero, so we mark
it as such (in gfs2_drop_inode) in order that it will then attempt
deallocation of the inode itself.
As an additional benefit, queuing a demote request no longer requires
a memory allocation. This simplifies the code for dealing with gfs2_holders
as it removes one special case.
There are two new fields in struct gfs2_glock. gl_demote_state is the
state which the remote node has requested and gl_demote_time is the
time when the request came in. Both fields are only valid when the
GLF_DEMOTE flag is set in gl_flags.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If you specify an invalid mount option when trying to mount a gfs2 filesystem,
gfs2 will oops. The attached patch resolves this problem.
Signed-off-by: Josef Whiter <jwhiter@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The attached patch resolves bz 228540. This adds the capability
for gfs2 to dump gfs2 locks through the debugfs file system.
This used to exist in gfs1 as "gfs_tool lockdump" but it's missing from
gfs2 because all the ioctls were stripped out. Please see the bugzilla
for more history about the fix. This patch is also attached to the bugzilla
record.
The patch is against Steve Whitehouse's latest nmw git tree kernel
(2.6.21-rc1) and has been tested on system trin-10.
Signed-off-by: Robert Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
fs/gfs2/glock.c:2198: error: 'THIS_MODULE' undeclared here (not in a function)
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dave Teigland fixed this bug a while back, but I managed to mistakenly
remove the semaphore during later development. It is required to avoid
the list of inodes changing during an invalidate_inodes call. I have
made it an rwsem since the read side will be taken frequently during
normal filesystem operation. The write site will only happen during
umount of the file system.
Also the bug only triggers when using the DLM lock manager and only then
under certain conditions as its timing related.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Teigland <teigland@redhat.com>
This function is not longer required since we do not do recursive
locking in the glock layer. As a result all its callers can be
replaceed with list_empty() calls.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch doesn't make any changes to the ordering of the various
operations related to glocking, but it does tidy up the calls to the
glops.c functions to make the structure more obvious.
The two functions: gfs2_glock_xmote_th() and gfs2_glock_drop_th() can be
made static within glock.c since they are called by every set of glock
operations. The xmote_th and drop_th glock operations are then made
conditional upon those two routines existing and called from the
previously mentioned functions in glock.c respectively.
Also it can be seen that the go_sync operation isn't needed since it can
easily be replaced by calls to xmote_bh and drop_bh respectively. This
results in no longer (confusingly) calling back into routines in glock.c
from glops.c and also reducing the glock operations by one member.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Here is a patch for GFS2 to remove the local exclusive flag. In
the places it was used, mutex's are always held earlier in the
call path, so it appears redundant in the LM_ST_SHARED case.
Also, the GFS2 holders were setting local exclusive in any case where
the requested lock was LM_ST_EXCLUSIVE. So the other places in the glock
code where the flag was tested have been replaced with tests for the
lock state being LM_ST_EXCLUSIVE in order to ensure the logic is the
same as before (i.e. LM_ST_EXCLUSIVE is always locally exclusive as well
as globally exclusive).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The "greedy" code was an attempt to retain glocks for a minimum length
of time when they relate to mmap()ed files. The current implementation
of this feature is not, however, ideal in that it required allocating
memory in order to do this and its overly complicated.
It also misses the mark by ignoring the other I/O operations which are
just as likely to suffer from the same problem. So the plan is to remove
this now and then add the functionality back as part of the glock state
machine at a later date (and thus take into account all the possible
users of this feature)
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Here is something I spotted (while looking for something entirely
different) the other day.
Rather than using a completion in each and every struct gfs2_holder,
this removes it in favour of hashed wait queues, thus saving a
considerable amount of memory both on the stack (where a number of
gfs2_holder structures are allocated) and in particular in the
gfs2_inode which has 8 gfs2_holder structures embedded within it.
As a result on x86_64 the gfs2_inode shrinks from 2488 bytes to
1912 bytes, a saving of 576 bytes per inode (no thats not a typo!).
In actual practice we get a much better result than that since
now that a gfs2_inode is under the 2048 byte barrier, we get two
per 4k slab page effectively halving the amount of memory required
to store gfs2_inodes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This removes the extra filldir callback which gfs2 was using to
enclose an attempt at readahead for inodes during readdir. The
code was too complicated and also hurts performance badly in the
case that the getdents64/readdir call isn't being followed by
stat() and it wasn't even getting it right all the time when it
was.
As a result, on my test box an "ls" of a directory containing 250000
files fell from about 7mins (freshly mounted, so nothing cached) to
between about 15 to 25 seconds. When the directory content was cached,
the time taken fell from about 3mins to about 4 or 5 seconds.
Interestingly in the cached case, running "ls -l" once reduced the time
taken for subsequent runs of "ls" to about 6 secs even without this
patch. Now it turns out that there was a special case of glocks being
used for prefetching the metadata, but because of the timeouts for these
locks (set to 10 secs) the metadata was being timed out before it was
being used and this the prefetch code was constantly trying to prefetch
the same data over and over.
Calling "ls -l" meant that the inodes were brought into memory and once
the inodes are cached, the glocks are not disposed of until the inodes
are pushed out of the cache, thus extending the lifetime of the glocks,
and thus bringing down the time for subsequent runs of "ls"
considerably.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* master.kernel.org:/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (73 commits)
[DLM] Clean up lowcomms
[GFS2] Change gfs2_fsync() to use write_inode_now()
[GFS2] Fix indent in recovery.c
[GFS2] Don't flush everything on fdatasync
[GFS2] Add a comment about reading the super block
[GFS2] Mount problem with the GFS2 code
[GFS2] Remove gfs2_check_acl()
[DLM] fix format warnings in rcom.c and recoverd.c
[GFS2] lock function parameter
[DLM] don't accept replies to old recovery messages
[DLM] fix size of STATUS_REPLY message
[GFS2] fs/gfs2/log.c:log_bmap() fix printk format warning
[DLM] fix add_requestqueue checking nodes list
[GFS2] Fix recursive locking in gfs2_getattr
[GFS2] Fix recursive locking in gfs2_permission
[GFS2] Reduce number of arguments to meta_io.c:getbuf()
[GFS2] Move gfs2_meta_syncfs() into log.c
[GFS2] Fix journal flush problem
[GFS2] mark_inode_dirty after write to stuffed file
[GFS2] Fix glock ordering on inode creation
...
Fix function parameter typing:
fs/gfs2/glock.c💯 warning: function declaration isn't a prototype
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This fixes a bug which resulted in poor performance due to flushing
the journal too often. The code path in question was via the inode_go_sync()
function in glops.c. The solution is not to flush the journal immediately
when inodes are ejected from memory, but batch up the work for glockd to
deal with later on. This means that glocks may now live on beyond the end of
the lifetime of their inodes (but not very much longer in the normal case).
Also fixed in this patch is a bug (which was hidden by the bug mentioned above) in
calculation of the number of free journal blocks.
The gfs2_logd process has been altered to be more responsive to the journal
filling up. We now wake it up when the number of uncommitted journal blocks
has reached the threshold level rather than trying to flush directly at the
end of each transaction. This again means doing fewer, but larger, log
flushes in general.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The go_sync callback took two flags, but one of them was set on every
call, so this patch removes once of the flags and makes the previously
conditional operations (on this flag), unconditional.
The go_inval callback took three flags, each of which was set on every
call to it. This patch removes the flags and makes the operations
unconditional, which makes the logic rather more obvious.
Two now unused flags are also removed from incore.h.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Change from GFP_KERNEL to GFP_NOFS as this was causing a
slow down when trying to push inodes from cache.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There is no way to set the GL_DUMP flag, and in any case the
same thing can be done with systemtap if required for debugging,
so this removes it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This removes the duplicate di_mode field in favour of using the
inode->i_mode field. This saves 4 bytes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
lm_interface.h has a few out of the tree clients such as GFS1
and userland tools.
Right now, these clients keeps a copy of the file in their build tree
that can go out of sync.
Move lm_interface.h to include/linux, export it to userland and
clean up fs/gfs2 to use the new location.
Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A one liner bug fix to prevent the return value being
wrong when more than one superblock is mounted.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Use atomic_t as the ref count in glocks rather than a kref.
This is another step towards using RCU for the glock hash.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This results in smaller list heads, so that we can have more chains
in the same amount of memory (twice as many). I've multiplied the
size of the table by four though - this is because we are saving
memory by not having one lock per chain any more. So we land up
using about the same amount of memory for the hash table as we
did before I started these changes, the difference being that we
now have four times as many hash chains.
The reason that I say "about the same amount of memory" is that the
actual amount now depends upon the NR_CPUS and some of the config
variables, so that its not exact and in some cases we do use more
memory. Eventually we might want to scale the hash table size
according to the size of physical ram as measured on module load.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The existing implementation of this function in glock.c was not
very efficient as it relied upon keeping a cursor element upon the
hash chain in question and moving it along. This new version improves
upon this by using the current element as a cursor. This is possible
since we only look at the "next" element in the list after we've
taken the read_lock() subsequent to calling the examiner function.
Obviously we have to eventually drop the ref count that we are then
left with and we cannot do that while holding the read_lock, so we
do that next time we drop the lock. That means either just before
we examine another glock, or when the loop has terminated.
The new implementation has several advantages: it uses only a
read_lock() rather than a write_lock(), so it can run simnultaneously
with other code, it doesn't need a "plug" element, so that it removes
a test not only from this list iterator, but from all the other glock
list iterators too. So it makes things faster and smaller.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Add back the consts which were casted away in the glock sorting
function. Also add early exit code.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Make the number of locks used for hash chains in glock.c
proportional to NR_CPUS. Also move constants for the number
of hash chains into glock.c from incore.h since they are
not used outside of glock.c.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This splits the rwlocks guarding the hash chains of the glock hash
table into their own array. This will reduce memory usage in some
cases due to better alignment, although the real reason for doing it
is to allow the two tables to be different sizes in future (i.e.
the locks will be sized proportionally with the max number of CPUs
and the hash chains sized proportinally with the size of physical memory)
In order to allow this, the gl_bucket member of struct gfs2_glock has
now become gl_hash, so we record the hash rather than a pointer to the
bucket itself.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
As requested by Jan Engelhardt, this removes the typedefs in the
locking module interface and replaces them with void *. Also
since we are changing the interface, I've added a few consts
as well.
Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Cc: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This removes one of the typedefs from the locking interface. It
is replaced by a forward declaration of the gfs2 superblock. The
other two are not so easy to solve since in their case, they
can refer to one of two possible structures.
Cc: David Teigland <teigland@redhat.com>
Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There are several reasons why we want to do this:
- Firstly its large and thus we'll scale better with multiple
GFS2 fs mounted at the same time
- Secondly its easier to scale its size as required (thats a plan
for later patches)
- Thirdly, we can use kzalloc rather than vmalloc when allocating
the superblock (its now only 4888 bytes)
- Fourth its all part of my plan to eventually be able to use RCU
with the glock hash.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is another patch preparing for sharing of the glock hash
table between different gfs2 mounts.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This makes all fixed size types have consistent names.
Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
As per Jan Engelhardt's second email, this removes some unused code,
and fixes up indenting in various places.
Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
As per comments from Jan Engelhardt <jengelh@linux01.gwdg.de> this
updates the copyright message to say "version" in full rather than
"v.2". Also incore.h has been updated to remove forward structure
declarations which are not required.
The gfs2_quota_lvb structure has now had endianess annotations added
to it. Also quota.c has been updated so that we now store the
lvb data locally in endian independant format to avoid needing
a structure in host endianess too. As a result the endianess
conversions are done as required at various points and thus the
conversion routines in lvb.[ch] are no longer required. I've
moved the one remaining constant in lvb.h thats used into lm.h
and removed the unused lvb.[ch].
I have not changed the HIF_ constants. That is left to a later patch
which I hope will unify the gh_flags and gh_iflags fields of the
struct gfs2_holder.
Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds the superblock as a key for glock lookups. Since the glocks
are already stored in a per-superblock table, this has no effect at
the moment. Later on this will change though.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We can take advantage of the slab allocator to ensure that all the list
heads and the spinlock (plus one or two other fields) are initialised
by slab to speed up allocation of glocks.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Remove the unused sync feature from glocks. This is currently done by
calling the required functions to sync pages/blocks directly so this
code isn't needed.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
For all the usual reasons of enforcing correctness and potentially
reducing code size, this patch makes the glock operations const.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch allows the simultaneous mounting of gfs2meta and gfs2
filesystems. A restriction however is that a gfs2meta fs may only be
mounted if its corresponding gfs2 filesystem is also mounted. Also, a
gfs2 filesystem cannot be unmounted before its gfs2meta filesystem.
Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I noticed the gfs2_scand seemed to be taking a lot of CPU,
so in order to cut that down a bit, here is a patch. Firstly
the type of a glock is a constant during its lifetime, so that
its possible to check this without needing locking. I've moved
the (common) case of testing for an inode glock outside of
the glmutex lock.
Also there was a mutex left over from when the glock cache was
master of the inode cache. That isn't required any more so I've
removed that too.
There is probably scope for further speed ups in the future
in this area.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
recovery.c add a brelse to deal with gfs2_replay_read_block being called
twice on the same block.
add a dput to drop the ref count on the root inode.
This was causing lingering glocks and thus causing
a mount failure to hang.
Fix a endian conversion macro that was was swizzling
16bits when it should have been swizzling 32.
Signed-off-by: Russell Cattelan <cattelan@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This means that we don't need to create a special inode just to contain
a struct address_space in order to read a single disk block. Instead
we read the disk block directly. Its slightly faster, and uses slightly
less memory, but the real reason for doing this is that it removes a
special case from the glock code.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This really is the correct fix this time. We just ignore all
glocks associated with inodes until the inodes are pushed
from the inode cache. At that point the glocks are queued for
reclaim, so we don't need to do it here.
Also fix one or two other minor bugs.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Under certain circumstances the glock scanning logic would
demote locks which ought not to have been selected for
demotion.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This removes one instance of GFP_NOFAIL from the glock callback
function. It also fixes a bug where a , was used at a line end
rather than ; causing unintended results.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes the way we have been dealing with unlinked,
but still open files. It removes all limits (other than memory
for inodes, as per every other filesystem) on numbers of these
which we can support on GFS2. It also means that (like other
fs) its the responsibility of the last process to close the file
to deallocate the storage, rather than the person who did the
unlinking. Note that with GFS2, those two events might take place
on different nodes.
Also there are a number of other changes:
o We use the Linux inode subsystem as it was intended to be
used, wrt allocating GFS2 inodes
o The Linux inode cache is now the point which we use for
local enforcement of only holding one copy of the inode in
core at once (previous to this we used the glock layer).
o We no longer use the unlinked "special" file. We just ignore it
completely. This makes unlinking more efficient.
o We now use the 4th block allocation state. The previously unused
state is used to track unlinked but still open inodes.
o gfs2_inoded is no longer needed
o Several fields are now no longer needed (and removed) from the in
core struct gfs2_inode
o Several fields are no longer needed (and removed) from the in core
superblock
There are a number of future possible optimisations and clean ups
which have been made possible by this patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds some extra debugging to glock.c and changes
inode.c's deallocation code to call the debugging code
at a suitable moment. I'm chasing down a particular bug
to do with deallocation at the moment and the code can
go again once the bug is fixed.
Also this includes the first part of some changes to unify
the Linux struct inode and GFS2's struct gfs2_inode. This
transformation will happen in small parts over the next short
period.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We no longer use semaphores, everything has been converted to
mutex or rwsem, so we don't need to include this header any more.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds readpages support (and also corrects a small bug in
the readpage error path at the same time). Hopefully this will
improve performance by allowing GFS to submit larger lumps of
I/O at a time.
In order to simplify the setting of BH_Boundary, it currently gets
set when we hit the end of a indirect pointer block. There is
always a boundary at this point with the current allocation code.
It doesn't get all the boundaries right though, so there is still
room for improvement in this.
See comments in fs/gfs2/ops_address.c for further information about
readpages with GFS2.
Signed-off-by: Steven Whitehouse
This patch contains the following possible cleanups:
- make needlessly global code static
- #if 0 unused functions
- remove the following global function that was both unused and
unimplemented:
- super.c: gfs2_do_upgrade()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Despite my earlier careful search, there was a recursive lock left
in the deallocation code. This removes it. It also should speed up
deallocation be reducing the number of locking operations which take
place by using two "try lock" operations on the two locks involved in
inode deallocation which allows us to grab the locks out of order
(compared with NFS which grabs the inode lock first and the iopen
lock later). It is ok for us to fail while doing this since if it
does fail it means that someone else is still using the inode and
thus it wouldn't be possible to deallocate anyway.
This fixes the bug reported to me by Rob Kenna.
Cc: Rob Kenna <rkenna@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes the last user of recursive locking so that
it no longer needs this feature and removes it from the glock
layer. This makes the glock code a lot simpler and easier to
understand. Its also a prerequsite to adding support for the
AOP_TRUNCATED_PAGE return code (or at least it is if you don't
want your brain to melt in the process)
I've left in a couple of checks just in case there is some place
else in the code which is still using this feature that I didn't
spot yet, but they can probably be removed long term.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
At some stage, a mutex was added to gfs2_glock_put() without
checking all its call sites. Two of them were called from
under a spinlock causing random delays at various points and
crashes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When allocating memory to sort directory entries, use vmalloc()
rather than kmalloc() since for larger directories, the required
size can easily be graeter than the 128k maximum of kmalloc().
Also adding the first steps towards getting the AOP_TRUNCATED_PAGE
return code get in the glock code by flagging all places where we
request a glock and we are holding a page lock.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Update the debugging code in trans.c and at the same time improve
the debugging code for gfs2_holders. The new code should be pretty
fast during the normal case and provide just as much information
in case of errors (or more).
One small function from glock.c has moved to glock.h as a static inline so
that its return address won't get in the way of the debugging.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
As suggested by Pekka Enberg <penberg@cs.helsinki.fi>.
The DIV_RU macro is renamed DIV_ROUND_UP and and moved to kernel.h
The other macros are gone from gfs2.h as (although not requested
by Pekka Enberg) are a number of included header file which are now
included individually. The inode number comparison function is
now an inline function.
The DT2IF and IF2DT may be addressed in a future patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
All printk calls now have KERN_ set where required and a couple of
kmalloc(), memset(.., 0, ...) calls changed to kzalloc().
This is in response to comments from:
Pekka Enberg <penberg@cs.helsinki.fi> and
Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
As well as a number of minor bug fixes, this patch changes GFS
to use mutices rather than semaphores. This results in better
information in case there are any locking problems.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch contains all the core files for GFS2.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>