dlm protocol 1.1. activates messages DLM_QUERY_REGION and DLM_QUERY_NODEINFO
that are a must for global heartbeat.
It also activates o2hb_global_heartbeat_active().
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
This BUG is there since the first submit of the code, but only triggered
in last Kernel. It's timing related do to the asynchronous object-creation
behaviour of exofs. (Which should be investigated farther)
The bug is obvious hence the fixed.
Signed-off-by: Boaz Harrosh <Boaz Harrosh bharrosh@panasas.com>
Since commits 990cb8acf2 and cc92c28b2d, it is possible to have full
address space layout randomization (ASLR) on ARM. Except that one small
change was missing for ASLR of PIE executables.
Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
We need to update the issue_seq on any grant operation, be it via an MDS
reply or a separate grant message. The update in the grant path was
missing. This broke cap release for inodes in which the MDS sent an
explicit grant message that was not soon after followed by a successful
MDS reply on the same inode.
Also fix the signedness on seq locals.
Signed-off-by: Sage Weil <sage@newdream.net>
If an MDS tries to revoke caps that we don't have, we want to send
releases early since they probably contain the caps message the MDS
is looking for.
Previously, we only sent the messages if we didn't have the inode either. But
in a multi-mds system we can retain the inode after dropping all caps for
a single MDS.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
encode_fh on error should update max_len with minimum required
size, so that caller can redo the call with the reallocated buffer.
This is required with open by handle patch series
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sage Weil <sage@newdream.net>
encode_fh function should return 255 on error as done by other file
system to indicate EOVERFLOW. Also max_len is in sizeof(u32) units
and not in bytes.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If we interrupt an osd request, we call __cancel_request, but it wasn't
verifying that req->r_osd was non-NULL before dereferencing it. This could
cause a crash if osds were flapping and we aborted a request on said osd.
Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
When marking an inode reclaimable, a per-AG counter is increased, the
inode is tagged reclaimable in its per-AG tree, and, when this is the
first reclaimable inode in the AG, the AG entry in the per-mount tree
is also tagged.
When an inode is finally reclaimed, however, it is only deleted from
the per-AG tree. Neither the counter is decreased, nor is the parent
tree's AG entry untagged properly.
Since the tags in the per-mount tree are not cleared, the inode
shrinker iterates over all AGs that have had reclaimable inodes at one
point in time.
The counters on the other hand signal an increasing amount of slab
objects to reclaim. Since "70e60ce xfs: convert inode shrinker to
per-filesystem context" this is not a real issue anymore because the
shrinker bails out after one iteration.
But the problem was observable on a machine running v2.6.34, where the
reclaimable work increased and each process going into direct reclaim
eventually got stuck on the xfs inode shrinking path, trying to scan
several million objects.
Fix this by properly unwinding the reclaimable-state tracking of an
inode when it is reclaimed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@kernel.org
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
This patch adds a per region debugfs file that shows the elapsed time
since the time the o2hb timer was last armed.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
This patch creates debugfs directory for each o2hb region and creates
files to expose the region number and the per region live node bitmap.
This information will be useful in debugging cluster issues.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
This patch prints the bitmaps of live, quorum and failed regions. This
information will be useful in debugging cluster issues.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
In global heartbeat mode, we track the bitmap of regions that have seen
heartbeat timeouts. We fence if the number of such regions is greater than
or equal to half the number of quorum regions.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
o2hb allows online adding of regions. However, a newly added region is not
used in quorum calculations unless it has been added on all nodes. This patch
tracks a bitmap of such quorum regions.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
In global heartbeat mode, we have a upper limit for the number of active regions.
This patch adds the facility to track the number of active global heartbeat
regions and fails to start heartbeat if the number exceeds the maximum.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Currently we track a global livenode bitmap that keeps track of all nodes
that are heartbeating in all regions.
This patch adds the ability to track the livenode bitmap on a per region basis.
We will use this facility in a later patch to allow us to withstand the loss of
a minority number of regions.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
o2hb currently checks slots for configured nodes only. This patch makes
it check the slots for the live nodes too to take care of a race in which
a node is removed from the configuration but not from the live map.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Prints messages when the user adds or removes heartbeat regions in global
heartbeat mode. These messages are useful when debugging cluster related issues.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Adds new dlm message DLM_QUERY_NODEINFO that sends the attributes of all
registered nodes. This message is sent if the negotiated dlm protocol is
1.1 or higher. If the information of the joining node does not match
that of any existing nodes, the join domain request is rejected.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
In global heartbeat mode, the heartbeat is started by the user. This patch
prints an error if the user attempts to mount a volume without starting the
heartbeat.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Adds new dlm message DLM_QUERY_REGION that sends the names of all active
heartbeat regions. This message is only sent in the global heartbeat
mode. If the regions in the joining node do not fully match the ones in
the active nodes, the join domain request is rejected.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Export function in o2hb to get a list of heartbeat regions. It also adds an
upper limit to the length of the heartbeat region name.
o2hb_global_heartbeat_active() currently disables global heartbeat. It will
be enabled in a later patch after all the code is added.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Adds support for heartbeat=global mount option. It ensures that the heartbeat
mode passed matches the one enabled on disk.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
both userspace and o2cb cluster stacks. It also allows us to extend cluster
info to include stack flags.
This patch also adds stackflags to sb->s_clusterinfo. It also introduces a
clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled
global heartbeat mode.
This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Add heartbeat mode parameter to the configfs tree. This will be used
to set/show the heartbeat mode. The user is free to toggle the mode
between local and global as long as there is no active heartbeat region.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Mostly the glock operations follow the type of the glock. The
one exception is the transaction glock, so we need to check for
that directly.
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We currently use struct backing_dev_info for various different purposes.
Originally it was introduced to describe a backing device which includes
an unplug and congestion function and various bits of readahead information
and VM-relevant flags. We're also using for tracking dirty inodes for
writeback.
To make writeback properly find all inodes we need to only access the
per-filesystem backing_device pointed to by the superblock in ->s_bdi
inside the writeback code, and not the instances pointeded to by
inode->i_mapping->backing_dev which can be overriden by special devices
or might not be set at all by some filesystems.
Long term we should split out the writeback-relevant bits of struct
backing_device_info (which includes more than the current bdi_writeback)
and only point to it from the superblock while leaving the traditional
backing device as a separate structure that can be overriden by devices.
The one exception for now is the block device filesystem which really
wants different writeback contexts for it's different (internal) inodes
to handle the writeout more efficiently. For now we do this with
a hack in fs-writeback.c because we're so late in the cycle, but in
the future I plan to replace this with a superblock method that allows
for multiple writeback contexts per filesystem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
fs/fuse/dev.c:1357: warning: ‘total_len’ may be used uninitialized in this
function
Initialize total_len to zero, else its value will be undefined.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: prevent infinite recursion in cifs_reconnect_tcon
cifs: set backing_dev_info on new S_ISREG inodes
Having the limits file world readable will ease the task of system
management on systems where root privileges might be restricted.
Having admin restricted with root priviledges, he/she could not check
other users process' limits.
Also it'd align with most of the /proc stat files.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Eugene Teo <eugene@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cifs_reconnect_tcon is called from smb_init. After a successful
reconnect, cifs_reconnect_tcon will call reset_cifs_unix_caps. That
function will, in turn call CIFSSMBQFSUnixInfo and CIFSSMBSetFSUnixInfo.
Those functions also call smb_init.
It's possible for the session and tcon reconnect to succeed, and then
for another cifs_reconnect to occur before CIFSSMBQFSUnixInfo or
CIFSSMBSetFSUnixInfo to be called. That'll cause those functions to call
smb_init and cifs_reconnect_tcon again, ad infinitum...
Break the infinite recursion by having those functions use a new
smb_init variant that doesn't attempt to perform a reconnect.
Reported-and-Tested-by: Michal Suchanek <hramrach@centrum.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When renaming over a directory we need to use hfsplus_rmdir instead of
hfsplus_unlink to evict the victim. This makes sure we properly error out
on non-empty directory as required by Posix (BZ #16571), and it also makes
sure we do the right thing in case i_nlink will every be set correctly for
directories on hfsplus.
Reported-by: Vlado Plaga <rechner@vlado-do.de>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Most of the extent handling code already does proper SMP locking, but
hfsplus_write_inode was calling into hfsplus_ext_write_extent without
taking the extents_lock. Fix this by splitting hfsplus_ext_write_extent
into an internal helper that expects the lock, and a public interface
that first acquires it.
Also add a few locking asserts and document the locking rules in
hfsplus_fs.h.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
We already have i_mutex for readdir and the namespace operations that add
entries to open_dir_list, the only thing that was missing was the removal
in hfsplus_dir_release.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
The flags in the HFS+-specific superlock do get modified during runtime,
use atomic bitops to make the modifications SMP safe.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Lock updates to the mutal fields in the volume header, and document the
locing in the hfsplus_sb_info structure.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
We never walk the list - the only reason for it is to make the resource fork
inodes appear hashed to the writeback code. Borrow a trick from JFS to do
that without needing a list head.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
We never look at it, nor change the next_alloc field in the superblock. So
don't bother caching it or writing it out in hfsplus_sync_fs.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Add a new hfsplus_system_write_inode for writing the special system inodes
and streamline the fastpath write_inode code.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Add a new hfsplus_system_read_inode for reading the special system inodes
and streamline the fastpath iget code.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
HFSPLUS_I doesn't return a pointer to the hfsplus-specific inode
information like all other FOO_I macros, but dereference the pointer in a way
that made it look like a direct struct derefence. This only works as long
as the HFSPLUS_I macro is used directly and prevents us from keepig a local
hfsplus_inode_info pointer. Fix the calling convention and introduce a local
hip variable in all functions that use it constantly.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
HFSPLUS_SB doesn't return a pointer to the hfsplus-specific superblock
information like all other FOO_SB macros, but dereference the pointer in a way
that made it look like a direct struct derefence. This only works as long
as the HFSPLUS_SB macro is used directly and prevents us from keepig a local
hfsplus_sb_info pointer. Fix the calling convention and introduce a local
sbi variable in all functions that use it constantly.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Except for ->put_super the BKL is now gone from HFS, which means it's
superflous there too as ->put_super is serialized by the VFS.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Use alloc_mutex to protect hfsplus_sync_fs against itself and concurrent
allocations, which allows to get rid of lock_super in hfsplus.
Note that most fields in the superblock still aren't protected against
concurrent allocations, that will follow later.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Use a new per-sb alloc_mutex instead of abusing i_mutex of the alloc_file
to protect block allocations. This gets rid of lockdep nesting warnings
and prepares for extending the scope of alloc_mutex.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Use i_mutex for protecting against concurrent setflags ioctls like in
other filesystems and get rid of the BKL in hfsplus_ioctl.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
Currenly the HFSPLUS_IOC_EXT2_GETFLAGS case never unlocks the BKL, which
can lead to easily reproduced lockups when doing multiple GETFLAGS ioctls.
Fix this by only taking the BKL for the HFSPLUS_IOC_EXT2_SETFLAGS case
as neither HFSPLUS_IOC_EXT2_GETFLAGS not the default error case needs it.
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
This patch fixes a GFS2 problem whereby the first rename after a
mount can result in a file system consistency error being flagged
improperly and cause the file system to withdraw. The problem is
that the rename code tries to run the rgrp list with function
gfs2_blk2rgrpd before the rgrp list is guaranteed to be read in
from disk. The patch makes the rename function hold the rindex
glock (as the gfs2_unlink code does today) which reads in the rgrp
list if need be. There were a total of three places in the rename
code that improperly referenced the rgrp list without the rindex
glock and this patch fixes all three.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
ocfs2 fast symlinks are NUL terminated strings stored inline in the
inode data area. However, disk corruption or a local attacker could, in
theory, remove that NUL. Because we're using strlen() (my fault,
introduced in a731d1 when removing vfs_follow_link()), we could walk off
the end of that string.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
Testing on very recent kernel (2.6.36-rc6) made this warning pop:
WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x65/0x70()
Hardware name:
Dirtiable inode bdi default != sb bdi cifs
...the following patch fixes it and seems to be the obviously correct
thing to do for cifs.
Cc: stable@kernel.org
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Recently a feature was added to GFS2 to allow journal id allocation
via sysfs. This patch builds upon that so that a negative journal id
will be treated as an error code to be passed back as the return code
from mount. This allows termination of the mount process if there is
a failure.
Also, the process has been updated so that the kernel will wait
for a journal id, even in the "spectator" case. This is required
in order to avoid mounting a filesystem in case there is an error
while joining the cluster. In the spectator case, 0 is written into
the file to indicate that all is well, and that mount should continue.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
XFS supports the "norecovery" mount option which is basically the
same as the GFS2 spectator mode. This adds support for "norecovery"
as a synonym for spectator mode, which is hopefully a more obvious
description of what it actually does.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The tests further down the recovery function relating to
unlocking the journal need to be updated to match the
intial test. Also, a test in the umount code which was
surplus to requirements has been removed. Umounting
spectator mounts now works correctly, as expected.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I have been seeing occasional pauses in transaction throughput up to
30s long under heavy parallel workloads. The only notable thing was
that the xfsaild was trying to be active during the pauses, but
making no progress. It was running exactly 20 times a second (on the
50ms no-progress backoff), and the number of pushbuf events was
constant across this time as well. IOWs, the xfsaild appeared to be
stuck on buffers that it could not push out.
Further investigation indicated that it was trying to push out inode
buffers that were pinned and/or locked. The xfsbufd was also getting
woken at the same frequency (by the xfsaild, no doubt) to push out
delayed write buffers. The xfsbufd was not making any progress
because all the buffers in the delwri queue were pinned. This scan-
and-make-no-progress dance went one in the trace for some seconds,
before the xfssyncd came along an issued a log force, and then
things started going again.
However, I noticed something strange about the log force - there
were way too many IO's issued. 516 log buffers were written, to be
exact. That added up to 129MB of log IO, which got me very
interested because it's almost exactly 25% of the size of the log.
He delayed logging code is suppose to aggregate the minimum of 25%
of the log or 8MB worth of changes before flushing. That's what
really puzzled me - why did a log force write 129MB instead of only
8MB?
Essentially what has happened is that no CIL pushes had occurred
since the previous tail push which cleared out 25% of the log space.
That caused all the new transactions to block because there wasn't
log space for them, but they kick the xfsaild to push the tail.
However, the xfsaild was not making progress because there were
buffers it could not lock and flush, and the xfsbufd could not flush
them because they were pinned. As a result, both the xfsaild and the
xfsbufd could not move the tail of the log forward without the CIL
first committing.
The cause of the problem was that the background CIL push, which
should happen when 8MB of aggregated changes have been committed, is
being held off by the concurrent transaction commit load. The
background push does a down_write_trylock() which will fail if there
is a concurrent transaction commit holding the push lock in read
mode. With 8 CPUs all doing transactions as fast as they can, there
was enough concurrent transaction commits to hold off the background
push until tail-pushing could no longer free log space, and the halt
would occur.
It should be noted that there is no reason why it would halt at 25%
of log space used by a single CIL checkpoint. This bug could
definitely violate the "no transaction should be larger than half
the log" requirement and hence result in corruption if the system
crashed under heavy load. This sort of bug is exactly the reason why
delayed logging was tagged as experimental....
The fix is to start blocking background pushes once the threshold
has been exceeded. Rework the threshold calculations to keep the
amount of log space a CIL checkpoint can use to below that of the
AIL push threshold to avoid the problem completely.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This shouldn't really be required, but gcc can't tell that
"al" is only accessed when initialised.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Some of the functions in GFS2 were not reserving space in the transaction for
the resource group header and the resource groups bitblocks that get added
when you do allocation. GFS2 now makes sure to reserve space for the
resource group header and either all the bitblocks in the resource group, or
one for each block that it may allocate, whichever is smaller using the new
gfs2_rg_blocks() inline function.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When checking journals for spectator mounts, we cannot rely on the
journal being locked, whatever its jid might be. This patch
ensures that we always get the journal locks when checking
journals for a spectator mount.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
o2dlm: force free mles during dlm exit
ocfs2: Sync inode flags with ext2.
ocfs2: Move 'wanted' into parens of ocfs2_resmap_resv_bits.
ocfs2: Use cpu_to_le16 for e_leaf_clusters in ocfs2_bg_discontig_add_extent.
ocfs2: update ctime when changing the file's permission by setfacl
ocfs2/net: fix uninitialized ret in o2net_send_message_vec()
Ocfs2: Handle empty list in lockres_seq_start() for dlmdebug.c
Ocfs2: Re-access the journal after ocfs2_insert_extent() in dxdir codes.
ocfs2: Fix lockdep warning in reflink.
ocfs2/lockdep: Move ip_xattr_sem out of ocfs2_xattr_get_nolock.
This option has never done anything useful. Also at the same time
this cleans up the sb checks which are done at mount time. The
debug option will be accepted, but ignored in future. Since it
didn't do anything, there didn't seem much point in retaining it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
While umounting, a block mle doesn't get freed if dlm is shutdown after
master request is received but before assert master. This results in unclean
shutdown of dlm domain.
This patch frees all mles that lie around after other nodes were notified about
exiting the dlm and marking dlm state as leaving. Only block mles are expected
to be around, so we log ERROR for other mles but still free them.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We sync our inode flags with ext2 and define them by hex
values. But actually in commit 3669567(4 years ago), all
these values are moved to include/linux/fs.h. So we'd
better also use them as what ext2 did. So sync our inode
flags with ext2 by using FS_*.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The first time I read the function ocfs2_resmap_resv_bits, I consider
about what 'wanted' will be used and consider about the comments.
Then I find it is only used if the reservation is empty. ;)
So we'd better move it to the parens so that it make the code more
readable, what's more, ocfs2_resmap_resv_bits is used so frequently
and we should save some cpus.
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
e_leaf_clusters is a le16, so use cpu_to_le16 instead
of cpu_to_le32.
What's more, we change 'clusters' to unsigned int to
signify that the size of 'clusters' isn't important here.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In commit 30e2bab, ext3 fixed it. So change it accordingly in ocfs2.
Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1283760364
# setfacl -m 'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1283760364
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This option defaulted to on for lock_nolock mounts and off
otherwise. The only function was to avoid the revalidation of
dentries. In the cluster case, that is entirely pointless and
liable to cause coherency problems.
The patch changes the revalidation to depend upon whether the
fs is a local or cluster fs (i.e. it follows the existing default
behaviour). I very much doubt anybody ever used this option as
there is no reason to. Even so we will continue to accept it
on the mount command line, but ignore it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is been a no-op for a very long time now. I'm pretty sure
nobody uses it, but just in case we'll still accept it on the
command line, but ignore it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
OCFS2 can return ERESTARTSYS from its write function when the process is
signalled while waiting for a cluster lock (and the filesystem is mounted
with intr mount option). Generally, it seems reasonable to allow
filesystems to return this error code from its IO functions. As we must
not leak ERESTARTSYS (and similar error codes) to userspace as a result of
an AIO operation, we have to properly convert it to EINTR inside AIO code
(restarting the syscall isn't really an option because other AIO could
have been already submitted by the same io_submit syscall).
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 73296bc611 ("procfs: Use generic_file_llseek in /proc/vmcore")
broke seeking on /proc/vmcore. This changes it back to use default_llseek
in order to restore the original behaviour.
The problem with generic_file_llseek is that it only allows seeks up to
inode->i_sb->s_maxbytes, which is zero on procfs and some other virtual
file systems. We should merge generic_file_llseek and default_llseek some
day and clean this up in a proper way, but for 2.6.35/36, reverting vmcore
is the safer solution.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: CAI Qian <caiqian@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In 32-bit compatibility mode, the error handling for
compat_do_readv_writev() may free an uninitialized pointer, potentially
leading to all sorts of ugly memory corruption. This is reliably
triggerable by unprivileged users by invoking the readv()/writev()
syscalls with an invalid iovec pointer. The below patch fixes this to
emulate the non-compat version.
Introduced by commit b83733639a ("compat: factor out
compat_rw_copy_check_uvector from compat_do_readv_writev")
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Cc: stable@kernel.org (2.6.35)
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
bdi: Fix warnings in __mark_inode_dirty for /dev/zero and friends
char: Mark /dev/zero and /dev/kmem as not capable of writeback
bdi: Initialize noop_backing_dev_info properly
cfq-iosched: fix a kernel OOPs when usb key is inserted
block: fix blk_rq_map_kern bio direction flag
cciss: freeing uninitialized data on error path
Inodes of devices such as /dev/zero can get dirty for example via
utime(2) syscall or due to atime update. Backing device of such inodes
(zero_bdi, etc.) is however unable to handle dirty inodes and thus
__mark_inode_dirty complains. In fact, inode should be rather dirtied
against backing device of the filesystem holding it. This is generally a
good rule except for filesystems such as 'bdev' or 'mtd_inodefs'. Inodes
in these pseudofilesystems are referenced from ordinary filesystem
inodes and carry mapping with real data of the device. Thus for these
inodes we have to use inode->i_mapping->backing_dev_info as we did so
far. We distinguish these filesystems by checking whether sb->s_bdi
points to a non-trivial backing device or not.
Example: Assume we have an ext3 filesystem on /dev/sda1 mounted on /.
There's a device inode A described by a path "/dev/sdb" on this
filesystem. This inode will be dirtied against backing device "8:0"
after this patch. bdev filesystem contains block device inode B coupled
with our inode A. When someone modifies a page of /dev/sdb, it's B that
gets dirtied and the dirtying happens against the backing device "8:16".
Thus both inodes get filed to a correct bdi list.
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
These devices don't do any writeback but their device inodes still can get
dirty so mark bdi appropriately so that bdi code does the right thing and files
inodes to lists of bdi carrying the device inodes.
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: select CRYPTO
ceph: check mapping to determine if FILE_CACHE cap is used
ceph: only send one flushsnap per cap_snap per mds session
ceph: fix cap_snap and realm split
ceph: stop sending FLUSHSNAPs when we hit a dirty capsnap
ceph: correctly set 'follows' in flushsnap messages
ceph: fix dn offset during readdir_prepopulate
ceph: fix file offset wrapping at 4GB on 32-bit archs
ceph: fix reconnect encoding for old servers
ceph: fix pagelist kunmap tail
ceph: fix null pointer deref on anon root dentry release
Rather than calculating the qstrs for . and .. each time
we need them, its better to keep a constant version of
these and just refer to them when required.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
The recovery workqueue can be freezable since
we want it to finish what it is doing if the system is to
be frozen (although why you'd want to freeze a cluster node
is beyond me since it will result in it being ejected from
the cluster). It does still make sense for single node
GFS2 filesystems though.
The glock workqueue will benefit from being able to run more
work items concurrently. A test running postmark shows
improved performance and multi-threaded workloads are likely
to benefit even more. It needs to be high priority because
the latency directly affects the latency of filesystem glock
operations.
The delete workqueue is similar to the recovery workqueue in
that it must not get blocked by memory allocations, and may
run for a long time.
Potentially other GFS2 threads might also be converted to
workqueues, but I'll leave that for a later patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
GFS2's idea of which return codes it needs to handle was based
upon those listed in dlm.h. Those didn't cover all the possible
codes and listed some which never happen. This updates GFS2 to
handle all the codes which can actually be returned from the
DLM under various circumstances.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Due to the design of the VFS, it is quite usual for operations on GFS2
to consist of a lookup (requiring a shared lock) followed by an
operation requiring an exclusive lock. If a remote node has cached an
exclusive lock, then it will receive two demote events in rapid succession
firstly for a shared lock and then to unlocked. The existing min hold time
code was triggering in this case, even if the node was otherwise idle
since the state change time was being updated by the initial demote.
This patch introduces logic to skip the min hold timer in the case that
a "double demote" of this kind has occurred. The min hold timer will
still be used in all other cases.
A new glock flag is introduced which is used to keep track of whether
there have been any newly queued holders since the last glock state
change. The min hold time is only applied if the flag is set.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
This patch adds support for fallocate to gfs2. Since the gfs2 does not support
uninitialized data blocks, it must write out zeros to all the blocks. However,
since it does not need to lock any pages to read from, gfs2 can write out the
zero blocks much more efficiently. On a moderately full filesystem, fallocate
works around 5 times faster on average. The fallocate call also allows gfs2 to
add blocks to the file without changing the filesize, which will make it
possible for gfs2 to preallocate space for the rindex file, so that gfs2 can
grow a completely full filesystem.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds a check to ensure that if we reach the block allocator
that we don't try and proceed if there is no alloc structure
hanging off the inode. This should only happen if there is a bug
in GFS2. The error return code is distinctive in order that it
will be easily spotted.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With the update of the truncate code, ip->i_disksize and
inode->i_size are merely copies of each other. This means
we can remove ip->i_disksize and use inode->i_size exclusively
reducing the size of a GFS2 inode by 8 bytes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>