The r_mds field is redundant, since we can find the same information at
r_session->s_mds, and when r_session is NULL then r_mds is meaningless.
Signed-off-by: Sage Weil <sage@newdream.net>
This implements the DIRLAYOUTHASH protocol feature, which passes the dir
layout over the wire from the MDS. This gives the client knowledge
of the correct hash function to use for mapping dentries among dir
fragments.
Note that if this feature is _not_ present on the client but is on the
MDS, the client may misdirect requests. This will result in a forward
and degrade performance. It may also result in inaccurate NFS filehandle
generation, which will prevent fh resolution when the inode is not present
in the client cache and the parent directories have been fragmented.
Signed-off-by: Sage Weil <sage@newdream.net>
Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function. This is needed
because the old default Linux dcache hash function is extremely week and
leads to a poor distribution of files among dir fragments.
Signed-off-by: Sage Weil <sage@newdream.net>
Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).
Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Protect d_unhashed(dentry) condition with d_lock. This means keeping
DCACHE_UNHASHED bit in synch with hash manipulations.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
For read operation, we have to set the argument _write_ of get_user_pages
to 1 since we will write data to pages. Also, we need to SetPageDirty before
releasing these pages.
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The fh_to_dentry etc. methods use ceph_init_dentry(), which assumes that
d_parent is defined. It isn't for those callers, so check!
Signed-off-by: Sage Weil <sage@newdream.net>
The user buffer may be 512-byte aligned, not page-aligned. We were
assuming the buffer was page-aligned and only accounting for
non-page-aligned io offsets.
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Fill in the local lock with response data if appropriate,
and don't call posix_lock_file when reading locks.
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Previously the kernel client incorrectly assumed everything was a directory.
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
last may be NULL, but we dereference it in the else branch without
checking. Normally it doesn't trigger because last == NULL when fpos == 2,
but it could happen on a newly opened dir if the user seeks forward.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix readdir EOVERFLOW on 32-bit archs
ceph: fix frag offset for non-leftmost frags
ceph: fix dangling pointer
ceph: explicitly specify page alignment in network messages
ceph: make page alignment explicit in osd interface
ceph: fix comment, remove extraneous args
ceph: fix update of ctime from MDS
ceph: fix version check on racing inode updates
ceph: fix uid/gid on resent mds requests
ceph: fix rdcache_gen usage and invalidate
ceph: re-request max_size if cap auth changes
ceph: only let auth caps update max_size
ceph: fix open for write on clustered mds
ceph: fix bad pointer dereference in ceph_fill_trace
ceph: fix small seq message skipping
Revert "ceph: update issue_seq on cap grant"
One of the readdir filldir_t callers was passing the raw ceph 64-bit ino
instead of the hashed 32-bit one, producing an EOVERFLOW in the filler
callback. Fix this by calling the ceph_vino_to_ino() helper to do the
conversion.
Reported-by: Jan Smets <jan.smets@alcatel-lucent.com>
Tested-by: Jan Smets <jan.smets@alcatel-lucent.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.
Remove this too as a cleanup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We start at offset 2 for the leftmost frag, and 0 for subsequent frags.
When we reach the end (rightmost), we go back to 2. This fixes readdir on
fragmented (large) directories.
Signed-off-by: Sage Weil <sage@newdream.net>
We used to infer alignment of IOs within a page based on the file offset,
which assumed they matched. This broke with direct IO that was not aligned
to pages (e.g., 512-byte aligned IO). We were also trusting the alignment
specified in the OSD reply, which could have been adjusted by the server.
Explicitly specify the page alignment when setting up OSD IO requests.
Signed-off-by: Sage Weil <sage@newdream.net>
The client can have a newer ctime than the MDS due to AUTH_EXCL and
XATTR_EXCL caps as well; update the check in ceph_fill_file_time
appropriately.
This fixes cases where ctime/mtime goes backward under the right sequence
of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
goes to the MDS).
Signed-off-by: Sage Weil <sage@newdream.net>
We may get updates on the same inode from multiple MDSs; generally we only
pay attention if the update is newer than what we already have. The
exception is when an MDS sense unstable information, in which case we
always update.
The old > check got this wrong when our version was odd (e.g. 3) and the
reply version was even (e.g. 2): the older stale (v2) info would be
applied. Fixed and clarified the comment.
Signed-off-by: Sage Weil <sage@newdream.net>
MDS requests can be rebuilt and resent in non-process context, but were
filling in uid/gid from current_fsuid/gid. Put that information in the
request struct on request setup.
This fixes incorrect (and root) uid/gid getting set for requests that
are forwarded between MDSs, usually due to metadata migrations.
Signed-off-by: Sage Weil <sage@newdream.net>
We used to use rdcache_gen to indicate whether we "might" have cached
pages. Now we just look at the mapping to determine that. However, some
old behavior remains from that transition.
First, rdcache_gen == 0 no longer means we have no pages. That can happen
at any time (presumably when we carry FILE_CACHE). We should not reset it
to zero, and we should not check that it is zero.
That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate. If they are
equal, we should invalidate. On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen. Similarly, if we success
in doing a sync invalidate, set revoking = gen - 1. (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)
Signed-off-by: Sage Weil <sage@newdream.net>
If the auth cap migrates to another MDS, clear requested_max_size so that
we resend any pending max_size increase requests. This fixes potential
hangs on writes that extend a file and race with an cap migration between
MDSs.
Signed-off-by: Sage Weil <sage@newdream.net>
Only the auth MDS has a meaningful max_size value for us, so only update it
in fill_inode if we're being issued an auth cap. Otherwise, a random
stat result from a non-auth MDS can clobber a meaningful max_size, get
the client<->mds cap state out of sync, and make writes hang.
Specifically, even if the client re-requests a larger max_size (which it
will), the MDS won't respond because as far as it knows we already have a
sufficiently large value.
Signed-off-by: Sage Weil <sage@newdream.net>
Normally when we open a file we already have a cap, and simply update the
wanted set. However, if we open a file for write, but don't have an auth
cap, that doesn't work; we need to open a new cap with the auth MDS. Only
reuse existing caps if we are opening for read or the existing cap is auth.
Signed-off-by: Sage Weil <sage@newdream.net>
We dereference *in a few lines down, but only set it on rename. It is
apparently pretty rare for this to trigger, but I have been hitting it
with a clustered MDSs.
Signed-off-by: Sage Weil <sage@newdream.net>
This reverts commit d91f2438d8.
The intent of issue_seq is to distinguish between mds->client messages that
(re)create the cap and those that do not, which means we should _only_ be
updating that value in the create paths. By updating it in handle_cap_grant,
we reset it to zero, which then breaks release.
The larger question is what workload/problem made me think it should be
updated here...
Signed-off-by: Sage Weil <sage@newdream.net>
This removes more dead code that was somehow missed by commit 0d99519efe
(writeback: remove unused nonblocking and congestion checks). There are
no behavior change except for the removal of two entries from one of the
ext4 tracing interface.
The nonblocking checks in ->writepages are no longer used because the
flusher now prefer to block on get_request_wait() than to skip inodes on
IO congestion. The latter will lead to more seeky IO.
The nonblocking checks in ->writepage are no longer used because it's
redundant with the WB_SYNC_NONE check.
We no long set ->nonblocking in VM page out and page migration, because
a) it's effectively redundant with WB_SYNC_NONE in current code
b) it's old semantic of "Don't get stuck on request queues" is mis-behavior:
that would skip some dirty inodes on congestion and page out others, which
is unfair in terms of LRU age.
Inspired by Christoph Hellwig. Thanks!
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: Steve French <sfrench@samba.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We were taking dcache_lock inside of i_lock, which introduces a dependency
not found elsewhere in the kernel, complicationg the vfs locking
scalability work. Since we don't actually need it here anyway, remove
it.
We only need i_lock to test for the I_COMPLETE flag, so be careful to do
so without dcache_lock held.
Signed-off-by: Sage Weil <sage@newdream.net>
Convert a sequence of kmalloc and memcpy to use kmemdup.
The semantic patch that performs this transformation is:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression a,flag,len;
expression arg,e1,e2;
statement S;
@@
a =
- \(kmalloc\|kzalloc\)(len,flag)
+ kmemdup(arg,len,flag)
<... when != a
if (a == NULL || ...) S
...>
- memcpy(a,arg,len+1);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning:
fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list
fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want
fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Switch from using the BKL explicitly to the new lock_flocks() interface.
Eventually this will turn into a spinlock.
Signed-off-by: Sage Weil <sage@newdream.net>
When the lock_kernel() turns into lock_flocks() and a spinlock, we won't
be able to do allocations with the lock held. Preallocate space without
the lock, and retry if the lock state changes out from underneath us.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The i_rdcache_gen value only implies we MAY have cached pages; actually
check the mapping to see if it's worth bothering with an invalidate.
Signed-off-by: Sage Weil <sage@newdream.net>
This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph. This
is mostly a matter of moving files around. However, a few key pieces
of the interface change as well:
- ceph_client becomes ceph_fs_client and ceph_client, where the latter
captures the mon and osd clients, and the fs_client gets the mds client
and file system specific pieces.
- Mount option parsing and debugfs setup is correspondingly broken into
two pieces.
- The mon client gets a generic handler callback for otherwise unknown
messages (mds map, in this case).
- The basic supported/required feature bits can be expanded (and are by
ceph_fs_client).
No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.
Signed-off-by: Sage Weil <sage@newdream.net>
Allow the messenger to send/receive data in a bio. This is added
so that we wouldn't need to copy the data into pages or some other buffer
when doing IO for an rbd block device.
We can now have trailing variable sized data for osd
ops. Also osd ops encoding is more modular.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The osd requests creation are being decoupled from the
vino parameter, allowing clients using the osd to use
other arbitrary object names that are not necessarily
vino based. Also, calc_raw_layout now takes a snap id.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Implement a pool lookup by name. This will be used by rbd.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We need to update the issue_seq on any grant operation, be it via an MDS
reply or a separate grant message. The update in the grant path was
missing. This broke cap release for inodes in which the MDS sent an
explicit grant message that was not soon after followed by a successful
MDS reply on the same inode.
Also fix the signedness on seq locals.
Signed-off-by: Sage Weil <sage@newdream.net>
If an MDS tries to revoke caps that we don't have, we want to send
releases early since they probably contain the caps message the MDS
is looking for.
Previously, we only sent the messages if we didn't have the inode either. But
in a multi-mds system we can retain the inode after dropping all caps for
a single MDS.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
encode_fh on error should update max_len with minimum required
size, so that caller can redo the call with the reallocated buffer.
This is required with open by handle patch series
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sage Weil <sage@newdream.net>
encode_fh function should return 255 on error as done by other file
system to indicate EOVERFLOW. Also max_len is in sizeof(u32) units
and not in bytes.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If we interrupt an osd request, we call __cancel_request, but it wasn't
verifying that req->r_osd was non-NULL before dereferencing it. This could
cause a crash if osds were flapping and we aborted a request on said osd.
Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
See if the i_data mapping has any pages to determine if the FILE_CACHE
capability is currently in use, instead of assuming it is any time the
rdcache_gen value is set (i.e., issued -> used).
This allows the MDS RECALL_STATE process work for inodes that have cached
pages.
Signed-off-by: Sage Weil <sage@newdream.net>
Sending multiple flushsnap messages is problematic because we ignore
the response if the tid doesn't match, and the server may only respond to
each one once. It's also a waste.
So, skip cap_snaps that are already on the flushing list, unless the caller
tells us to resend (because we are reconnecting).
Signed-off-by: Sage Weil <sage@newdream.net>
The cap_snap creation/queueing relies on both the current i_head_snapc
_and_ the i_snap_realm pointers being correct, so that the new cap_snap
can properly reference the old context and the new i_head_snapc can be
updated to reference the new snaprealm's context. To fix this, we:
- move inodes completely to the new (split) realm so that i_snap_realm
is correct, and
- generate the new snapc's _before_ queueing the cap_snaps in
ceph_update_snap_trace().
Signed-off-by: Sage Weil <sage@newdream.net>
Stop sending FLUSHSNAP messages when we hit a capsnap that has dirty_pages
or is still writing. We'll send the newer capsnaps only after the older
ones complete.
Signed-off-by: Sage Weil <sage@newdream.net>
The 'follows' should match the seq for the snap context for the given snap
cap, which is the context under which we have been dirtying and writing
data and metadata. The snapshot that _contains_ those updates thus
_follows_ that context's seq #.
Signed-off-by: Sage Weil <sage@newdream.net>
When adding the readdir results to the cache, ceph_set_dentry_offset was
clobbered our just-set offset. This can cause the readdir result offsets
to get out of sync with the server. Add an argument to the helper so
that it does not.
This bug was introduced by 1cd3935bed.
Signed-off-by: Sage Weil <sage@newdream.net>
Cast the value before shifting so that we don't run out of bits with a
32-bit unsigned long. This fixes wrapping of high file offsets into the
low 4GB of a file on disk, and the subsequent data corruption for large
files.
Signed-off-by: Sage Weil <sage@newdream.net>
Fix the reconnect encoding to encode the cap record when the MDS does not
have the FLOCK capability (i.e., pre v0.22).
Signed-off-by: Sage Weil <sage@newdream.net>
When we release a root dentry, particularly after a splice, the parent
(actually our) inode was evaluating to NULL and was getting dereferenced
by ceph_snap(). This is reproduced by something as simple as
mount -t ceph monhost:/a/b mnt
mount -t ceph monhost:/a mnt2
ls mnt2
A splice_dentry() would kill the old 'b' inode's root dentry, and we'd
crash while releasing it.
Fix by checking for both the ROOT and NULL cases explicitly. We only need
to invalidate the parent dir when we have a correct parent to invalidate.
Signed-off-by: Sage Weil <sage@newdream.net>
get_ticket_handler() returns a valid pointer or it returns
ERR_PTR(-ENOMEM) if kzalloc() fails.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
ceph_mdsc_build_path() returns an ERR_PTR but this code is set up to
handle NULL returns.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Just scrubbing some warnings so I can see real problem ones in the build
noise. For 32bit we need to coax gcc politely into believing we really
honestly intend to the casts. Using (u64)(unsigned long) means we cast from
a pointer to a type of the right size and then extend it. This stops the
warning spew.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Sage Weil <sage@newdream.net>
ceph_get_inode() returns an ERR_PTR and it doesn't return a NULL.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
We used to use i_head_snapc to keep track of which snapc the current epoch
of dirty data was dirtied under. It is used by queue_cap_snap to set up
the cap_snap. However, since we queue cap snaps for any dirty caps, not
just for dirty file data, we need to keep a valid i_head_snapc anytime
we have dirty|flushing caps. This fixes a NULL pointer deref in
queue_cap_snap when writing back dirty caps without data (e.g.,
snaptest-authwb.sh).
Signed-off-by: Sage Weil <sage@newdream.net>
Fix argument order. We want to move the item to the end of the list, not
change the position of the head.
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If we hold the EXCL cap, we cannot trust the dir stats from the MDS (num
files, subdirs) and must not incorrectly conclude that the directory is
empty. If we do, we get can bad results from lookup (bad ENOENT) and
bad readdir results.
Signed-off-by: Sage Weil <sage@newdream.net>
This allows code outside of the mm core to safely manipulate page state
and not worry about the other accounting. Not using these routines means
that some code will lose track of the accounting and we get bugs. This
has happened once already.
Signed-off-by: Michael Rubin <mrubin@google.com>
Signed-off-by: Sage Weil <sage@newdream.net>
When making a request in the virtual snapdir or a snapped portion of the
namespace, we should choose the MDS based on the first nonsnap parent (and
its caps). If that is not the best place, we will get forward hints to
find the right MDS in the cluster. This fixes ESTALE errors when using
the .snap directory and namespace with multiple MDSs.
Signed-off-by: Sage Weil <sage@newdream.net>
When a realm is updated, we need to queue writeback on inodes in that
realm _and_ its children. Otherwise, if the inode gets cowed on the
server, we can get a hang later due to out-of-sync cap/snap state.
Signed-off-by: Sage Weil <sage@newdream.net>
When we snapshot dirty metadata that needs to be written back to the MDS,
include dirty xattr metadata. Make the capsnap reference the encoded
xattr blob so that it will be written back in the FLUSHSNAP op.
Also fix the capsnap creation guard to include dirty auth or file bits,
not just tests specific to dirty file data or file writes in progress
(this fixes auth metadata writeback).
Signed-off-by: Sage Weil <sage@newdream.net>
We should include the xattr metadata blob in the cap update message any
time we are flushing dirty state, NOT just when we are also dropping the
cap. This fixes async xattr writeback.
Also, clean up the code slightly to avoid duplicating the bit test.
Signed-off-by: Sage Weil <sage@newdream.net>
The use of a completion when waiting for session shutdown during umount is
inappropriate, given the complexity of the condition. For multiple MDS's,
this resulted in the umount thread spinning, often preventing the session
close message from being processed in some cases.
Switch to a waitqueue and defined a condition helper. This cleans things
up nicely.
Signed-off-by: Sage Weil <sage@newdream.net>
Generalize the current statfs synchronous requests, and support pool_ops.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Normally, if the Fb cap bit is being revoked, we queue an async writeback.
If there is no dirty data but we still hold the cap, this leaves the
client sitting around doing nothing until the cap timeouts expire and the
cap is released on its own (as it would have been without the revocation).
Instead, only queue writeback if the bit is actually used (i.e., we have
dirty data). If not, we can reply to the revocation immediately.
Signed-off-by: Sage Weil <sage@newdream.net>
Implement flock inode operation to support advisory file locking. All
lock/unlock operations are synchronous with the MDS. Lock state is
sent when reconnecting to a recovering MDS to restore the shared lock
state.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Define the MDS operations and data types for doing file advisory locking
with the MDS.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
This informs the server that we will accept v2 client_caps format and v2
client_reconnect format messages.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Encode either old or v2 encoding of client_reconnect message, depending on
whether the peer has the FLOCK feature bit.
Signed-off-by: Sage Weil <sage@newdream.net>
The pool info contains a vector for snap_info_t, not snap ids. This fixes
the broken decoding, which would declare teh update corrupt when a pool
snapshot was created.
Signed-off-by: Sage Weil <sage@newdream.net>
The ->sync_fs() super op only needs to wait if wait is true. Otherwise,
just get some dirty cap writeback started.
Signed-off-by: Sage Weil <sage@newdream.net>
Specify the supported/required feature bits in super.h client code instead
of using the definitions from the shared kernel/userspace headers (which
will go away shortly).
Signed-off-by: Sage Weil <sage@newdream.net>
When we get a cap EXPORT message, make sure we are connected to all export
targets to ensure we can handle the matching IMPORT.
Signed-off-by: Sage Weil <sage@newdream.net>
If an MDS we are talking to may have failed, we need to open sessions to
its potential export targets to ensure that any in-progress migration that
may have involved some of our caps is properly handled.
Signed-off-by: Sage Weil <sage@newdream.net>
Caps related accounting is now being done per mds client instead
of just being global. This prepares ground work for a later revision
of the caps preallocated reservation list.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
If we have a capsnap but no auth cap (e.g. because it is migrating to
another mds), bail out and do nothing for now. Do NOT remove the capsnap
from the flush list.
Signed-off-by: Sage Weil <sage@newdream.net>
The caps revocation should either initiate writeback, invalidateion, or
call check_caps to ack or do the dirty work. The primary question is
whether we can get away with only checking the auth cap or whether all
caps need to be checked.
The old code was doing...something else. At the very least, revocations
from non-auth MDSs could break by triggering the "check auth cap only"
case.
Signed-off-by: Sage Weil <sage@newdream.net>
If the file mode is marked as "lazy," perform cached/buffered reads when
the caps permit it. Adjust the rdcache_gen and invalidation logic
accordingly so that we manage our cache based on the FILE_CACHE -or-
FILE_LAZYIO cap bits.
Signed-off-by: Sage Weil <sage@newdream.net>
If we have marked a file as "lazy" (using the ceph ioctl), perform buffered
writes when the MDS caps allow it.
Signed-off-by: Sage Weil <sage@newdream.net>
Allow an application to mark a file descriptor for lazy file consistency
semantics, allowing buffered reads and writes when multiple clients are
accessing the same file.
Signed-off-by: Sage Weil <sage@newdream.net>
Also clean up the file flags -> file mode -> wanted caps functions while
we're at it. This resyncs this file with userspace.
Signed-off-by: Sage Weil <sage@newdream.net>
This fixes an issue triggered by running concurrent syncs. One of the syncs
would go through while the other would just hang indefinitely. In any case, we
never actually want to wake a single waiter, so the *_all functions should
be used.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
When we embed a dentry lease release notification in a request, invalidate
our lease so we don't think we still have it. Otherwise we can get all
sorts of incorrect client behavior when multiple clients are interacting
with the same part of the namespace.
Signed-off-by: Sage Weil <sage@newdream.net>
Free the ceph_pg_mapping structs when they are removed from the pg_temp
rbtree. Also fix a leak in the __insert_pg_mapping() error path.
Signed-off-by: Sage Weil <sage@newdream.net>
We need to set the d_release dop for snapdir and snapped dentries so that
the ceph_dentry_info struct gets released. We also use the dcache to
cache readdir results when possible, which only works if we know when
dentries are dropped from the cache. Since we don't use the dcache for
readdir in the hidden snapdir, avoid that case in ceph_dentry_release.
Signed-off-by: Sage Weil <sage@newdream.net>
We should always go to the MDS for readdir on the hidden snapdir. The
set of snapshots can change at any time; the client can't trust its cache
for that.
Signed-off-by: Sage Weil <sage@newdream.net>
Strip the cap and dentry releases from replayed messages. They can
cause the shared state to get out of sync because they were generated
(with the request message) earlier, and no longer reflect the current
client state.
Signed-off-by: Sage Weil <sage@newdream.net>
Replayed rename operations (after an mds failure/recovery) were broken
because the request paths were regenerated from the dentry names, which
get mangled when d_move() is called.
Instead, resend the previous request message when replaying completed
operations. Just make sure the REPLAY flag is set and the target ino is
filled in.
This fixes problems with workloads doing renames when the MDS restarts,
where the rename operation appears to succeed, but on mds restart then
fails (leading to client confusion, app breakage, etc.).
Signed-off-by: Sage Weil <sage@newdream.net>
The buffer was too small. Make it bigger, use snprintf(), put brackets
around the ipv6 address to avoid mixing it up with the :port, and use the
ever-so-handy %pI[46] formats.
Signed-off-by: Sage Weil <sage@newdream.net>
A message can be on a queue (pending or sent), or out_msg (sending), or
both. We were assuming that if it's not on a queue it couldn't be out_msg,
but that was false in the case of lossy connections like the OSD. Fix
ceph_con_revoke() to treat these cases independently. Also, fix the
out_kvec_is_message check to only trigger if we are currently sending
_this_ message.
This fixes a GPF in tcp_sendpage, triggered by OSD restarts.
Signed-off-by: Sage Weil <sage@newdream.net>
We need to increase the total and used counters when allocating a new cap
in the non-reserved (cap import) case.
Signed-off-by: Sage Weil <sage@newdream.net>
We can drop caps with an mds request. Ensure we only drop unused AND
clean caps, since the MDS doesn't support cap writeback in that context,
nor do we track it. If caps are dirty, and the MDS needs them back, we
it will revoke and we will flush in the normal fashion.
This fixes a possibly loss of metadata.
Signed-off-by: Sage Weil <sage@newdream.net>
We may not recurse for CHOOSE_LEAF if we start with a leaf node. When
that happens, the out2 vector needs to be filled in with the result.
Signed-off-by: Sage Weil <sage@newdream.net>
This fixes a race between handle_reply finishing an mds request, signalling
completion, and then dropping the request structing and its dentry+inode
refs, and pre_umount function waiting for requests to finish before
letting the vfs tear down the dcache. If umount was delayed waiting for
mds requests, we could race and BUG in shrink_dcache_for_umount_subtree
because of a slow dput.
This delays umount until the msgr queue flushes, which means handle_reply
will exit and will have dropped the ceph_mds_request struct. I'm assuming
the VFS has already ensured that its calls have all completed and those
request refs have thus been dropped as well (I haven't seen that race, at
least).
Signed-off-by: Sage Weil <sage@newdream.net>
Handle a splice_dentry failure (due to a d_materialize_unique error)
without crashing. (Also, report the error code.)
Signed-off-by: Sage Weil <sage@newdream.net>
If the incremental osdmap has a new crush map, advance the position after
decoding so that we can parse the rest of the osdmap properly.
Signed-off-by: Sage Weil <sage@newdream.net>
We need to properly initialize skip, as not all alloc_msg op instances
set it.
Also, BUG if someone says skip but also allocates a message.
Signed-off-by: Sage Weil <sage@newdream.net>
If we have enough memory to allocate a new cap release message, do so, so
that we can send a partial release message immediately. This keeps us from
making the MDS wait when the cap release it needs is in a partially full
release message.
If we fail because of ENOMEM, oh well, they'll just have to wait a bit
longer.
Signed-off-by: Sage Weil <sage@newdream.net>
If we get an IMPORT that give us a cap, but we don't have the inode, queue
a release (and try to send it immediately) so that the MDS doesn't get
stuck waiting for us.
Signed-off-by: Sage Weil <sage@newdream.net>
bdi_seq is an atomic_long_t but we're using ATOMIC_INIT, which causes
build failures on ia64. This patch fixes it to use ATOMIC_LONG_INIT.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If the client revokes a lease with a higher seq than what we have, keep
the mds's seq, so that it honors our release. Otherwise, we can hang
indefinitely.
Signed-off-by: Sage Weil <sage@newdream.net>
We were setting f_namelen in kstatfs to PATH_MAX instead of NAME_MAX.
That disagrees with ceph_lookup behavior (which checks against NAME_MAX),
and also makes the pjd posix test suite spit out ugly errors because with
can't clean up its temporary files.
Signed-off-by: Sage Weil <sage@newdream.net>
We misused list_move_tail() to order the dentry in d_subdirs.
This will screw up the d_subdirs order.
This bug can be reliably reproduced by:
1. mount ceph fs.
2. on ceph fs, git clone git://ceph.newdream.net/git/ceph.git
3. Run autogen.sh in ceph directory.
(Note: Errors only occur at the first time you run autogen.sh.)
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: clean up on forwarded aborted mds request
ceph: fix leak of osd authorizer
ceph: close out mds, osd connections before stopping auth
ceph: make lease code DN specific
fs/ceph: Use ERR_CAST
ceph: renew auth tickets before they expire
ceph: do not resend mon requests on auth ticket renewal
ceph: removed duplicated #includes
ceph: avoid possible null dereference
ceph: make mds requests killable, not interruptible
sched: add wait_for_completion_killable_timeout
If an mds request is aborted (timeout, SIGKILL), it is left registered to
keep our state in sync with the mds. If we get a forward notification,
though, we know the request didn't succeed and we can unregister it
safely. We were trying to resend it, but then bailing out (and not
unregistering) in __do_request.
Signed-off-by: Sage Weil <sage@newdream.net>
The auth module (part of the mon_client) is needed to free any
ceph_authorizer(s) used by the mds and osd connections. Flush the msgr
workqueue before stopping monc to ensure that the destroy_authorizer
auth op is available when those connections are closed out.
Signed-off-by: Sage Weil <sage@newdream.net>
The lease code includes a mask in the CEPH_LOCK_* namespace, but that
namespace is changing, and only one mask (formerly _DN == 1) is used, so
hard code for that value for now.
If we ever extend this code to handle leases over different data types we
can extend it accordingly.
Signed-off-by: Sage Weil <sage@newdream.net>
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.
In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of
the returned value is the same as the type of the enclosing function.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
T x;
identifier f;
@@
T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }
@@
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
We were only requesting renewal after our tickets expire; do so before
that. Most of the low-level logic for this was already there; just use
it.
Signed-off-by: Sage Weil <sage@newdream.net>
We only want to send pending mon requests when we successfully
authenticate. If we are already authenticated, like when we renew our
ticket, there is no need to resend pending requests.
Signed-off-by: Sage Weil <sage@newdream.net>
fs/ceph/auth.c: linux/slab.h is included more than once.
fs/ceph/super.h: linux/slab.h is included more than once.
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Sage Weil <sage@newdream.net>
ac->ops may be null; use protocol id in error message instead.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The underlying problem is that many mds requests can't be restarted. For
example, a restarted create() would return -EEXIST if the original request
succeeds. However, we do not want a hung MDS to hang the client too. So,
use the _killable wait_for_completion variants to abort on SIGKILL but
nothing else.
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (59 commits)
ceph: reuse mon subscribe message instead of allocated anew
ceph: avoid resending queued message to monitor
ceph: Storage class should be before const qualifier
ceph: all allocation functions should get gfp_mask
ceph: specify max_bytes on readdir replies
ceph: cleanup pool op strings
ceph: Use kzalloc
ceph: use common helper for aborted dir request invalidation
ceph: cope with out of order (unsafe after safe) mds reply
ceph: save peer feature bits in connection structure
ceph: resync headers with userland
ceph: use ceph. prefix for virtual xattrs
ceph: throw out dirty caps metadata, data on session teardown
ceph: attempt mds reconnect if mds closes our session
ceph: clean up send_mds_reconnect interface
ceph: wait for mds OPEN reply to indicate reconnect success
ceph: only send cap releases when mds is OPEN|HUNG
ceph: dicard cap releases on mds restart
ceph: make mon client statfs handling more generic
ceph: drop src address(es) from message header [new protocol feature]
...
Use the same message, allocated during startup. No need to reallocate a
new one each time around (and potentially ENOMEM).
Signed-off-by: Sage Weil <sage@newdream.net>
Now that the last user passing a NULL file pointer is gone we can remove
the redundant dentry argument and associated hacks inside vfs_fsynmc_range.
The next step will be removig the dentry argument from ->fsync, but given
the luck with the last round of method prototype changes I'd rather
defer this until after the main merge window.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The auth_reply handler will (re)send any pending requests. For the
initial mon authenticate phase, that's correct, but when a auth ticket
renewal races with an in-flight request, we may resend a request message
that is already in flight. Avoid this by revoking the message before
sending it.
We should also avoid resending requests at all during ticket renewal; that
will come soon.
Signed-off-by: Sage Weil <sage@newdream.net>
The C99 specification states in section 6.11.5:
The placement of a storage-class specifier other than at the beginning
of the declaration specifiers in a declaration is an obsolescent
feature.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Sage Weil <sage@newdream.net>
This is essential, as for the rados block device we'll need
to run in different contexts that would need flags that
are other than GFP_NOFS.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Specify max bytes in request to bound size of reply. Add associated
mount option with default value of 512 KB.
Signed-off-by: Sage Weil <sage@newdream.net>
Use kzalloc rather than the combination of kmalloc and memset.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression x,size,flags;
statement S;
@@
-x = kmalloc(size,flags);
+x = kzalloc(size,flags);
if (x == NULL) S
-memset(x, 0, size);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
We invalidate I_COMPLETE and dentry leases in two places: on aborted mds
request and on request replay. Use common helper to avoid duplicate code.
Signed-off-by: Sage Weil <sage@newdream.net>
The remove_session_caps() helper is called when an MDS closes out our
session (either normally, or as a result of a failed reconnect), and when
we tear down state for umount. If we remove the last cap, and there are
no cap migrations in progress, then there is little hope of us flushing
out that data to the mds (without heroic efforts to reconnect and flush).
So, to avoid leaving inodes pinned (due to dirty state) and crashing after
umount, throw out dirty caps state and unpin the inodes. Print a warning
to the console so we know something was lost.
NOTE: Although we drop wrbuffer refs, we don't actually mark pages clean;
maybe a truncate should be queued?
Signed-off-by: Sage Weil <sage@newdream.net>
Currently, if our session is closed (due to a timeout, or explicit close,
or whatever), we just sit there doing nothing unless/until the MDS
restarts, at which point we try to reconnect.
Change client to attempt an immediate reconnect if our session is closed.
Note that currently the MDS doesn't support this, and our attempt will
fail. We'll get a session CLOSE, our caps and dirty cap state will be
dropped, and the client will be free to attempt to reconnect. That's
clearly not as nice as a successful reconnect, but it at least allows us
to try to carry on, and in the future the MDS will support a reconnect
and we will fare better.
Signed-off-by: Sage Weil <sage@newdream.net>
Pass a ceph_mds_session, since the caller has it.
Remove the dead code for sending empty reconnects. It used to be used
when the MDS contacted _us_ to solicit a reconnect, and we could reply
saying "go away, I have no session." Now we only send reconnects based
on the mds map, and only when we do in fact have an open session.
Signed-off-by: Sage Weil <sage@newdream.net>
We used to infer reconnect success by watching the MDS state, essentially
assuming that hearing nothing meant things were ok. That wasn't
particularly reliable. Instead, the MDS replies with an explicit OPEN
message to indicate success.
Strictly speaking, this is a protocol change, but it is a backwards
compatible one that does not break new clients + old servers or old
clients + new servers. At least not yet.
Drop unused @all argument from kick_requests while we're at it.
Signed-off-by: Sage Weil <sage@newdream.net>
On OPENING we shouldn't have any caps (or releases).
On CLOSING, we should wait until we succeed (and throw it all out), or
don't (and are OPEN again).
On RECONNECTING we can wait until we are OPEN.
Signed-off-by: Sage Weil <sage@newdream.net>
If the MDS restarts, the expire caps state is no longer shared, and can be
thrown out. Caps state will be rebuilt on the MDS during the reconnect
process that follows. Zero out any release messages and adjust the
release counter accordingly.
Signed-off-by: Sage Weil <sage@newdream.net>
This is being done so that we could reuse the statfs
infrastructure with other requests that return values.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The CEPH_FEATURE_NOSRCADDR protocol feature avoids putting the full source
address in each message header (twice). This patch switches the client to
the new scheme, and _requires_ this feature on the server. The server
will support both the old and new schemes. That means an old client will
work with a new server, but a new client will not work with an old server.
Signed-off-by: Sage Weil <sage@newdream.net>
The bdi_setup_and_register() helper doesn't help us since we bdi_init() in
create_client() and bdi_register() only when sget() succeeds.
Signed-off-by: Sage Weil <sage@newdream.net>
We want to assign an offset when the dentry goes from null to linked, which
is always done by splice_dentry(). Notably, we should NOT assign an
offset when a dentry is first created and is still null.
BUG if we try to splice a non-null dentry (we shouldn't).
Signed-off-by: Sage Weil <sage@newdream.net>