What was the purpose of this?
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I got lots of NULL pointer dereference Oops when compiling kernel on ceph.
The bug is because the kernel page migration routine replaces some pages
in the page cache with new pages, these new pages' private can be non-zero.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 28c0254ede)
Move the initialization of a ceph connection's private pointer,
operations vector pointer, and peer name information into
ceph_con_init(). Rearrange the arguments so the connection pointer
is first. Hide the byte-swapping of the peer entity number inside
ceph_con_init()
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Pull vfs changes from Al Viro.
"A lot of misc stuff. The obvious groups:
* Miklos' atomic_open series; kills the damn abuse of
->d_revalidate() by NFS, which was the major stumbling block for
all work in that area.
* ripping security_file_mmap() and dealing with deadlocks in the
area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
general.
* ->encode_fh() switched to saner API; insane fake dentry in
mm/cleancache.c gone.
* assorted annotations in fs (endianness, __user)
* parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
* ->update_time() work from Josef.
* other bits and pieces all over the place.
Normally it would've been in two or three pull requests, but
signal.git stuff had eaten a lot of time during this cycle ;-/"
Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
nfs: don't open in ->d_revalidate
vfs: retry last component if opening stale dentry
vfs: nameidata_to_filp(): don't throw away file on error
vfs: nameidata_to_filp(): inline __dentry_open()
vfs: do_dentry_open(): don't put filp
vfs: split __dentry_open()
vfs: do_last() common post lookup
vfs: do_last(): add audit_inode before open
vfs: do_last(): only return EISDIR for O_CREAT
vfs: do_last(): check LOOKUP_DIRECTORY
vfs: do_last(): make ENOENT exit RCU safe
vfs: make follow_link check RCU safe
vfs: do_last(): use inode variable
vfs: do_last(): inline walk_component()
vfs: do_last(): make exit RCU safe
vfs: split do_lookup()
Btrfs: move over to use ->update_time
fs: introduce inode operation ->update_time
reiserfs: get rid of resierfs_sync_super
reiserfs: mark the superblock as dirty a bit later
...
A ceph client has a pointer to a ceph messenger structure in it.
There is always exactly one ceph messenger for a ceph client, so
there is no need to allocate it separate from the ceph client
structure.
Switch the ceph_client structure to embed its ceph_messenger
structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
I got lots of NULL pointer dereference Oops when compiling kernel on ceph.
The bug is because the kernel page migration routine replaces some pages
in the page cache with new pages, these new pages' private can be non-zero.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Merge misc patches from Andrew Morton:
- the "misc" tree - stuff from all over the map
- checkpatch updates
- fatfs
- kmod changes
- procfs
- cpumask
- UML
- kexec
- mqueue
- rapidio
- pidns
- some checkpoint-restore feature work. Reluctantly. Most of it
delayed a release. I'm still rather worried that we don't have a
clear roadmap to completion for this work.
* emailed from Andrew Morton <akpm@linux-foundation.org>: (78 patches)
kconfig: update compression algorithm info
c/r: prctl: add ability to set new mm_struct::exe_file
c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
syscalls, x86: add __NR_kcmp syscall
fs, proc: introduce /proc/<pid>/task/<tid>/children entry
sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
eventfd: change int to __u64 in eventfd_signal()
fs/nls: add Apple NLS
pidns: make killed children autoreap
pidns: use task_active_pid_ns in do_notify_parent
rapidio/tsi721: add DMA engine support
rapidio: add DMA engine support for RIO data transfers
ipc/mqueue: add rbtree node caching support
tools/selftests: add mq_perf_tests
ipc/mqueue: strengthen checks on mqueue creation
ipc/mqueue: correct mq_attr_ok test
ipc/mqueue: improve performance of send/recv
selftests: add mq_open_tests
...
ULONG_MAX is often used to check for integer overflow when calculating
allocation size. While ULONG_MAX happens to work on most systems, there
is no guarantee that `size_t' must be the same size as `long'.
This patch introduces SIZE_MAX, the maximum value of `size_t', to improve
portability and readability for allocation size validation.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Acked-by: Alex Elder <elder@dreamhost.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull ceph updates from Sage Weil:
"There are some updates and cleanups to the CRUSH placement code, a bug
fix with incremental maps, several cleanups and fixes from Josh Durgin
in the RBD block device code, a series of cleanups and bug fixes from
Alex Elder in the messenger code, and some miscellaneous bounds
checking and gfp cleanups/fixes."
Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the
networking people preferring "unsigned int" over just "unsigned".
* git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits)
libceph: fix pg_temp updates
libceph: avoid unregistering osd request when not registered
ceph: add auth buf in prepare_write_connect()
ceph: rename prepare_connect_authorizer()
ceph: return pointer from prepare_connect_authorizer()
ceph: use info returned by get_authorizer
ceph: have get_authorizer methods return pointers
ceph: ensure auth ops are defined before use
ceph: messenger: reduce args to create_authorizer
ceph: define ceph_auth_handshake type
ceph: messenger: check return from get_authorizer
ceph: messenger: rework prepare_connect_authorizer()
ceph: messenger: check prepare_write_connect() result
ceph: don't set WRITE_PENDING too early
ceph: drop msgr argument from prepare_write_connect()
ceph: messenger: send banner in process_connect()
ceph: messenger: reset connection kvec caller
libceph: don't reset kvec in prepare_write_banner()
ceph: ignore preferred_osd field
ceph: fully initialize new layout
...
Use parent_inode has a flag for whether nfsd wants a connectable fh, but
generate one opportunistically so that we can take advantage of the
additional info in there.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pass inode + parent's inode or NULL instead of dentry + bool saying
whether we want the parent or not.
NOTE: that needs ceph fix folded in.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rather than passing a bunch of arguments to be filled in with the
content of the ceph_auth_handshake buffer now returned by the
get_authorizer method, just use the returned information in the
caller, and drop the unnecessary arguments.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Have the get_authorizer auth_client method return a ceph_auth
pointer rather than an integer, pointer-encoding any returned
error value. This is to pave the way for making use of the
returned value in an upcoming patch.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In the create_authorizer method for both the mds and osd clients,
the auth_client->ops pointer is blindly dereferenced. There is no
obvious guarantee that this pointer has been assigned. And
furthermore, even if the ops pointer is non-null there is definitely
no guarantee that the create_authorizer or destroy_authorizer
methods are defined.
Add checks in both routines to make sure they are defined (non-null)
before use. Add similar checks in a few other spots in these files
while we're at it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Make use of the new ceph_auth_handshake structure in order to reduce
the number of arguments passed to the create_authorizor method in
ceph_auth_client_ops. Use a local variable of that type as a
shorthand in the get_authorizer method definitions.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The definitions for the ceph_mds_session and ceph_osd both contain
five fields related only to "authorizers." Encapsulate those fields
into their own struct type, allowing for better isolation in some
upcoming patches.
Fix the #includes in "linux/ceph/osd_client.h" to lay out their more
complete canonical path.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Old users may not expect EINVAL, and there is no clear user-visibile
behavior change now that we ignore it.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
When we are setting a new layout, fully initialize the structure:
- zero it out
- always set preferred_osd to -1
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Both of these methods perform similar checks; move that code to a helper
so that we can ensure the checks are consistent.
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
This was an ill-conceived feature that has been removed from Ceph. Do
this gracefully:
- reject attempts to specify a preferred_osd via the ioctl
- stop exposing this information via virtual xattrs
- always fill in -1 for requests, in case we talk to an older server
- don't calculate preferred_osd placements/pgids
Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Pull Ceph updates for 3.4-rc1 from Sage Weil:
"Alex has been busy. There are a range of rbd and libceph cleanups,
especially surrounding device setup and teardown, and a few critical
fixes in that code. There are more cleanups in the messenger code,
virtual xattrs, a fix for CRC calculation/checks, and lots of other
miscellaneous stuff.
There's a patch from Amon Ott to make inos behave a bit better on
32-bit boxes, some decode check fixes from Xi Wang, and network
throttling fix from Jim Schutt, and a couple RBD fixes from Josh
Durgin.
No new functionality, just a lot of cleanup and bug fixing."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (65 commits)
rbd: move snap_rwsem to the device, rename to header_rwsem
ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
libceph: isolate kmap() call in write_partial_msg_pages()
libceph: rename "page_shift" variable to something sensible
libceph: get rid of zero_page_address
libceph: only call kernel_sendpage() via helper
libceph: use kernel_sendpage() for sending zeroes
libceph: fix inverted crc option logic
libceph: some simple changes
libceph: small refactor in write_partial_kvec()
libceph: do crc calculations outside loop
libceph: separate CRC calculation from byte swapping
libceph: use "do" in CRC-related Boolean variables
ceph: ensure Boolean options support both senses
libceph: a few small changes
libceph: make ceph_tcp_connect() return int
libceph: encapsulate some messenger cleanup code
libceph: make ceph_msgr_wq private
libceph: encapsulate connection kvec operations
libceph: move prepare_write_banner()
...
In ceph_vxattrcb_file_layout(), there is a check to determine
whether a preferred PG should be formatted into the output buffer.
That check assumes that a preferred PG number of 0 indicates "no
preference," but that is wrong. No preference is indicated by a
negative (specifically, -1) PG number.
In addition, if that condition yields true, the preferred value
is formatted into a sized buffer, but the size consumed by the
earlier snprintf() call is not accounted for, opening up the
possibilty of a buffer overrun.
Finally, in ceph_vxattrcb_dir_rctime() where the nanoseconds part of
the time displayed did not include leading 0's, which led to
erroneous (sub-second portion of) time values being shown.
This fixes these three issues:
http://tracker.newdream.net/issues/2155http://tracker.newdream.net/issues/2156http://tracker.newdream.net/issues/2157
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
Many ceph-related Boolean options offer the ability to both enable
and disable a feature. For all those that don't offer this, add
a new option so that they do.
Note that ceph_show_options()--which reports mount options currently
in effect--only reports the option if it is different from the
default value.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful. With this interface is not evident at call sites that
the pointer is always initialized. Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
This patch just rearranges a few bits of code to make more
portions of ceph_setxattr() and ceph_removexattr() identical.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
All names defined in the directory and file virtual extended
attribute tables are constant, and the size of each is known at
compile time. So there's no need to compute their length every
time any file's attribute is listed.
Record the length of each string and use it when needed to determine
the space need to represent them. In addition, compute the
aggregate size of strings in each table just once at initialization
time.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The names of the callback functions used for virtual extended
attributes are based only on the last component of the attribute
name. Because of the way these are defined, this precludes allowing
a single (lowest) attribute name for different callbacks, dependent
on the type of file being operated on. (For example, it might be
nice to support both "ceph.dir.layout" and "ceph.file.layout".)
Just change the callback names to avoid this problem.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
A struct ceph_vxattr_cb does not represent a callback at all, but
rather a virtual extended attribute itself. Drop the "_cb" suffix
from its name to reflect that.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Entries in the ceph virtual extended attribute tables all follow a
distinct pattern in their definition. Enforce this pattern through
the use of a macro.
Also, a null name field signals the end of the table, so make that
be the first field in the ceph_vxattr_cb structure.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Use symbolic constants to define the top-level prefix for "ceph."
extended attribute names.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
All callers of ceph_match_vxattr() determine what to pass as the
first argument by calling ceph_inode_vxattrs(inode). Just do that
inside ceph_match_vxattr() itself, changing it to take an inode
rather than the vxattr pointer as its first argument.
Also ensure the function works correctly for an empty table (i.e.,
containing only a terminating null entry).
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
For some reason, ceph_setxattr() allocates an extra byte in which a
'\0' is stored past the end of an extended attribute value. This is
not needed, and is potentially misleading, so get rid of it.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The overflow check for a + n * b should be (n > (ULONG_MAX - a) / b),
rather than (n > ULONG_MAX / b - a).
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size
do not match.
Also use kstrndup rather than kmalloc/memcpy.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
The root directory of the Ceph mount has inode number 1, so falling back
to 1 always creates a collision. 2 is unused on my test systems and seems
less likely to collide.
Signed-off-by: Amon Ott <ao@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Avoid the need to check for a special zero s_cap_ttl value by just
using (jiffies - 1) as the value assigned to indicate "sometime in
the past."
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: fix safety of rbd_put_client()
rbd: fix a memory leak in rbd_get_client()
ceph: create a new session lock to avoid lock inversion
ceph: fix length validation in parse_reply_info()
ceph: initialize client debugfs outside of monc->mutex
ceph: change "ceph.layout" xattr to be "ceph.file.layout"
Lockdep was reporting a possible circular lock dependency in
dentry_lease_is_valid(). That function needs to sample the
session's s_cap_gen and and s_cap_ttl fields coherently, but needs
to do so while holding a dentry lock. The s_cap_lock field was
being used to protect the two fields, but that can't be taken while
holding a lock on a dentry within the session.
In most cases, the s_cap_gen and s_cap_ttl fields only get operated
on separately. But in three cases they need to be updated together.
Implement a new lock to protect the spots updating both fields
atomically is required.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
"len" is read from network and thus needs validation. Otherwise, given
a bogus "len" value, p+len could be an out-of-bounds pointer, which is
used in further parsing.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The virtual extended attribute named "ceph.layout" is meaningful
only for regular files. Change its name to be "ceph.file.layout" to
more directly reflect that in the ceph xattr namespace. Preserve
the old "ceph.layout" name for the time being (until we decide it's
safe to get rid of it entirely).
Add a missing initializer for "readonly" in the terminating entry.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: ensure prealloc_blob is in place when removing xattr
rbd: initialize snap_rwsem in rbd_add()
ceph: enable/disable dentry complete flags via mount option
vfs: export symbol d_find_any_alias()
ceph: always initialize the dentry in open_root_dentry()
libceph: remove useless return value for osd_client __send_request()
ceph: avoid iput() while holding spinlock in ceph_dir_fsync
ceph: avoid useless dget/dput in encode_fh
ceph: dereference pointer after checking for NULL
crush: fix force for non-root TAKE
ceph: remove unnecessary d_fsdata conditional checks
ceph: Use kmemdup rather than duplicating its implementation
Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
always initialize the dentry in open_root_dentry)
In __ceph_build_xattrs_blob(), if a ceph inode's extended attributes
are marked dirty, all attributes recorded in its rb_tree index are
formatted into a "blob" buffer. The target buffer is recorded in
ceph_inode->i_xattrs.prealloc_blob, and it is expected to exist and
be of sufficient size to hold the attributes.
The extended attributes are marked dirty in two cases: when a new
attribute is added to the inode; or when one is removed. In the
former case work is done to ensure the prealloc_blob buffer is
properly set up, but in the latter it is not.
Change the logic in ceph_removexattr() so it matches what is
done in ceph_setxattr(). Note that this is done in a way that
keeps the two blocks of code nearly identical, in anticipation
of a subsequent patch that encapsulates some of this logic into
one or more helper routines.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Enable/disable use of the dentry dir 'complete' flag via a mount option.
This lets the admin control whether ceph uses the dcache to satisfy
negative lookups or readdir when it has the entire directory contents in
its cache.
This is purely a performance optimization; correctness is guaranteed
whether it is enabled or not.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
When open_root_dentry() gets a dentry via d_obtain_alias() it does
not get initialized. If the dentry obtained came from the cache,
this is OK. But if not, the result is an improperly initialized
dentry.
To fix this, call ceph_init_dentry() regardless of which path
produced the dentry. That function returns immediately for a dentry
that is already initialized, it is safe to use either way.
(Credit to Sage, who suggested this fix.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Nothing we do here sleeps, so just do it under d_lock and avoid the dget/
dput entirely.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Sage Weil <sage@newdream.net>
We now set d_fsdata unconditionally on all dentries prior to setting up
the d_ops, so all of these checks are unnecessary.
Signed-off-by: Sage Weil <sage@newdream.net>
vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
vfs_mkdir() gets int, but immediately drops everything that might not
fit into umode_t and that's the only caller of ->mkdir()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ceph attempts to use the dcache to satisfy negative lookups and readdir
when the entire directory contents are in cache. Disable this behavior
until lingering bugs in this code are shaken out; we'll re-enable these
hooks once things are fully stable.
Signed-off-by: Sage Weil <sage@newdream.net>
Commit 06222e491e got the if wrong so that
it always evaluates as true. This is semantically harmless, but makes
SEEK_CUR and SEEK_SET needlessly query the server.
Rewrite the if to explicitly enumerate the cases we DO need a valid i_size
to make this code less fragile.
Reported-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
We have been using i_lock to protect all kinds of data structures in the
ceph_inode_info struct, including lists of inodes that we need to iterate
over while avoiding races with inode destruction. That requires grabbing
a reference to the inode with the list lock protected, but igrab() now
takes i_lock to check the inode flags.
Changing the list lock ordering would be a painful process.
However, using a ceph-specific i_ceph_lock in the ceph inode instead of
i_lock is a simple mechanical change and avoids the ordering constraints
imposed by igrab().
Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Set up d_fsdata on the root dentry. This fixes a NULL pointer dereference
in ceph_d_prune on umount. It also means we can eventually strip out all
of the conditional checks on d_fsdata because it is now set unconditionally
(prior to setting up the d_ops).
Fix the ceph_d_prune debug print while we're here.
Signed-off-by: Sage Weil <sage@newdream.net>
If we queue a work item that calls iput(), make sure we ihold() before
attempting to queue work. Otherwise our queued work might miraculously run
before we notice the queue_work() succeeded and call ihold(), allowing the
inode to be destroyed.
That is, instead of
if (queue_work(...))
ihold();
we need to do
ihold();
if (!queue_work(...))
iput();
Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Quiet the sparse noise:
warning: symbol 'create_fs_client' was not declared. Should it be static?
warning: symbol 'destroy_fs_client' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Sage Weil <sage@newdream.net>
ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Quiet the following sparse noise:
warning: symbol 'get_nonsnap_parent' was not declared. Should it be static?
warning: symbol 'done_closing_sessions' was not declared. Should it be static?
Local functions don't need external visability. Make them static.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We used to use a flag on the directory inode to track whether the dcache
contents for a directory were a complete cached copy. Switch to a dentry
flag CEPH_D_COMPLETE that is safely updated by ->d_prune().
Signed-off-by: Sage Weil <sage@newdream.net>
When the VFS prunes a dentry from the cache, clear the D_COMPLETE flag
on the parent dentry. Do this for the live and snapshotted namespaces. Do
not bother for the .snap dir contents, since we do not cache that.
Signed-off-by: Sage Weil <sage@newdream.net>
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
ceph_release_page_vector() kfrees the vector; we shouldn't do it here too.
Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
Previously we were validating the passed-in stripe unit, object size,
and stripe count against each other (and not testing most other stuff).
Instead, make sure that the composed previous layout and new values are valid,
and only send the new values to the MDS. This lets users change the
pool without setting the whole layout, for instance.
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
This reverts commit c9af9fb68e.
We need to block and truncate all pages in order to reliably invalidate
them. Otherwise, we could:
- have some uptodate pages in the cache
- queue an invalidate
- write(2) locks some pages
- invalidate_work skips them
- write(2) only overwrites part of the page
- page now dirty and uptodate
-> partial leakage of invalidated data
It's not entirely clear why we started skipping locked pages in the first
place. I just ran this through fsx and didn't see any problems.
Signed-off-by: Sage Weil <sage@newdream.net>
The pool allocation failures are masked by the pool; there is no need to
spam the console about them. (That's the whole point of having the pool
in the first place.)
Mark msg allocations whose failure is safely handled as such.
Signed-off-by: Sage Weil <sage@newdream.net>
This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.
Signed-off-by: Sage Weil <sage@newdream.net>
The 'rsize' mount option limits the maximum size of an individual
read(ahead) operation that is sent off to an OSD. This is distinct from
'rasize', which controls the size of the readahead window.
Signed-off-by: Sage Weil <sage@newdream.net>
When we get a ->readpages() aop, submit async reads for all page ranges
in the provided page list. Lock the pages immediately, so that VFS/MM
will block until the reads complete.
Signed-off-by: Sage Weil <sage@newdream.net>
kfree does not clean up indirect allocations in
ceph_fs_client and ceph_options (e.g. snapdir_name).
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
A 'path' consists of a starting ino and relative component. Encode even
when there is no relative component. This is primarily needed by the
NFS reexport code.
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
ceph: document unlocked d_parent accesses
ceph: explicitly reference rename old_dentry parent dir in request
ceph: document locking for ceph_set_dentry_offset
ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
ceph: protect d_parent access in ceph_d_revalidate
ceph: protect access to d_parent
ceph: handle racing calls to ceph_init_dentry
ceph: set dir complete frag after adding capability
rbd: set blk_queue request sizes to object size
ceph: set up readahead size when rsize is not passed
rbd: cancel watch request when releasing the device
ceph: ignore lease mask
ceph: fix ceph_lookup_open intent usage
ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
ceph: fix bad parent_inode calc in ceph_lookup_open
ceph: avoid carrying Fw cap during write into page cache
libceph: don't time out osd requests that haven't been received
ceph: report f_bfree based on kb_avail rather than diffing.
ceph: only queue capsnap if caps are dirty
ceph: fix snap writeback when racing with writes
...
For the most part we don't care about racing with rename when directing
MDS requests; either the old or new parent is fine. Document that, and
do some minor cleanup.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We carry a pin on the parent directory for the rename source and dest
dentries. For the source it's r_locked_dir; we need to explicitly
reference the old_dentry parent as well, since the dentry's d_parent may
change between when the request was created and pinned and when it is
freed.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Have caller pass in a safely-obtained reference to the parent directory
for calculating a dentry's hash valud.
While we're here, simpify the flow through ceph_encode_fh() so that there
is a single exit point and cleanup.
Also fix a bug with the dentry hash calculation: calculate the hash for the
dentry we were given, not its parent.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Protect d_parent with d_lock. Carry a reference. Simplify the flow so
that there is a single exit point and cleanup.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
d_parent is protected by d_lock: use it when looking up a dentry's parent
directory inode. Also take a reference and drop it in the caller to avoid
a use-after-free.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The ->lookup() and prepopulate_readdir() callers are working with unhashed
dentries, so we don't have to worry. The export.c callers, though, need
to initialize something they got back from d_obtain_alias() and are
potentially racing with other callers. Make sure we don't return unless
the dentry is properly initialized (by us or someone else).
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Curretly ceph_add_cap clears the complete bit if we are newly issued the
FILE_SHARED cap, which is normally the case for a newly issue cap on a new
directory. That means we clear the just-set bit. Move the check that sets
the flag to after the cap is added/updated.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
This should improve the default read performance, as without it
readahead is practically disabled.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
The lease mask is no longer used (and it changed a while back). Instead,
use a non-zero duration to indicate that there is a lease being issued.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We weren't properly calling lookup_instantiate_filp when setting up the
lookup intent, which could lead to file leakage on errors. So:
- use separate helper for the hidden snapdir translation, immediately
following the mds request
- use ceph_finish_lookup for the final dentry/return value dance in the
exit path
- lookup_instantiate_filp on success
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We only need to put these on the directory unsafe list if they have
side effects that fsync(2) should flush out.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
We were always getting NULL here because the intent file f_dentry is always
NULL at this point, which means we were always passing NULL to
ceph_mdsc_do_request. In reality, this was fine, since this isn't
currently ever a write operation that needs to get strung on the dir's
unsafe list.
Use the dir explicitly, and only pass it if this open has side-effects that
a dir fsync should flush.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The generic_file_aio_write call may block on balance_dirty_pages while we
flush data to the OSDs. If we hold a reference to the FILE_WR cap during
that interval revocation by the MDS (e.g., to do a stat(2)) may be very
slow.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>