rcu_dereference_check_fdtable() looks very wrong,
1. rcu_my_thread_group_empty() was added by 844b9a8707 "vfs: fix
RCU-lockdep false positive due to /proc" but it doesn't really
fix the problem. A CLONE_THREAD (without CLONE_FILES) task can
hit the same race with get_files_struct().
And otoh rcu_my_thread_group_empty() can suppress the correct
warning if the caller is the CLONE_FILES (without CLONE_THREAD)
task.
2. files->count == 1 check is not really right too. Even if this
files_struct is not shared it is not safe to access it lockless
unless the caller is the owner.
Otoh, this check is sub-optimal. files->count == 0 always means
it is safe to use it lockless even if files != current->files,
but put_files_struct() has to take rcu_read_lock(). See the next
patch.
This patch removes the buggy checks and turns fcheck_files() into
__fcheck_files() which uses rcu_dereference_raw(), the "unshared"
callers, fget_light() and fget_raw_light(), can use it to avoid
the warning from RCU-lockdep.
fcheck_files() is trivially reimplemented as rcu_lockdep_assert()
plus __fcheck_files().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
kill pointless method instances and don't bother with ->owner - it's
ignored for procfs files anyway, make use of remove_proc_subtree() for
removal, get rid of cell->proc_dir.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pass owner explicitly to __register_nls(), make register_nls() a macro passing
THIS_MODULE as the owner argument to __register_nls().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* don't assume that ->dest_count won't change between copy_from_user()
and memdup_user()
* use fdget instead of fget
* don't bother comparing superblocks when we'd already compared vfsmounts
* get rid of excessive goto
* use file_inode() instead of open-coding the sucker
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pass on-disk superblock to qnx4_chkroot() explicitly
* don't leave stale (and unused) pointers in qnx4_super_block
* free stuff in ->kill_sb(); ->put_super() becomes empty and dies
* simplify failure exits
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If ecryptfs_readlink_lower() fails, buf remains an uninitialized
pointer and passing it nd_set_link() won't do anything good.
Fixed by switching ecryptfs_readlink_lower() to saner API - make it
return buf or ERR_PTR(...) and update callers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A struct svc_fh is 320 bytes on x86_64, it'd be better not to have these
on the stack.
kmalloc'ing them probably isn't ideal either, but this is the simplest
thing to do. If it turns out to be a problem in the readdir case then
we could add a svc_fh to nfsd4_readdir and pass that in.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Some time ago, mkfs.xfs started picking the storage physical
sector size as the default filesystem "sector size" in order
to avoid RMW costs incurred by doing IOs at logical sector
size alignments.
However, this means that for a filesystem made with i.e.
a 4k sector size on an "advanced format" 4k/512 disk,
512-byte direct IOs are no longer allowed. This means
that XFS has essentially turned this AF drive into a hard
4K device, from the filesystem on up.
XFS's mkfs-specified "sector size" is really just controlling
the minimum size & alignment of filesystem metadata.
There is no real need to tightly couple XFS's minimal
metadata size to the minimum allowed direct IO size;
XFS can continue doing metadata in optimal sizes, but
still allow smaller DIOs for apps which issue them,
for whatever reason.
This patch adds a new field to the xfs_buftarg, so that
we now track 2 sizes:
1) The metadata sector size, which is the minimum unit and
alignment of IO which will be performed by metadata operations.
2) The device logical sector size
The first is used internally by the file system for metadata
alignment and IOs.
The second is used for the minimum allowed direct IO alignment.
This has passed xfstests on filesystems made with 4k sectors,
including when run under the patch I sent to ignore
XFS_IOC_DIOINFO, and issue 512 DIOs anyway. I also directly
tested end of block behavior on preallocated, sparse, and
existing files when we do a 512 IO into a 4k file on a
4k-sector filesystem, to be sure there were no unexpected
behaviors.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In preparation for adding new members to the structure,
give these old ones more descriptive names:
bt_ssize -> bt_meta_sectorsize
bt_smask -> bt_meta_sectormask
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Clean up the xfs_buftarg structure a bit:
- remove bt_bsize which is never used
- replace bt_sshift with bt_ssize; we only ever shift it back
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Getting an inode by romfs_iget may lead to an err in fill_super, and the
err value should be return.
And it should return -ENOMEM instead while d_make_root fails, fix it too.
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
So far, POSIX ACLs are using a canonical representation that keeps all ACL
entries in a strict order; the ACL_USER and ACL_GROUP entries for specific
users and groups are ordered by user and group identifier, respectively.
The user-space code provides ACL entries in this order; the kernel
verifies that the ACL entry order is correct in posix_acl_valid().
User namespaces allow to arbitrary map user and group identifiers which
can cause the ACL_USER and ACL_GROUP entry order to differ between user
space and the kernel; posix_acl_valid() would then fail.
Work around this by allowing ACL_USER and ACL_GROUP entries to be in any
order in the kernel. The effect is only minor: file permission checks
will pick the first matching ACL_USER entry, and check all matching
ACL_GROUP entries.
(The libacl user-space library and getfacl / setfacl tools will not create
ACLs with duplicate user or group idenfifiers; they will handle ACLs with
entries in an arbitrary order correctly.)
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
use do{}while - more efficient and it squishes a coccinelle warning
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead
of opencoding an alternate postorder iteration that modifies the tree
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead
of opencoding an alternate postorder iteration that modifies the tree
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead
of opencoding an alternate postorder iteration that modifies the tree
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michel Lespinasse <walken@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead
of opencoding an alternate postorder iteration that modifies the tree
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently both setup_new_exec() and flush_old_exec() issue a call to
arch_pick_mmap_layout(). As setup_new_exec() and flush_old_exec() are
always called pairwise arch_pick_mmap_layout() is called twice.
This patch removes one call from setup_new_exec() to have it only called
once.
Signed-off-by: Richard Weinberger <richard@nod.at>
Tested-by: Pat Erley <pat-lkml@erley.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userspace process doesn't want the PF_NO_SETAFFINITY, but its parent may be
a kernel worker thread which has PF_NO_SETAFFINITY set, and this worker thread
can do kernel_thread() to create the child.
Clearing this flag in usersapce child to enable its migrating capability.
Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the remaining next_thread (ab)users to use while_each_thread().
The last user which should be changed is next_tid(), but we can't do this
now.
__exit_signal() and complete_signal() are fine, they actually need
next_thread() logic.
This patch (of 3):
do_task_stat() can use while_each_thread(), no changes in
the compiled code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Reviewed-by: Sameer Nanda <snanda@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can kill either task->did_exec or PF_FORKNOEXEC, they are mutually
exclusive. The patch kills ->did_exec because it has a single user.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both success/failure paths cleanup bprm->file, we can move this
code into free_bprm() to simlify and cleanup this logic.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs_struct->in_exec == T means that this ->fs is used by a single process
(thread group), and one of the treads does do_execve().
To avoid the mt-exec races this code has the following complications:
1. check_unsafe_exec() returns -EBUSY if ->in_exec was
already set by another thread.
2. do_execve_common() records "clear_in_exec" to ensure
that the error path can only clear ->in_exec if it was
set by current.
However, after 9b1bf12d5d "signals: move cred_guard_mutex from
task_struct to signal_struct" we do not need these complications:
1. We can't race with our sub-thread, this is called under
per-process ->cred_guard_mutex. And we can't race with
another CLONE_FS task, we already checked that this fs
is not shared.
We can remove the dead -EAGAIN logic.
2. "out_unmark:" in do_execve_common() is either called
under ->cred_guard_mutex, or after de_thread() which
kills other threads, so we can't race with sub-thread
which could set ->in_exec. And if ->fs is shared with
another process ->in_exec should be false anyway.
We can clear in_exec unconditionally.
This also means that check_unsafe_exec() can be void.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
next_thread() should be avoided, change check_unsafe_exec() to use
while_each_thread().
Nobody except signal->curr_target actually needs next_thread-like code,
and we need to change (fix) this interface. This particular code is fine,
p == current. But in general the code like this can loop forever if p
exits and next_thread(t) can't reach the unhashed thread.
This also saves 32 bytes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PROC_FS is a bool, so this code is either present or absent. It will
never be modular, so using module_init as an alias for __initcall is
rather misleading.
Fix this up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h to
obviously non-modular code, and that would be ugly at best.
Note that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of fs_initcall (which makes sense for fs code)
will thus change these registrations from level 6-device to level 5-fs
(i.e. slightly earlier). However no observable impact of that small
difference has been observed during testing, or is expected.
Also note that this change uncovers a missing semicolon bug in the
registration of vmcore_init as an initcall.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trivial cleanup to eliminate a goto.
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Distribution kernels might want to build in support for /proc/device-tree
for kernels that might end up running on hardware that doesn't support
openfirmware. This results in an empty /proc/device-tree existing.
Remove it if the OFW root node doesn't exist.
This situation actually confuses grub2, resulting in install failures.
grub2 sees the /proc/device-tree and picks the wrong install target cf.
http://bzr.savannah.gnu.org/lh/grub/trunk/grub/annotate/4300/util/grub-install.in#L311
grub should be more robust, but still, leaving an empty proc dir seems
pointless.
Addresses https://bugzilla.redhat.com/show_bug.cgi?id=818378.
Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use existing accessors proc_set_user() and proc_set_size() to set
attributes. Just a cleanup.
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. proc_task_readdir()->first_tid() path truncates f_pos to int, this
is wrong even on 64bit.
We could check that f_pos < PID_MAX or even INT_MAX in
proc_task_readdir(), but this patch simply checks the potential
overflow in first_tid(), this check is nop on 64bit. We do not care if
it was negative and the new unsigned value is huge, all we need to
ensure is that we never wrongly return !NULL.
2. Remove the 2nd "nr != 0" check before get_nr_threads(),
nr_threads == 0 is not distinguishable from !pid_task() above.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sameer Nanda <snanda@chromium.org>
Cc: Sergey Dyasly <dserrg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_task_readdir() does not really need "leader", first_tid() has to
revalidate it anyway. Just pass proc_pid(inode) to first_tid() instead,
it can do pid_task(PIDTYPE_PID) itself and read ->group_leader only if
necessary.
The patch also extracts the "inode is dead" code from
pid_delete_dentry(dentry) into the new trivial helper,
proc_inode_is_dead(inode), proc_task_readdir() uses it to return -ENOENT
if this dir was removed.
This is a bit racy, but the race is very inlikely and the getdents() after
openndir() can see the empty "." + ".." dir only once.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sameer Nanda <snanda@chromium.org>
Cc: Sergey Dyasly <dserrg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rerwrite the main loop to use while_each_thread() instead of
next_thread(). We are going to fix or replace while_each_thread(),
next_thread() should be avoided whenever possible.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sameer Nanda <snanda@chromium.org>
Cc: Sergey Dyasly <dserrg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_task_readdir() verifies that the result of get_proc_task() is
pid_alive() and thus its ->group_leader is fine too. However this is not
necessarily true after rcu_read_unlock(), we need to recheck this again
after first_tid() does rcu_read_lock(). Otherwise
leader->thread_group.next (used by next_thread()) can be invalid if the
rcu grace period expires in between.
The race is subtle and unlikely, but still it is possible afaics. To
simplify lets ignore the "likely" case when tid != 0, f_version can be
cleared by proc_task_operations->llseek().
Suppose we have a main thread M and its subthread T. Suppose that f_pos
== 3, iow first_tid() should return T. Now suppose that the following
happens between rcu_read_unlock() and rcu_read_lock():
1. T execs and becomes the new leader. This removes M from
->thread_group but next_thread(M) is still T.
2. T creates another thread X which does exec as well, T
goes away.
3. X creates another subthread, this increments nr_threads.
4. first_tid() does next_thread(M) and returns the already
dead T.
Note also that we need 2. and 3. only because of get_nr_threads() check,
and this check was supposed to be optimization only.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sameer Nanda <snanda@chromium.org>
Cc: Sergey Dyasly <dserrg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_task_state() and task_state_array[] look confusing and suboptimal, it
is not clear what it can actually report to user-space and
task_state_array[] blows .data for no reason.
1. state = (tsk->state & TASK_REPORT) | tsk->exit_state is not
clear. TASK_REPORT is self-documenting but it is not clear
what ->exit_state can add.
Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD)
into TASK_REPORT and use it to calculate the final result.
2. With the change above it is obvious that task_state_array[]
has the unused entries just to make BUILD_BUG_ON() happy.
Change this BUILD_BUG_ON() to use TASK_REPORT rather than
TASK_STATE_MAX and shrink task_state_array[].
3. Turn the "while (state)" loop into fls(state).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. Remove fs/coredump.h. It is not clear why do we need it,
it only declares __get_dumpable(), signal.c includes it
for no reason.
2. Now that get_dumpable() and __get_dumpable() are really
trivial make them inline in linux/sched.h.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alex Kelly <alex.page.kelly@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody actually needs MMF_DUMPABLE/MMF_DUMP_SECURELY, they are only used
to enforce the encoding of SUID_DUMP_* enum in mm->flags &
MMF_DUMPABLE_MASK.
Now that set_dumpable() updates both bits atomically we can kill them and
simply store the value "as is" in 2 lower bits.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alex Kelly <alex.page.kelly@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
set_dumpable() updates MMF_DUMPABLE_MASK in a non-trivial way to ensure
that get_dumpable() can't observe the intermediate state, but this all
can't help if multiple threads call set_dumpable() at the same time.
And in theory commit_creds()->set_dumpable(SUID_DUMP_ROOT) racing with
sys_prctl()->set_dumpable(SUID_DUMP_DISABLE) can result in SUID_DUMP_USER.
Change this code to update both bits atomically via cmpxchg().
Note: this assumes that it is safe to mix bitops and cmpxchg. IOW, if,
say, an architecture implements cmpxchg() using the locking (like
arch/parisc/lib/bitops.c does), then it should use the same locks for
set_bit/etc.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alex Kelly <alex.page.kelly@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
HFS+ resource fork lookup breaks opendir() library function. Since
opendir first calls open() with O_DIRECTORY flag set. O_DIRECTORY means
"refuse to open if not a directory". The open system call in the kernel
does a check for inode->i_op->lookup and returns -ENOTDIR. So if
hfsplus_file_lookup is set it allows opendir() for plain files.
Also resource fork lookup in HFS+ does not work. Since it is never
invoked after VFS permission checking. It will always return with
-EACCES.
When we call opendir() on a file, it does not return NULL. opendir()
library call is based on open with O_DIRECTORY flag passed and then
layered on top of getdents() system call. O_DIRECTORY means "refuse to
open if not a directory".
The open() system call in the kernel does a check for: do_sys_open()
-->..--> can_lookup() i.e it only checks inode->i_op->lookup and returns
ENOTDIR if this function pointer is not set.
In OSX, we can open "file/rsrc" to get the resource fork of "file". This
behavior is emulated inside hfsplus on Linux, which means that to some
degree every file acts like a directory. That is the reason lookup()
inode operations is supported for files, and it is possible to do a lookup
on this specific name. As a result of this open succeeds without
returning ENOTDIR for HFS+
Please see the LKML discussion thread on this issue:
http://marc.info/?l=linux-fsdevel&m=122823343730412&w=2
I tried to test file/rsrc lookup in HFS+ driver and the feature does not
work. From OSX:
$ touch test
$ echo "1234" > test/..namedfork/rsrc
$ ls -l test..namedfork/rsrc
--rw-r--r-- 1 tuxera staff 5 10 dec 12:59 test/..namedfork/rsrc
[sougata@ultrabook tmp]$ id
uid=1000(sougata) gid=1000(sougata) groups=1000(sougata),5(tty),18(dialout),1001(vboxusers)
[sougata@ultrabook tmp]$ mount
/dev/sdb1 on /mnt/tmp type hfsplus (rw,relatime,umask=0,uid=1000,gid=1000,nls=utf8)
[sougata@ultrabook tmp]$ ls -l test/rsrc
ls: cannot access test/rsrc: Permission denied
According to this LKML thread it is expected behavior.
http://marc.info/?t=121139033800008&r=1&w=4
I guess now that permission checking happens in vfs generic_permission() ?
So it turns out that even though the lookup() inode_operation exists for
HFS+ files. It cannot really get invoked ?. So if we can disable this
feature to make opendir() work for HFS+.
Signed-off-by: Sougata Santra <sougata@tuxera.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add comments for ioctls in fs/nilfs2/ioctl.c file and describe NILFS2
specific ioctls in Documentation/filesystems/nilfs2.txt.
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Wenliang Fan <fanwlexca@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The local variable 'pos' in nilfs_ioctl_wrap_copy function can overflow if
a large number was passed to argv->v_index from userspace and the sum of
argv->v_index and argv->v_nmembs exceeds the maximum value of __u64 type
integer (= ~(__u64)0 = 18446744073709551615).
Here, argv->v_index is a 64-bit width argument to specify the start
position of target data items (such as segment number, checkpoint number,
or virtual block address of nilfs), and argv->v_nmembs gives the total
number of the items that userland programs (such as lssu, lscp, or
cleanerd) want to get information about, which also gives the maximum
element count of argv->v_base[] array.
nilfs_ioctl_wrap_copy() calls dofunc() repeatedly and increments the
position variable 'pos' at the end of each iteration if dofunc() itself
didn't update 'pos':
if (pos == ppos)
pos += n;
This patch prevents the overflow here by rejecting pairs of a start
position (argv->v_index) and a total count (argv->v_nmembs) which leads to
the overflow.
[konishi.ryusuke@lab.ntt.co.jp: fix signedness issue]
Signed-off-by: Wenliang Fan <fanwlexca@gmail.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pipe has no data associated with fs so it is not good idea to block
pipe_write() if FS is frozen, but we can not update file's time on such
filesystem. Let's use same idea as we use in touch_time().
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=65701
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The autofs4 module doesn't consider symlinks for expire as it did in the
older autofs v3 module (so it's actually a long standing regression).
The user space daemon has focused on the use of bind mounts instead of
symlinks for a long time now and that's why this has not been noticed.
But with the future addition of amd map parsing to automount(8), not to
mention amd itself (of am-utils), symlink expiry will be needed.
The direct and offset mount types can't be symlinks and the tree mounts of
version 4 were always real mounts so only indirect mounts need expire
symlinks.
Since the current users of the autofs4 module haven't reported this as a
problem to date this patch probably isn't a candidate for backport to
stable.
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the helper macro !IS_ROOT to replace parent != dentry->d_parent. Just
clean up.
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While kzallocing sbi/ino fails, it should return -ENOMEM.
And it should return the err value from autofs_prepare_pipe.
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PID and the TGID of the process triggering the mount are sent to the
daemon. Currently the global pid values are sent (ones valid in the
initial pid namespace) but this is wrong if the autofs daemon itself is
not running in the initial pid namespace.
So send the pid values that are valid in the namespace of the autofs
daemon.
The namespace to use is taken from the oz_pgrp pid pointer, which was
set at mount time to the mounting process' pid namespace.
If the pid translation fails (the triggering process is in an unrelated
pid namespace) then the automount fails with ENOENT.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable autofs4 to work in a "container". oz_pgrp is converted from
pid_t to struct pid and this is stored at mount time based on the
"pgrp=" option or if the option is missing then the current pgrp.
The "pgrp=" option is interpreted in the PID namespace of the current
process. This option is flawed in that it doesn't carry the namespace
information, so it should be deprecated. AFAICS the autofs daemon
always sends the current pgrp, which is the default anyway.
The oz_pgrp is also set from the AUTOFS_DEV_IOCTL_SETPIPEFD_CMD ioctl.
This ioctl sets oz_pgrp to the current pgrp. It is not allowed to
change the pid namespace.
oz_pgrp is used mainly to determine whether the process traversing the
autofs mount tree is the autofs daemon itself or not. This function now
compares the pid pointers instead of the pid_t values.
One other use of oz_pgrp is in autofs4_show_options. There is shows the
virtual pid number (i.e. the one that is valid inside the PID namespace
of the calling process)
For debugging printk convert oz_pgrp to the value in the initial pid
namespace.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ramfs_aops is identical in file-mmu.c and file-nommu.c. Thus move it to
fs/ramfs/inode.c and make it static.
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 853ac43ab1 ("shmem: unify regular and tiny shmem"),
ramfs_nommu_get_unmapped_area() and ramfs_nommu_mmap() are not directly
referenced outside of file-nommu.c. Thus make them static.
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These two defines are unused since the removal of the a.out interpreter
support in the ELF loader in kernel 2.6.25
Signed-off-by: Todor Minchev <todor@minchev.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the definition is centralized in <linux/kernel.h>, the
definitions of U32_MAX (and related) elsewhere in the kernel can be
removed.
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Sage Weil <sage@inktank.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The symbol U32_MAX is defined in several spots. Change these
definitions to be conditional. This is in preparation for the next
patch, which centralizes the definition in <linux/kernel.h>.
Signed-off-by: Alex Elder <elder@linaro.org>
Cc: Sage Weil <sage@inktank.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In get_mapping_page(), after calling find_or_create_page(), the return
value should be checked.
This patch has been provided:
http://www.spinics.net/lists/linux-fsdevel/msg66948.html but not been
applied now.
Signed-off-by: Younger Liu <liuyiyang@hisense.com>
Cc: Younger Liu <younger.liucn@gmail.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Reviewed-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Jörn Engel <joern@logfs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
stable_page_flags() checks !PageHuge && PageTransCompound && PageLRU to
know that a specified page is thp or not. But sometimes it's not enough
and we fail to detect thp when the thp is on pagevec. This happens only
for a few seconds after LRU list operations, but it makes it difficult
to control our applications depending on this flag.
So this patch adds another check PageAnon to detect thps on pagevec. It
might not give the future extensibility for thp pagecache, but it's OK
at least for now.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We stick an extra svc_fh in nfsd3_readdirres to save the need to
kmalloc, though maybe it would be fine to kmalloc instead.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull UDF & jbd fixes from Jan Kara:
"A cleanup of JBD log messages and UDF fix of a lockdep warning"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix lockdep warning from udf_symlink()
jbd: Revise KERN_EMERG error messages
Pull fuse update from Miklos Szeredi:
"This contains a fix for a potential use-after-module-unload bug
noticed by Al and caching improvements for read-only fuse filesystems
by Andrew Gallagher"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: support clients that don't implement 'open'
fuse: don't invalidate attrs when not using atime
fuse: fix SetPageUptodate() condition in STORE
fuse: fix pipe_buf_operations
This patch-set includes the following major enhancement patches.
o support inline_data
o refactor bio operations such as merge operations and rw type assignment
o enhance the direct IO path
o enhance bio operations
o truncate a node page when it becomes obsolete
o add sysfs entries: small_discards, max_victim_search, and in-place-update
o add a sysfs entry to control max_victim_search
The other bug fixes are as follows.
o fix a bug in truncate_partial_nodes
o avoid warnings during sparse and build process
o fix error handling flows
o fix potential bit overflows
And, there are a bunch of cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJS4HQfAAoJEEAUqH6CSFDSyyMP/iUXSMC9yw6eOmSjAh3boc6+
C7e4zrhdovekGTuZgg41SLdr83cpbEohv11wcXAfxB+eYFEz0zrAVzt54zMi7uOL
9JmFJ6XVL/T3omI5hpEwWHg6S6tOynN6mcjacsrvypEekgjHbbpLudSw6SCu3dKz
Lpc3z6CxrWbhvX8Iyf1j8mCceWkTO6eRv7u2H4Njtsq4Tukw3BHiBsURXt6kGwpx
CvRBgCFdQhv4GAtbDosmVjNWOUxvik7w2epHAPQGddFTgaCL9uS+gfweHK6H9EDp
1e3BDhmn5r9IhiLY8KVXRc8+po9kQeO1jNQATBuWggfjJSGbEBmrEQX4MFE3uCi9
q84hGV9+yaJxoT2A21qIeWgorF9gjqNbnrrENKHyKhOqXJSrh48u5LUV8KqIyz1Y
Qw62cypEB+PQxWegN76vwX/OrHMCLYMQ6c78bYLSwkBKonOrF5sN2+kJW5+zEj6n
q2cYi1PLMJe7LTcULUrxJTSPFLKM5yA2oYZq3LN4sUYBeN6USaouaIqcZBqRBTCO
adqlTa3sWytkDMAHsTpwrHABKK7pwiZoPLDVwjo0TIJ6Us4JhDtTktp5pj24fQ7Y
6lC9w4VbfAKtq8fMV17rZYD0lQFlmZk4uQRJ8XYicCRFx11kMPKYzdGmP5aVXWru
wxcztktnABtCAXK0PFLf
=gVDh
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, a couple of sysfs entries were introduced to tune the
f2fs at runtime.
In addition, f2fs starts to support inline_data and improves the
read/write performance in some workloads by refactoring bio-related
flows.
This patch-set includes the following major enhancement patches.
- support inline_data
- refactor bio operations such as merge operations and rw type
assignment
- enhance the direct IO path
- enhance bio operations
- truncate a node page when it becomes obsolete
- add sysfs entries: small_discards, max_victim_search, and
in-place-update
- add a sysfs entry to control max_victim_search
The other bug fixes are as follows.
- fix a bug in truncate_partial_nodes
- avoid warnings during sparse and build process
- fix error handling flows
- fix potential bit overflows
And, there are a bunch of cleanups"
* tag 'for-f2fs-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (95 commits)
f2fs: drop obsolete node page when it is truncated
f2fs: introduce NODE_MAPPING for code consistency
f2fs: remove the orphan block page array
f2fs: add help function META_MAPPING
f2fs: move a branch for code redability
f2fs: call mark_inode_dirty to flush dirty pages
f2fs: clean checkpatch warnings
f2fs: missing REQ_META and REQ_PRIO when sync_meta_pages(META_FLUSH)
f2fs: avoid f2fs_balance_fs call during pageout
f2fs: add delimiter to seperate name and value in debug phrase
f2fs: use spinlock rather than mutex for better speed
f2fs: move alloc new orphan node out of lock protection region
f2fs: move grabing orphan pages out of protection region
f2fs: remove the needless parameter of f2fs_wait_on_page_writeback
f2fs: update documents and a MAINTAINERS entry
f2fs: add a sysfs entry to control max_victim_search
f2fs: improve write performance under frequent fsync calls
f2fs: avoid to read inline data except first page
f2fs: avoid to left uninitialized data in page when read inline data
f2fs: fix truncate_partial_nodes bug
...
For 3.14-rc1 there are fixes in the areas of remote attributes, discard,
growfs, memory leaks in recovery, directory v2, quotas, the MAINTAINERS
file, allocation alignment, extent list locking, and in
xfs_bmapi_allocate. There are cleanups in xfs_setsize_buftarg, removing
unused macros, quotas, setattr, and freeing of inode clusters. The
in-memory and on-disk log format have been decoupled, a common helper to
calculate the number of blocks in an inode cluster has been added, and
handling of i_version has been pulled into the filesystems that use it.
- cleanup in xfs_setsize_buftarg
- removal of remaining unused flags for vop toss/flush/flushinval
- fix for memory corruption in xfs_attrlist_by_handle
- fix for out-of-date comment in xfs_trans_dqlockedjoin
- fix for discard if range length is less than one block
- fix for overrun of agfl buffer using growfs on v4 superblock filesystems
- pull i_version handling out into the filesystems that use it
- don't leak recovery items on error
- fix for memory leak in xfs_dir2_node_removename
- several cleanups for quotas
- fix bad assertion in xfs_qm_vop_create_dqattach
- cleanup for xfs_setattr_mode, and add xfs_setattr_time
- fix quota assert in xfs_setattr_nonsize
- fix an infinite loop when turning off group/project quota before user
quota
- fix for temporary buffer allocation failure in xfs_dir2_block_to_sf
with large directory block sizes
- fix Dave's email address in MAINTAINERS
- cleanup calculation of freed inode cluster blocks
- fix alignment of initial file allocations to match filesystem geometry
- decouple in-memory and on-disk log format
- introduce a common helper to calculate the number of filesystem
blocks in an inode cluster
- fixes for extent list locking
- fix for off-by-one in xfs_attr3_rmt_verify
- fix for missing destroy_work_on_stack in xfs_bmapi_allocate
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJS4C2sAAoJENaLyazVq6ZOF/UP/A/FmscsEbgz+KHtsg1UXP+g
EMkrD8WOOzLnm7/GhkfvmmFLQrWrrNPmiP54MPJ/rxpfDb7rV60HgS0iHm13U+IR
0uPyk2NIQKG/Sj6FHCrgn4oh19B0tmer6lLE32UPrlNM7w/0NRm32Ms/jKz7WNvc
9Tqw3qNdtmP7leHG8EyD1ISzqUz6FNSC+qGyOTuBQUqG24LqsmZ1qD2Nw/HEz0ir
1YCqS+cjj8c3WaWMsdjEFdCvEIKS/sMYM+ZihATIl/5ggwtVobP7PH/4cG8Biby3
5wvcl3/jM+xjgGDC1YMQQeSADieus6ER4aZTjCvGLn9RajQTOBjP0QZTu2OHntTI
kD/52DfYErthGijS2qJcqXy11AaYtWQU2yTZXuMpaACIM1DDSD4NyGFP99BzbBaX
D4BgnC+vmfpbXO37PmophkeZwAvj+9K2BBG7X+g4sLynj/BtmvFCxIyK+SLHE3hN
kn+Pn03yTw0VGgvj0krXSllbYKE0LyLElaA6h0LejQgcOZM0W6Q8qaj+RODLitbw
HkQNKYCOMCzbdNu8rKAfup9UPZloiMukyMdMaw1MeCgCQSQLkZVxTegmVW9gzpIh
QJEARDaOnKB/G/CLCwbPZR09DBpg3yBt/eHjt0VnV+z2x+u9DTInAp6f013g7E/W
gU+vnBBPY1AYI8iiDraU
=t8nB
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs
Pull xfs update from Ben Myers:
"This is primarily bug fixes, many of which you already have. New
stuff includes a series to decouple the in-memory and on-disk log
format, helpers in the area of inode clusters, and i_version handling.
We decided to try to use more topic branches this release, so there
are some merge commits in there on account of that. I'm afraid I
didn't do a good job of putting meaningful comments in the first
couple of merges. Sorry about that. I think I have the hang of it
now.
For 3.14-rc1 there are fixes in the areas of remote attributes,
discard, growfs, memory leaks in recovery, directory v2, quotas, the
MAINTAINERS file, allocation alignment, extent list locking, and in
xfs_bmapi_allocate. There are cleanups in xfs_setsize_buftarg,
removing unused macros, quotas, setattr, and freeing of inode
clusters. The in-memory and on-disk log format have been decoupled, a
common helper to calculate the number of blocks in an inode cluster
has been added, and handling of i_version has been pulled into the
filesystems that use it.
- cleanup in xfs_setsize_buftarg
- removal of remaining unused flags for vop toss/flush/flushinval
- fix for memory corruption in xfs_attrlist_by_handle
- fix for out-of-date comment in xfs_trans_dqlockedjoin
- fix for discard if range length is less than one block
- fix for overrun of agfl buffer using growfs on v4 superblock
filesystems
- pull i_version handling out into the filesystems that use it
- don't leak recovery items on error
- fix for memory leak in xfs_dir2_node_removename
- several cleanups for quotas
- fix bad assertion in xfs_qm_vop_create_dqattach
- cleanup for xfs_setattr_mode, and add xfs_setattr_time
- fix quota assert in xfs_setattr_nonsize
- fix an infinite loop when turning off group/project quota before
user quota
- fix for temporary buffer allocation failure in xfs_dir2_block_to_sf
with large directory block sizes
- fix Dave's email address in MAINTAINERS
- cleanup calculation of freed inode cluster blocks
- fix alignment of initial file allocations to match filesystem
geometry
- decouple in-memory and on-disk log format
- introduce a common helper to calculate the number of filesystem
blocks in an inode cluster
- fixes for extent list locking
- fix for off-by-one in xfs_attr3_rmt_verify
- fix for missing destroy_work_on_stack in xfs_bmapi_allocate"
* tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs: (51 commits)
xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
xfs: fix off-by-one error in xfs_attr3_rmt_verify
xfs: assert that we hold the ilock for extent map access
xfs: use xfs_ilock_attr_map_shared in xfs_attr_list_int
xfs: use xfs_ilock_attr_map_shared in xfs_attr_get
xfs: use xfs_ilock_data_map_shared in xfs_qm_dqiterate
xfs: use xfs_ilock_data_map_shared in xfs_qm_dqtobp
xfs: take the ilock around xfs_bmapi_read in xfs_zero_remaining_bytes
xfs: reinstate the ilock in xfs_readdir
xfs: add xfs_ilock_attr_map_shared
xfs: rename xfs_ilock_map_shared
xfs: remove xfs_iunlock_map_shared
xfs: no need to lock the inode in xfs_find_handle
xfs: use xfs_icluster_size_fsb in xfs_imap
xfs: use xfs_icluster_size_fsb in xfs_ifree_cluster
xfs: use xfs_icluster_size_fsb in xfs_ialloc_inode_init
xfs: use xfs_icluster_size_fsb in xfs_bulkstat
xfs: introduce a common helper xfs_icluster_size_fsb
xfs: get rid of XFS_IALLOC_BLOCKS macros
xfs: get rid of XFS_INODE_CLUSTER_SIZE macros
...
In debug mode exofs is too verbose. Hiding the real problems
remove some trivial stuff.
Also fix some other prints.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
If there was an error in fetching an object or extracting
inode info from attributes. Which means corrupted storage.
Let it be an empty ZERO dated directory entry so it can be
deleted. Otherwise the all directory will be inaccessible.
This does not loose data, because if there is an orphan object
somewhere it will be recovered by fschk. But usually this only
means corrupted dir entry. The object was never generated and
only its link exist. This way we can delete the bad entry.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
With this minimal do nothing patch an application can open O_DIRECT
and then actually do buffered sync IO instead. But the aio API is
supported which is a good thing
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
In the case of target returning OSD_ERR_PRI_CLEAR_PAGES when we
only sent for attributes don't crash on NULL bio.
This is an osd-target bug but don't crash regardless
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
At IO preparation we calculate the max pages at each device and
allocate a BIO per device of that size. The calculation was wrong
on some unaligned corner cases offset/length combination and would
make prepare return with -ENOMEM. This would be bad for pnfs-objects
that would in that case IO through MDS. And fatal for exofs were it
would fail writes with EIO.
Fix it by doing the proper math, that will work in all cases. (I
ran a test with all possible offset/length combinations this time
round).
Also when reading we do not need to allocate for the parity units
since we jump over them.
Also lower the max_io_length to take into account the parity pages
so not to allocate BIOs bigger than PAGE_SIZE
CC: Stable Kernel <stable@vger.kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Pull trivial tree updates from Jiri Kosina:
"Usual rocket science stuff from trivial.git"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
neighbour.h: fix comment
sched: Fix warning on make htmldocs caused by wait.h
slab: struct kmem_cache is protected by slab_mutex
doc: Fix typo in USB Gadget Documentation
of/Kconfig: Spelling s/one/once/
mkregtable: Fix sscanf handling
lp5523, lp8501: comment improvements
thermal: rcar: comment spelling
treewide: fix comments and printk msgs
IXP4xx: remove '1 &&' from a condition check in ixp4xx_restart()
Documentation: update /proc/uptime field description
Documentation: Fix size parameter for snprintf
arm: fix comment header and macro name
asm-generic: uaccess: Spelling s/a ny/any/
mtd: onenand: fix comment header
doc: driver-model/platform.txt: fix a typo
drivers: fix typo in DEVTMPFS_MOUNT Kconfig help text
doc: Fix typo (acces_process_vm -> access_process_vm)
treewide: Fix typos in printk
drivers/gpu/drm/qxl/Kconfig: reformat the help text
...
An NFS4ERR_RECALLCONFLICT is returned by server from a GET_LAYOUT
only when a Server Sent a RECALL do to that GET_LAYOUT, or
the RECALL and GET_LAYOUT crossed on the wire.
In any way this means we want to wait at most until in-flight IO
is finished and the RECALL can be satisfied.
So a proper wait here is more like 1/10 of a second, not 15 seconds
like we have now. In case of a server bug we delay exponentially
longer on each retry.
Current code totally craps out performance of very large files on
most pnfs-objects layouts, because of how the map changes when the
file has grown into the next raid group.
[Stable: This will patch back to 3.9. If there are earlier still
maintained trees, please tell me I'll send a patch]
CC: Stable Tree <stable@vger.kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a node page is trucated, we'd better drop the page in the node_inode's page
cache for better memory footprint.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
open/release operations require userspace transitions to keep track
of the open count and to perform any FS-specific setup. However,
for some purely read-only FSs which don't need to perform any setup
at open/release time, we can avoid the performance overhead of
calling into userspace for open/release calls.
This patch adds the necessary support to the fuse kernel modules to prevent
open/release operations from hitting in userspace. When the client returns
ENOSYS, we avoid sending the subsequent release to userspace, and also
remember this so that future opens also don't trigger a userspace
operation.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Various read operations (e.g. readlink, readdir) invalidate the cached
attrs for atime changes. This patch adds a new function
'fuse_invalidate_atime', which checks for a read-only super block and
avoids the attr invalidation in that case.
Signed-off-by: Andrew Gallagher <andrewjcg@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
As noticed by Coverity the "num != 0" condition never triggers. Instead it
should check for a complete page.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Having this struct in module memory could Oops when if the module is
unloaded while the buffer still persists in a pipe.
Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal
merge them into nosteal_pipe_buf_ops (this is the same as
default_pipe_buf_ops except stealing the page from the buffer is not
allowed).
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
This patch adds NODE_MAPPING which is similar as META_MAPPING introduced by
Gu Zheng.
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
As the orphan_blocks may be max to 504, so it is not security
and rigorous to store such a large array in the kernel stack
as Dan Carpenter said.
In fact, grab_meta_page has locked the page in the page cache,
and we can use find_get_page() to fetch the page safely in the
downstream, so we can remove the page array directly.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce help function META_MAPPING() to get the cache meta blocks'
address space.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If a dentry page is updated, we should call mark_inode_dirty to add the inode
into the dirty list, so that its dentry pages are flushed to the disk.
Otherwise, the inode can be evicted without flush.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch addresses a bug in bio_integrity_verify() code that has
been causing DIF READ verify operations to be silently skipped.
The issue is that bio->bi_idx will have been incremented within
bio_advance() code in the normal blk_update_request() ->
req_bio_endio() completion path, and bio_integrity_verify() is
using bio_for_each_segment() which starts the bio segment walk
at the current bio->bi_idx.
So instead use bio_for_each_segment_all() to always start the bio
segment walk from zero, regardless of the current bio->bi_idx
value after bio_advance() has been called.
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: stable@kernel.dk # >= v3.10
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge first patch-bomb from Andrew Morton:
- a couple of misc things
- inotify/fsnotify work from Jan
- ocfs2 updates (partial)
- about half of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
mm/migrate: remove unused function, fail_migrate_page()
mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
mm/migrate: correct failure handling if !hugepage_migration_support()
mm/migrate: add comment about permanent failure path
mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
mm: compaction: reset scanner positions immediately when they meet
mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
mm: compaction: detect when scanners meet in isolate_freepages
mm: compaction: reset cached scanner pfn's before reading them
mm: compaction: encapsulate defer reset logic
mm: compaction: trace compaction begin and end
memcg, oom: lock mem_cgroup_print_oom_info
sched: add tracepoints related to NUMA task migration
mm: numa: do not automatically migrate KSM pages
mm: numa: trace tasks that fail migration due to rate limiting
mm: numa: limit scope of lock for NUMA migrate rate limiting
mm: numa: make NUMA-migrate related functions static
lib/show_mem.c: show num_poisoned_pages when oom
mm/hwpoison: add '#' to hwpoison_inject
mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter
...
Redefined {lock|release}_sock to sctp_{lock|release}_sock for user space friendly
code which we haven't use in years, so removing them.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This set includes a single change to speed up
recovery times when using SCTP connections.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJS3rFtAAoJEDgbc8f8gGmqLDcP/RHy6yJD25Tqcww8CrE+wZj5
b8sf9zsZR0tbT/L9LiB7F3gwoHoS5BUBs7vq8PUX/qAWLgw5dIL9ZqWBujN0M9pr
AOIyAwQtY1U9jINunfYXc4EXfLHYO9pbW7ZJuieg+Tls5meb+werBEbnmNsETht8
NYLcVYQ1VnxR6ohnTzUsheQ5lV8J5AKBYC53ZH32gU1Ua1d6xnxl0Fvq2gwM5uOg
Xxo5ugpes5NZLwcUB+Smsf6Fqq3z3oyJkWWh3Vq8BJSduOcNDVd45HXiDM7D/LT9
S6TJAbUH+nTesb4FOP7ew8OewxNJN0TbygJ4CVHvbiM3+7jGJi6esfurNyuATGQ5
VJ18el9T46k5D9s309aa3oNdd1CUbAUJ4eEJU+5/nssuVaAt8H0ETfvme7uIXlCp
Pyws08NDT3U91q6Xe1R1FTPFdu0DmqyM9uvMFrtNGMXQxqxXbwco70Y/3HJGhHxb
Uk8mOtzjCvVN0gAvB5fB71qNBWBmVsO8M10TsCErbjXb7HJkBKQIq/O5eVRDIVGZ
jBenSj6nK3ouQUTNFRas+6pQRR9Pa3m0WH4o9Gj2aCMp48XAc1TKd1ySxRuIplWQ
JnJg3BOFpILOqwCNVPXX6TRiWJKdtFrCAm/Q484AChh5GQjhyPWeQg/hiNS8SxG7
UXmHPpAUuoG7LcPfuabr
=/Wic
-----END PGP SIGNATURE-----
Merge tag 'dlm-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm update from David Teigland:
"A single change to speed up recovery times when using SCTP
connections"
* tag 'dlm-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: set zero linger time on sctp socket
improvements when searching resource groups and several updates
to quotas which should increase scalability. The quota changes
follow on from those in the last merge window, and there will
likely be further work to come in this area in due course.
There are also a few patches which help to improve efficiency
of adding entries into directories, and clean up some of that
code.
One on-disk change is included this time, which is to write some
additional information which should be useful to fsck and
also potentially for debugging.
Other than that, its just a few small random bug fixes and
clean ups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJS3RZUAAoJEMrg3m4a/8jS/C4QAKAoGbGH8UxG2fNRUWK94UBL
lxxn0nUaQ7bwKpIbysEO6HHg2rqcfjRSdO/H6kfD51Fgg39RfIfvXF9woJw/5Is6
1xcn3faBl3dQhIq81+lSKnEJnJbQ49dbIcgxt1FGqPI8ZY6st/CBvnvqh0TRAcvN
7t0TYSPxmA/7BnwynstTbpK2l9uUisau8z12QzihJ6lRiR7QlfFSUq6Xs/ICSmkE
LE2dC6gXMIIDyMpPX5Fg9NQNEAPKOzMwduUaHnv2vpuh1jkk3ObHg4mUkYtdoUZx
Hbs1Ey5oqRiMs03I5wm1ayZg+6i/pT7HTM5vbrunTa21AMmJ0NTFdZ4ihVZqZwu4
ynpXY0OT/U7CpFyx6FSKtlq0kzwx5A51mL9bpSlLU9pggJY6NHQdmSJ9fdLW9cdS
Wwi2Br/0rBvIhm0ejy8kxK904DhqCGsH+RXbfOc8J4VBE4R/z5eMmBG9UqMm7c+7
hO6FhpJo9QsNMe19A8IANzG2gT+PcpgrXs2GTS1QFGhFgn08+k3RLDT/SrGRxU6r
MaCM9rms0SgMr2LX6MEPZdo3LJgBTNr1rKDEIzQ7BUXs3FHJ7U0swuHOTDXFVm51
gQanPYmVvFV7aiom3ExUhegDefBu1jMnY26SVxxLAbWN1Ek2Ikg6rO6wx3ckXvYo
SKY3qAewE5pNpHe7u6SS
=/tv0
-----END PGP SIGNATURE-----
Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull GFS2 updates from Steven Whitehouse:
"The main topics this time are allocation, in the form of Bob's
improvements when searching resource groups and several updates to
quotas which should increase scalability. The quota changes follow on
from those in the last merge window, and there will likely be further
work to come in this area in due course.
There are also a few patches which help to improve efficiency of
adding entries into directories, and clean up some of that code.
One on-disk change is included this time, which is to write some
additional information which should be useful to fsck and also
potentially for debugging.
Other than that, its just a few small random bug fixes and clean ups"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (24 commits)
GFS2: revert "GFS2: d_splice_alias() can't return error"
GFS2: Small cleanup
GFS2: Don't use ENOBUFS when ENOMEM is the correct error code
GFS2: Fix kbuild test robot reported warning
GFS2: Move quota bitmap operations under their own lock
GFS2: Clean up quota slot allocation
GFS2: Only run logd and quota when mounted read/write
GFS2: Use RCU/hlist_bl based hash for quotas
GFS2: No need to invalidate pages for a dio read
GFS2: Add initialization for address space in super block
GFS2: Add hints to directory leaf blocks
GFS2: For exhash conversion, only one block is needed
GFS2: Increase i_writecount during gfs2_setattr_chown
GFS2: Remember directory insert point
GFS2: Consolidate transaction blocks calculation for dir add
GFS2: Add directory addition info structure
GFS2: Use only a single address space for rgrps
GFS2: Use range based functions for rgrp sync/invalidation
GFS2: Remove test which is always true
GFS2: Remove gfs2_quota_change_host structure
...
Many load balancing and workload placing programs check /proc/meminfo to
estimate how much free memory is available. They generally do this by
adding up "free" and "cached", which was fine ten years ago, but is
pretty much guaranteed to be wrong today.
It is wrong because Cached includes memory that is not freeable as page
cache, for example shared memory segments, tmpfs, and ramfs, and it does
not include reclaimable slab memory, which can take up a large fraction
of system memory on mostly idle systems with lots of files.
Currently, the amount of memory that is available for a new workload,
without pushing the system into swap, can be estimated from MemFree,
Active(file), Inactive(file), and SReclaimable, as well as the "low"
watermarks from /proc/zoneinfo.
However, this may change in the future, and user space really should not
be expected to know kernel internals to come up with an estimate for the
amount of free memory.
It is more convenient to provide such an estimate in /proc/meminfo. If
things change in the future, we only have to change it in one place.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Erik Mouw <erik.mouw_2@nxp.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ramfs is always built in. It will never be modular, so using
module_init as an alias for __initcall is rather misleading.
Fix this up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h
to obviously non-modular code, and that would be a worse thing.
Note that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of fs_initcall (which makes sense for fs code)
will thus change this registration from level 6-device to level 5-fs
(i.e. slightly earlier). However no observable impact of that small
difference has been observed during testing, or is expected.
Also note that this change uncovers a missing semicolon bug in the
registration of the initcall.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On fail path alloc_super() calls destroy_super(), which issues a warning
if the sb's s_mounts list is not empty, in particular if it has not been
initialized. That said s_mounts must be initialized in alloc_super()
before any possible failure, but currently it is initialized close to
the end of the function leading to a useless warning dumped to log if
either percpu_counter_init() or list_lru_init() fails. Let's fix this.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The compat_do_readv_writev() function was doing a verify_area on the
incoming iov, but the nr_segs value is not checked. If someone passes
in a -1 for nr_segs, for instance, the function should return an EINVAL.
However, it returns a EFAULT because the verify_area fails because it is
checking an array of size MAX_UINT. The check is bogus, anyway, because
the next check, compat_rw_copy_check_uvector(), will do all the
necessary checking, anyway. The non-compat do_readv_writev() function
doesn't do this check, so I think it's safe to just remove the code.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We cap "nmsgs" at I2C_RDRW_IOCTL_MAX_MSGS (42) but the current code
allows negative values. It's harmless but it makes my static checker
upset so I've made nsmgs unsigned.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uninline vast tracts of nested inline functions in
include/linux/posix_acl.h.
This reduces the text+data+bss size of x86_64 allyesconfig vmlinux by
8026 bytes.
The patch also regularises the positioning of the EXPORT_SYMBOLs in
posix_acl.c.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Benny Halevy <bhalevy@primarydata.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 nodes cluster, say Node A and Node B, mount the same ocfs2 volume, and
create a file 1.
Node A Node B
open 1, get open lock
rm 1, and then add 1 to orphan_dir
storage link down,
o2hb_write_timeout
->o2quo_disk_timeout
->emergency_restart
at the moment, Node B dismount and do
ocfs2rec simultaneously
1) ocfs2_dismount_volume
->ocfs2_recovery_exit
->wait_event(osb->recovery_event)
->flush_workqueue(ocfs2_wq)
2) ocfs2rec
->queue_work(&journal->j_recovery_work)
->ocfs2_recover_orphans
->ocfs2_commit_truncate
->queue_delayed_work(&osb->osb_truncate_log_wq)
In ocfs2_recovery_exit, it flushes workqueue and then releases system
inodes. When doing ocfs2rec, it will call ocfs2_flush_truncate_log
which will try to get sys_root_inode, and NULL pointer dereference
occurs.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce <xuejiufei@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An unreserve space ioctl OCFS2_IOC_UNRESVSP/64 should reject a negative
length.
Orabug:14789508
Signed-off-by: Tariq Saseed <tariq.x.saeed@oracle.com>
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes the following sparse warning:
fs/ocfs2/stack_user.c:930:32: warning:
symbol 'ocfs2_ls_ops' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adjust minlen with discard_granularity for FITRIM ioctl(2) if the given
minimum size in bytes is less than it because, discard granularity is
used to tell us that the minimum size of extent that can be discarded by
the storage device.
This is inspired by ext4 commit 5c2ed62fd4 ("ext4: Adjust minlen with
discard_granularity in the FITRIM ioctl") from Lukas Czerner.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For FITRIM ioctl(2), we should not keep silence if the given range
length ls less than a block size as there is no data blocks would be
discareded. Hence it should return EINVAL instead. This issue can be
verified via xfstests/generic/288 which is used for FITRIM argument
handling tests.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For FITRIM ioctl(2), we should return EOPNOTSUPP to inform the user that
the storage device does not support discard if it is, otherwise return
success would confuse the user even though there is no free blocks were
trimmed at all.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_alloc_dinode_update_counts() and ocfs2_block_group_set_bits() are
already provided in suballoc.c. So, the same functions in
move_extents.c are not needed any more.
Declare the functions in suballoc.h and remove redundant functions in
move_extents.c.
Signed-off-by: Younger Liu <liuyiyang@hisense.com>
Cc: Younger Liu <younger.liucn@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Attempt to use the new DLM operations. If it is not supported, use the
traditional ocfs2_controld.
To exchange ocfs2 versioning, we use the LVB of the version dlm lock.
It first attempts to take the lock in EX mode (non-blocking). If
successful (which means it is the first mount), it writes the version
number and downconverts to PR lock. If it is unsuccessful, it reads the
version from the lock.
If this becomes the standard (with o2cb as well), it could simplify
userspace tools to check if the filesystem is mounted on other nodes.
Dan: Since ocfs2_protocol_version are two u8 values, the additional
checks with LONG* don't make sense.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the native DLM locks for version control negotiation. Most of the
framework is taken from gfs2/lock_dlm.c
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is done to differentiate between using and not using controld and
use the connection information accordingly.
We need to be backward compatible. So, we use a new enum
ocfs2_connection_type to identify when controld is used and when it is
not.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We perform this because the DLM recovery callbacks will require the
ocfs2_live_connection structure to record the node information when
dlm_new_lockspace() is updated (in the last patch of the series).
Before calling dlm_new_lockspace(), we need the structure ready for the
.recover_done() callback, which would set oc_this_node. This is the
reason we allocate ocfs2_live_connection beforehand in user_connect().
[AKPM] rc initialization is not required because it assigned in case of
errors. It will be cleared by compiler anyways.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reveiwed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These are the callbacks called by the fs/dlm code in case the membership
changes. If there is a failure while/during calling any of these, the
DLM creates a new membership and relays to the rest of the nodes.
- recover_prep() is called when DLM understands a node is down.
- recover_slot() is called once all nodes have acknowledged
recover_prep and recovery can begin.
- recover_done() is called once the recovery is complete. It returns
the new membership.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM
handling up to the times with respect to DLM (>=4.0.1) and corosync
(2.3.x). AFAIK, cman also is being phased out for a unified corosync
cluster stack.
fs/dlm performs all the functions with respect to fencing and node
management and provides the API's to do so for ocfs2. For all future
references, DLM stands for fs/dlm code.
The advantages are:
+ No need to run an additional userspace daemon (ocfs2_controld)
+ No controld device handling and controld protocol
+ Shifting responsibilities of node management to DLM layer
For backward compatibility, we are keeping the controld handling code.
Once enough time has passed we can remove a significant portion of the
code. This was tested by using the kernel with changes on older
unmodified tools. The kernel used ocfs2_controld as expected, and
displayed the appropriate warning message.
This feature requires modification in the userspace ocfs2-tools. The
changes can be found at: https://github.com/goldwynr/ocfs2-tools branch:
nocontrold Currently, not many checks are present in the userspace code,
but that would change soon.
This patch (of 6):
Add clustername to cluster connection.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The versioning information is confusing for end-users. The numbers are
stuck at 1.5.0 when the tools version have moved to 1.8.2. Remove the
versioning system in the OCFS2 modules and let the kernel version be the
guide to debug issues.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We usually rely on the fact that struct members not specified in the
initializer are set to NULL. So do that with fsnotify function pointers
as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After removing event structure creation from the generic layer there is
no reason for separate .should_send_event and .handle_event callbacks.
So just remove the first one.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently fsnotify framework creates one event structure for each
notification event and links this event into all interested notification
groups. This is done so that we save memory when several notification
groups are interested in the event. However the need for event
structure shared between inotify & fanotify bloats the event structure
so the result is often higher memory consumption.
Another problem is that fsnotify framework keeps path references with
outstanding events so that fanotify can return open file descriptors
with its events. This has the undesirable effect that filesystem cannot
be unmounted while there are outstanding events - a regression for
inotify compared to a situation before it was converted to fsnotify
framework. For fanotify this problem is hard to avoid and users of
fanotify should kind of expect this behavior when they ask for file
descriptors from notified files.
This patch changes fsnotify and its users to create separate event
structure for each group. This allows for much simpler code (~400 lines
removed by this patch) and also smaller event structures. For example
on 64-bit system original struct fsnotify_event consumes 120 bytes, plus
additional space for file name, additional 24 bytes for second and each
subsequent group linking the event, and additional 32 bytes for each
inotify group for private data. After the conversion inotify event
consumes 48 bytes plus space for file name which is considerably less
memory unless file names are long and there are several groups
interested in the events (both of which are uncommon). Fanotify event
fits in 56 bytes after the conversion (fanotify doesn't care about file
names so its events don't have to have it allocated). A win unless
there are four or more fanotify groups interested in the event.
The conversion also solves the problem with unmount when only inotify is
used as we don't have to grab path references for inotify events.
[hughd@google.com: fanotify: fix corruption preventing startup]
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rounding of name length when passing it to userspace was done in several
places. Provide a function to do it and use it in all places.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cond_resched_lock(cinfo->lock) is called everywhere else while holding
the cinfo->lock spinlock. Not holding this lock while calling
transfer_commit_list in filelayout_recover_commit_reqs causes the BUG
below.
It's true that we can't hold this lock while calling pnfs_put_lseg,
because that might try to lock the inode lock - which might be the
same lock as cinfo->lock.
To reproduce, mount a 2 DS pynfs server and run an O_DIRECT command
that crosses a stripe boundary and is not page aligned, such as:
dd if=/dev/zero of=/mnt/f bs=17000 count=1 oflag=direct
BUG: sleeping function called from invalid context at linux/fs/nfs/nfs4filelayout.c:1161
in_atomic(): 0, irqs_disabled(): 0, pid: 27, name: kworker/0:1
2 locks held by kworker/0:1/27:
#0: (events){.+.+.+}, at: [<ffffffff810501d7>] process_one_work+0x175/0x3a5
#1: ((&dreq->work)){+.+...}, at: [<ffffffff810501d7>] process_one_work+0x175/0x3a5
CPU: 0 PID: 27 Comm: kworker/0:1 Not tainted 3.13.0-rc3-branch-dros_testing+ #21
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
Workqueue: events nfs_direct_write_schedule_work [nfs]
0000000000000000 ffff88007a39bbb8 ffffffff81491256 ffff88007b87a130 ffff88007a39bbd8 ffffffff8105f103 ffff880079614000 ffff880079617d40 ffff88007a39bc20 ffffffffa011603e ffff880078988b98 0000000000000000
Call Trace:
[<ffffffff81491256>] dump_stack+0x4d/0x66
[<ffffffff8105f103>] __might_sleep+0x100/0x105
[<ffffffffa011603e>] transfer_commit_list+0x94/0xf1 [nfs_layout_nfsv41_files]
[<ffffffffa01160d6>] filelayout_recover_commit_reqs+0x3b/0x68 [nfs_layout_nfsv41_files]
[<ffffffffa00ba53a>] nfs_direct_write_reschedule+0x9f/0x1d6 [nfs]
[<ffffffff810705df>] ? mark_lock+0x1df/0x224
[<ffffffff8106e617>] ? trace_hardirqs_off_caller+0x37/0xa4
[<ffffffff8106e691>] ? trace_hardirqs_off+0xd/0xf
[<ffffffffa00ba8f8>] nfs_direct_write_schedule_work+0x9d/0xb7 [nfs]
[<ffffffff810501d7>] ? process_one_work+0x175/0x3a5
[<ffffffff81050258>] process_one_work+0x1f6/0x3a5
[<ffffffff810501d7>] ? process_one_work+0x175/0x3a5
[<ffffffff8105187e>] worker_thread+0x149/0x1f5
[<ffffffff81051735>] ? rescuer_thread+0x28d/0x28d
[<ffffffff81056d74>] kthread+0xd2/0xda
[<ffffffff81056ca2>] ? __kthread_parkme+0x61/0x61
[<ffffffff8149e66c>] ret_from_fork+0x7c/0xb0
[<ffffffff81056ca2>] ? __kthread_parkme+0x61/0x61
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Version 3 cap export message includes information about the imported
caps. It allows us to add the imported caps if the corresponding cap
import message still hasn't been received.
This allow us to handle situation that the importer MDS crashes and
the cap import message is missing.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Version 3 cap import message includes the ID of the exported
caps. It allow us to remove the exported caps if we still haven't
received the corresponding cap export message.
We remove the exported caps because they are stale, keeping them
can compromise consistence.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Some inodes in readdir reply may have no caps. Getattr mds request
for these inodes can return -ESTALE. The fix is consider dentry that
links to inode with no caps as invalid. Invalid dentry causes a
lookup request to send to the mds, the MDS will send caps back.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Send requests that operate on path to directory's auth MDS if
mode == USE_AUTH_MDS. Always retry using the auth MDS if got
-ESTALE reply from non-auth MDS. Also clean up the code that
handles auth MDS change.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
- don't trim auth cap if there are flusing caps
- don't trim auth cap if any 'write' cap is wanted
- allow trimming non-auth cap even if the inode is dirty
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
handle following sequence of events:
- non-auth MDS revokes Fc cap. queue invalidate work
- auth MDS issues Fc cap through request reply. i_rdcache_gen gets
increased.
- invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen,
so it does nothing.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Here's the big driver core and sysfs patch set for 3.14-rc1.
There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc. This is primarily being done for
the cgroups filesystem, but the goal is to also move debugfs to it when
it is ready, solving all of the known issues in that filesystem as well.
The code isn't completed yet, but all should be stable now (there is a
big section that was reverted due to problems found when testing.)
There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be using
soon (it's in this tree to make merges after -rc1 easier.)
All of this has been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlLdh0cACgkQMUfUDdst+ylv4QCfeDKDgLo4LsaBIIrFSxLoH/c7
UUsAoMPRwA0h8wy+BQcJAg4H4J4maKj3
=0pc0
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / sysfs patches from Greg KH:
"Here's the big driver core and sysfs patch set for 3.14-rc1.
There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc)
This is primarily being done for the cgroups filesystem, but the goal
is to also move debugfs to it when it is ready, solving all of the
known issues in that filesystem as well. The code isn't completed
yet, but all should be stable now (there is a big section that was
reverted due to problems found when testing)
There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be
using soon (it's in this tree to make merges after -rc1 easier)
All of this has been in linux-next with no reported issues"
* tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits)
kernfs: associate a new kernfs_node with its parent on creation
kernfs: add struct dentry declaration in kernfs.h
kernfs: fix get_active failure handling in kernfs_seq_*()
Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"
Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"
Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"
Revert "kernfs: remove KERNFS_REMOVED"
Revert "kernfs: restructure removal path to fix possible premature return"
Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"
Revert "kernfs: remove kernfs_addrm_cxt"
Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"
Revert "kernfs: implement kernfs_{de|re}activate[_self]()"
Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"
Revert "pci: use device_remove_file_self() instead of device_schedule_callback()"
Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()"
Revert "s390: use device_remove_file_self() instead of device_schedule_callback()"
Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"
kernfs: remove unnecessary NULL check in __kernfs_remove()
drivers/base: provide an infrastructure for componentised subsystems
...
If clp is new (cl_count = 1) and it matches another client in
nfs4_discover_server_trunking, the nfs_put_client will free clp before
->cl_preserve_clid is set.
Cc: stable@vger.kernel.org # 3.7+
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Rename CIFSSMBOpen to CIFS_open and make it take
cifs_open_parms structure as a parm.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
Rename camel case variable and fix comment style.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
Remove indentation, fix comment style, rename camel case
variables in preparation to make it work with cifs_open_parms
structure as a parm.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
When using posix extensions, dfs shares in the dfs root show up as
symlinks resulting in userland tools such as 'ls' calling readlink() on
these shares. Since these are dfs shares, we end up returning -EREMOTE.
$ ls -l /mnt
ls: cannot read symbolic link /mnt/test: Object is remote
total 0
lrwxrwxrwx. 1 root root 19 Nov 6 09:47 test
With added follow_link() support for dfs shares, when using unix
extensions, we call GET_DFS_REFERRAL to obtain the DFS referral and
return the first node returned.
The dfs share in the dfs root is now displayed in the following manner.
$ ls -l /mnt
total 0
lrwxrwxrwx. 1 root root 19 Nov 6 09:47 test -> \vm140-31\test
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Unix extensions rigth now are only applicable to smb1 operations.
Move the check and subsequent unix extension call to the smb1
specific call to query_symlink() ie. cifs_query_symlink().
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
This patch makes cosmetic changes. We group similar functions together
and separate out the protocol specific functions.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Add a new protocol ops function create_mf_symlink and have
create_mf_symlink() use it.
This patchset moves the MFSymlink operations completely to the
ops structure so that we only use the right protocol versions when
querying or creating MFSymlinks.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
We have an existing protocol specific call query_mf_symlink() created
for check_mf_symlink which can also be used for query_mf_symlink().
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Clean up camel case in functionnames.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Rename open_query_close_cifs_symlink to cifs_query_mf_symlink() to make
the name more consistent with other protocol version specific functions.
We also pass tcon as an argument to the function. This is already
available in the calling functions and we can avoid having to make an
unnecessary lookup.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Fix a potential memory leak in the cifs_hardlink() error handling path.
Detected by Coverity: CID 728510, CID 728511.
Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Signed-off-by: Steve French <smfrench@gmail.com>
Fixed a variety of trivial checkpatch warnings. The only delta should
be some minor formatting on log strings that were split / too long.
Signed-off-by: Chris Fries <cfries@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Both nfs41_walk_client_list and nfs40_walk_client_list expect the
'status' variable to be set to the value -NFS4ERR_STALE_CLIENTID
if the loop fails to find a match.
The problem is that the 'pos->cl_cons_state > NFS_CS_READY' changes
the value of 'status', and sets it either to the value '0' (which
indicates success), or to the value EINTR.
Cc: stable@vger.kernel.org # 3.7.x: 7b1f1fd1842e6: NFSv4/4.1: Fix bugs in
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
0d0d110720 asserts that "d_splice_alias()
can't return error unless it was given an IS_ERR(inode)".
That was true of the implementation of d_splice_alias, but this is
really a problem with d_splice_alias: at a minimum it should be able to
return -ELOOP in the case where inserting the given dentry would cause a
directory loop.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull namespace fixes from Eric Biederman:
"This is a set of 3 regression fixes.
This fixes /proc/mounts when using "ip netns add <netns>" to display
the actual mount point.
This fixes a regression in clone that broke lxc-attach.
This fixes a regression in the permission checks for mounting /proc
that made proc unmountable if binfmt_misc was in use. Oops.
My apologies for sending this pull request so late. Al Viro gave
interesting review comments about the d_path fix that I wanted to
address in detail before I sent this pull request. Unfortunately a
bad round of colds kept from addressing that in detail until today.
The executive summary of the review was:
Al: Is patching d_path really sufficient?
The prepend_path, d_path, d_absolute_path, and __d_path family of
functions is a really mess.
Me: Yes, patching d_path is really sufficient. Yes, the code is mess.
No it is not appropriate to rewrite all of d_path for a regression
that has existed for entirely too long already, when a two line
change will do"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
vfs: Fix a regression in mounting proc
fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
vfs: In d_path don't call d_dname on a mount point
We should always make sure the cached page is up-to-date when we're
determining whether we can extend a write to cover the full page -- even
if we've received a write delegation from the server.
Commit c7559663 added logic to skip this check if we have a write
delegation, which can lead to data corruption such as the following
scenario if client B receives a write delegation from the NFS server:
Client A:
# echo 123456789 > /mnt/file
Client B:
# echo abcdefghi >> /mnt/file
# cat /mnt/file
0�D0�abcdefghi
Just because we hold a write delegation doesn't mean that we've read in
the entire page contents.
Cc: <stable@vger.kernel.org> # v3.11+
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Once created, a kernfs_node is always destroyed by kernfs_put().
Since ba7443bc65 ("sysfs, kernfs: implement
kernfs_create/destroy_root()"), kernfs_put() depends on kernfs_root()
to locate the ino_ida. kernfs_root() in turn depends on
kernfs_node->parent being set for !dir nodes. This means that
kernfs_put() of a !dir node requires its ->parent to be initialized.
This leads to oops when a newly created !dir node is destroyed without
going through kernfs_add_one() or after failing kernfs_add_one()
before ->parent is set. kernfs_root() invoked from kernfs_put() will
try to dereference NULL parent.
Fix it by moving parent association to kernfs_new_node() from
kernfs_add_one(). kernfs_new_node() now takes @parent instead of
@root and determines the root from the parent and also sets the new
node's parent properly. @parent parameter is removed from
kernfs_add_one(). As there's no parent when creating the root node,
__kernfs_new_node() which takes @root as before and doesn't set the
parent is used in that case.
This ensures that a kernfs_node in any stage in its life has its
parent associated and thus can be put.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
"disconnected" is too easily confused with "DCACHE_DISCONNECTED". I
think "unhashed" is the more precise term here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This is a small cleanup to function gfs2_rgrp_go_lock so that it
uses rgd instead of its more complicated twin.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Al Viro has tactfully pointed out that we are using the incorrect
error code in some cases. This patch fixes that, and also removes
the (unused) return value for glock dumping.
> * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
> "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
> we don't support STREAMS it's probably fair game, but... what the hell?
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Doing sync_meta_pages with META_FLUSH when checkpoint, we overide rw
using WRITE_FLUSH_FUA. At this time, we also should set
REQ_META|REQ_PRIO.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch should resolve the following bug.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.13.0-rc5.f2fs+ #6 Not tainted
---------------------------------------------------------
kswapd0/41 just changed the state of lock:
(&sbi->gc_mutex){+.+.-.}, at: [<ffffffffa030503e>] f2fs_balance_fs+0xae/0xd0 [f2fs]
but this lock took another, RECLAIM_FS-READ-unsafe lock in the past:
(&sbi->cp_rwsem){++++.?}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:
&sbi->gc_mutex --> &sbi->cp_mutex --> &sbi->cp_rwsem
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&sbi->cp_rwsem);
local_irq_disable();
lock(&sbi->gc_mutex);
lock(&sbi->cp_mutex);
<Interrupt>
lock(&sbi->gc_mutex);
*** DEADLOCK ***
This bug is due to the f2fs_balance_fs call in f2fs_write_data_page.
If f2fs_write_data_page is triggered by wbc->for_reclaim via kswapd, it should
not call f2fs_balance_fs which tries to get a mutex grabbed by original syscall
flow.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJS080OAAoJECvKgwp+S8JaIdUQAJKNZTzXKylUjUZty42t57Jh
1qRrQeJ6ha+JVSpYX4jJz/mSzUdJdjoFg7J3O54OnVFj/CnlcY7GRZj3VMel9ijf
uhlf8DcU6JsThcFK4Q6mqXtdAHDPkQ1jkQHLNCe7bow9AjCzHymAZWJix4YvEsXF
zeJJURMqSaJeo/44MynnXyn/h5RRhg+5HWErhoFiVUzDzHR3RoQqmt3lPVVJkdj1
iokHLMzGui2vs52vUJj2yx7m9kaoDx/6bJpqR61qHfk5S4wjLkUI+1ID8dsTNVF2
4O3THb0nUDWx4wuJIxrAKoPiYjiemX1KmQXlUVr3IsfhDiiBbLyviVyn4aRaFIxV
IRCVXCj1CWw+cFLeCA5E+/WvpxjLfKs4WNBxIqjes5YRPM4PLpU3MDiabssaUzHI
0VPbU8TQ05hqH0wbs0hIgXyvED6yNn9d3sPHS2Lb5i2tp3E0FzVEoh2EH2jn8lmQ
1DAdi+ezk9EiJs8AFiN6MSIBpAZosX3Nq+RTmYGKqLZMGnxlJ30YspNlipiBPFpC
4xokkMZAZ0+wzpVabOMie36Rc/AaOAqiOjS1C6UIoOSrBTgtwWL7Ft2Da3SKb0KX
XQhNWCHNYcgOn9/DDDmGxzwt6HsEzOIYinMwrG37LSass5KEvopssmiLCXn8wry+
QXUoiFFFAPpg8iXaqj4X
=AHdo
-----END PGP SIGNATURE-----
Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
Pull writeback fix from Wu Fengguang:
"Fix data corruption on NFS writeback.
It has been in linux-next for one month"
* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Fix data corruption on NFS
Well I don't get the same warning locally as the kbuild
robot, but I guess this should fix the problem, anyway.
Here is the warning:
head: 2d9e72303d
commit: ee2411a8db [19/20] GFS2: Clean up quota slot allocation
config: make ARCH=powerpc allmodconfig
All error/warnings:
fs/gfs2/quota.c: In function 'gfs2_quota_init':
>> fs/gfs2/quota.c:1246:3: error: implicit declaration of function '__vmalloc' [-Werror=implicit-function-declaration]
sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
^
>> fs/gfs2/quota.c:1246:24: warning: assignment makes pointer from integer without a cast [enabled by default]
sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
^
fs/gfs2/quota.c: In function 'gfs2_quota_cleanup':
>> fs/gfs2/quota.c:1361:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
vfree(sdp->sd_quota_bitmap);
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There is a bug in the function nilfs_segctor_collect, which results in
active data being written to a segment, that is marked as clean. It is
possible, that this segment is selected for a later segment
construction, whereby the old data is overwritten.
The problem shows itself with the following kernel log message:
nilfs_sufile_do_cancel_free: segment 6533 must be clean
Usually a few hours later the file system gets corrupted:
NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)
The issue can be reproduced with a file system that is nearly full and
with the cleaner running, while some IO intensive task is running.
Although it is quite hard to reproduce.
This is what happens:
1. The cleaner starts the segment construction
2. nilfs_segctor_collect is called
3. sc_stage is on NILFS_ST_SUFILE and segments are freed
4. sc_stage is on NILFS_ST_DAT current segment is full
5. nilfs_segctor_extend_segments is called, which
allocates a new segment
6. The new segment is one of the segments freed in step 3
7. nilfs_sufile_cancel_freev is called and produces an error message
8. Loop around and the collection starts again
9. sc_stage is on NILFS_ST_SUFILE and segments are freed
including the newly allocated segment, which will contain active
data and can be allocated at a later time
10. A few hours later another segment construction allocates the
segment and causes file system corruption
This can be prevented by simply reordering the statements. If
nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
the freed segments are marked as dirty and cannot be allocated any more.
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gradually, the global qd_lock is being used for less and less.
After this patch it will only be used for the per super block
list whose purpose is to allow syncing of changes back to the
master quota file from the local quota changes file. Fixing
up that process to make it more efficient will be the subject
of a later patch, however this patch removes another barrier
to doing that.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
Quota slot allocation has historically used a vector of pages
and a set of homegrown find/test/set/clear bit functions. Since
the size of the bitmap is likely to be based on the default
qc file size, thats a couple of pages at most. So we ought
to be able to allocate that as a single chunk, with a vmalloc
fallback, just in case of memory fragmentation.
We are then able to use the kernel's own find/test/set/clear
bit functions, rather than rolling our own.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
While investigating a rather strange bit of code in the quota
clean up function, I spotted that the reason for its existence
was that when remounting read only, we were not stopping the
quotad thread, and thus it was possible for it to still have
a reference to some of the quotas in that case.
This patch moves the logd and quota thread start and stop into
the make_fs_rw/ro functions, so that we now stop those threads
when mounted read only.
This means that quotad will always be stopped before we call
the quota clean up function, and we can thus dispose of the
(rather hackish) code that waits for it to give up its
reference on the quotas.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
Prior to this patch, GFS2 kept all the quotas for each
super block in a single linked list. This is rather slow
when there are large numbers of quotas.
This patch introduces a hlist_bl based hash table, similar
to the one used for glocks. The initial look up of the quota
is now lockless in the case where it is already cached,
although we still have to take the per quota spinlock in
order to bump the ref count. Either way though, this is a
big improvement on what was there before.
The qd_lock and the per super block list is preserved, for
the time being. However it is intended that since this is no
longer used for its original role, it should be possible to
shrink the number of items on that list in due course and
remove the requirement to take qd_lock in qd_get.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
We recently fixed the writeback of pages prior to performing
direct i/o, however the initial fix was perhaps a bit heavy
handed. There is no need to invalidate pages if the direct i/o
is only a read, since they will be identical to what has been
flushed to disk anyway.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When kernfs_seq_start() fails to obtain an active reference, it
returns ERR_PTR(-ENODEV). kernfs_seq_stop() is then invoked with the
error pointer value; however, it still proceeds to invoke
kernfs_put_active() on the node leading to unbalanced put.
If kernfs_seq_stop() is called even after active ref failure, it
should skip invocation of @ops->seq_stop() and put_active.
Unfortunately, this is a bit complicated because active ref failure
isn't the only thing which may fail with ERR_PTR(-ENODEV).
@ops->seq_start/next() may also fail with the error value and
kernfs_seq_stop() doesn't have a way to tell apart those failures.
Work it around by factoring out the active part of kernfs_seq_stop()
into kernfs_seq_stop_active() and invoking it directly if
@ops->seq_start/next() fail with ERR_PTR(-ENODEV) and updating
kernfs_seq_stop() to skip kernfs_seq_stop_active() on
ERR_PTR(-ENODEV). This is a bit nasty but ensures that the active put
is skipped iff get_active failed in kernfs_seq_start().
tj: This was originally committed as d92d2e6bd7 but got reverted by
683bb2761f along with other kernfs self removal patches.
However, this one is an independent fix and shouldn't have been
reverted together. Reinstate the change. Sorry about the mess.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Support for f2fs-tools/tools/f2stat to monitor
/sys/kernel/debug/f2fs/status
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
With the 2 previous changes, all the long time operations are moved out
of the protection region, so here we can use spinlock rather than mutex
(orphan_inode_mutex) for lower overhead.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Move alloc new orphan node out of lock protection region.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
"boo sync" parameter is never referenced in f2fs_wait_on_page_writeback.
We should remove this parameter.
Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This reverts commit d92d2e6bd7.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit ea1c472dfe.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit a69d001cfc.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit ae34372eb8.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 45a140e587.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Make sure to properly invalidate the pagecache before performing direct I/O,
so that no stale pages are left around. This matches what the generic
direct I/O code does. Also take the i_mutex over the direct write submission
to avoid the lifelock vs truncate waiting for i_dio_count to decrease, and
to avoid having the pagecache easily repopulated while direct I/O is in
progrss. Again matching the generic direct I/O code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We'll need the i_mutex to prevent i_dio_count from incrementing while
truncate is waiting for it to reach zero, and protects against having
the pagecache repopulated after we flushed it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Simple code cleanup to prepare for later fixes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Simple code cleanup to prepare for later fixes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
i_dio_count is used to protect dio access against truncate. We want
to make sure there are no dio reads pending either when doing a
truncate. I suspect on plain NFS things might work even without
this, but once we use a pnfs layout driver that access backing devices
directly things will go bad without the proper synchronization.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We need to have the I/O fully finished before telling the truncate code
that we are done.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs_file_direct_write only updates the inode size if it succeeded and
returned the number of bytes written. But in the AIO case nfs_direct_wait
turns the return value into -EIOCBQUEUED and we skip the size update.
Instead the aio completion path should updated it, which this patch
does. The implementation is a little hacky because there is no obvious
way to find out we are called for a write in nfs_direct_complete.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Don't check for -NFS4ERR_NOTSUPP, it's already been mapped to -ENOTSUPP
by nfs4_stat_to_errno.
This allows the client to mount v4.1 servers that don't support
SECINFO_NO_NAME by falling back to the "guess and check" method of
nfs4_find_root_sec.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org # 3.1+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This reverts commit f601f9a2bf.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 99177a3411.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 895a068a52.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 9f010c2ad5.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 1ae06819c7.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit d1ba277e79.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 88533f990c.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
nfs4_write_inode() must not be allowed to exit until the layoutcommit
is done. That means that both NFS_INO_LAYOUTCOMMIT and
NFS_INO_LAYOUTCOMMITTING have to be cleared.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a LAYOUTCOMMIT is outstanding, then chances are that the metadata
server may still be returning incorrect values for the change attribute,
ctime, mtime and/or size.
Just ignore those attributes for now, and wait for the LAYOUTCOMMIT
rpc call to finish.
Reported-by: shaobingqing <shaobingqing@bwstor.com.cn>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
895a068a52 ("kernfs: make kernfs_get_active() block if the node is
deactivated but not removed") added "struct kernfs_root *root =
kernfs_root(kn);" at the head of the function; however, the parameter
@kn is checked for later implying that the function may be called with
NULL. This means that we may end up invoking kernfs_root() with NULL
which will oops. None of the existing users invokes removal with NULL
@kn, so this bug doesn't actually trigger.
We can relocate kernfs_root() invocation after NULL check; however,
allowing NULL param tends to cause more confusion than actually
helping anything. As there's no existing user, let's remove the
spurious NULL check.
This bug was detected by smatch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
All device_schedule_callback_owner() users are converted to use
device_remove_file_self(). Remove now unused
{sysfs|device}_schedule_callback_owner().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
- fix off-by-one in xfs_attr3_rmt_verify
- fix missing destroy_work_on_stack() in xfs_bmapi_allocate
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJS0ECWAAoJENaLyazVq6ZOgn0QAKSC/pkP4km+QbmL0R7SqSJH
ZSSj16gIjR5lHlwI3PQzv5BgyEC9BcRDKWXN6dy+GHHuMtP4qYK8cLWFcyl7EysH
HAyDBnaJVphXt23C5iIzk+iseNfRYXA2LOpYSH6qfhZ5bxEeYzQS42zL4YhxZrXq
kzLHojcTLUx0IzJ+4oHn5AXSgPt+PXxNz3s+TU9virFnfSMlw2qYukxQtG49nbQr
kQjNHgeTIBKzeHdlnxmv5Rd2bD//397w5aWXxmaUh8fk6Z7VJi40ALAG4Pks81HF
+TEgMtF9/xTXdlwrYJDoHp++vUs6HANCX+wSAb4MdrBQvjh/USytK2WFwOeMyyR6
L/iogfPXHHizTkoYSzPwPdEmCCFhzidvBEqNX68+ojlJnDtoart7IgkOcm9LvaQI
j//u76CPRcd8tFh+1fDNaXn1ykJ6/CepSY13/yOnbpc7JoDbtqK2R8HFxdSlkDDg
UooLF2AfQ6lX280cUWwV0flqGO6iTIM3Fw1mIq3z8X4usNn+bMnlOu/DUnCbF5bB
YJCV4uT7f04w7oJqin9a7LHaHKRD56tWQun/OCEd7ZV/hJ1YRYlhhLfSdWdX7+SX
oIawXJy7NvCPQLaTwycD3h2gDlaxw17GAc9rA3AcCknxBsgNosv1ETQnEPC4iIAq
QsVal7p6oMLZ/qx6mvX7
=Xpq3
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:
"Here we have a bugfix for an off-by-one in the remote attribute
verifier that results in a forced shutdown which you can hit with v5
superblock by creating a 64k xattr, and a fix for a missing
destroy_work_on_stack() in the allocation worker.
It's a bit late, but they are both fairly straightforward"
* tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs:
xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
xfs: fix off-by-one error in xfs_attr3_rmt_verify
Sometimes it's necessary to implement a node which wants to delete
nodes including itself. This isn't straightforward because of kernfs
active reference. While a file operation is in progress, an active
reference is held and kernfs_remove() waits for all such references to
drain before completing. For a self-deleting node, this is a deadlock
as kernfs_remove() ends up waiting for an active reference that itself
is sitting on top of.
This currently is worked around in the sysfs layer using
sysfs_schedule_callback() which makes such removals asynchronous.
While it works, it's rather cumbersome and inherently breaks
synchronicity of the operation - the file operation which triggered
the operation may complete before the removal is finished (or even
started) and the removal may fail asynchronously. If a removal
operation is immmediately followed by another operation which expects
the specific name to be available (e.g. removal followed by rename
onto the same name), there's no way to make the latter operation
reliable.
The thing is there's no inherent reason for this to be asynchrnous.
All that's necessary to do this synchronous is a dedicated operation
which drops its own active ref and deactivates self. This patch
implements kernfs_remove_self() and its wrappers in sysfs and driver
core. kernfs_remove_self() is to be called from one of the file
operations, drops the active ref and deactivates using
__kernfs_deactivate_self(), removes the self node, and restores active
ref to the dead node using __kernfs_reactivate_self() so that the ref
is balanced afterwards. __kernfs_remove() is updated so that it takes
an early exit if the target node is already fully removed so that the
active ref restored by kernfs_remove_self() after removal doesn't
confuse the deactivation path.
This makes implementing self-deleting nodes very easy. The normal
removal path doesn't even need to be changed to use
kernfs_remove_self() for the self-deleting node. The method can
invoke kernfs_remove_self() on itself before proceeding the normal
removal path. kernfs_remove() invoked on the node by the normal
deletion path will simply be ignored.
This will replace sysfs_schedule_callback(). A subtle feature of
sysfs_schedule_callback() is that it collapses multiple invocations -
even if multiple removals are triggered, the removal callback is run
only once. An equivalent effect can be achieved by testing the return
value of kernfs_remove_self() - only the one which gets %true return
value should proceed with actual deletion. All other instances of
kernfs_remove_self() will wait till the enclosing kernfs operation
which invoked the winning instance of kernfs_remove_self() finishes
and then return %false. This trivially makes all users of
kernfs_remove_self() automatically show correct synchronous behavior
even when there are multiple concurrent operations - all "echo 1 >
delete" instances will finish only after the whole operation is
completed by one of the instances.
v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing
and sysfs_remove_file_self() had incorrect return type. Fix it.
Reported by kbuild test bot.
v3: Updated to use __kernfs_{de|re}activate_self().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch implements four functions to manipulate deactivation state
- deactivate, reactivate and the _self suffixed pair. A new fields
kernfs_node->deact_depth is added so that concurrent and nested
deactivations are handled properly. kernfs_node->hash is moved so
that it's paired with the new field so that it doesn't increase the
size of kernfs_node.
A kernfs user's lock would normally nest inside active ref but during
removal the user may want to perform kernfs_remove() while holding the
said lock, which would introduce a reverse locking dependency. This
function can be used to break such reverse dependency by allowing
deactivation step to performed separately outside user's critical
section.
This will also be used implement kernfs_remove_self().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, kernfs_get_active() fails if the target node is
deactivated. This is fine as a node always gets removed after
deactivation; however, we're gonna add reactivation so the assumption
won't hold. It'd be incorrect for kernfs_get_active() to fail for a
node which was deactivated only temporarily.
This patch makes kernfs_get_active() block if the node is deactivated
but not removed. If the node gets reactivated (not yet implemented),
it will be retried and succeed. If the node gets removed, it will be
woken up and fail.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
added because there were operations which should be performed outside
kernfs_mutex after adding and removing kernfs_nodes. The necessary
operations were recorded in kernfs_addrm_cxt and performed by
kernfs_addrm_finish(); however, after the recent changes which
relocated deactivation and unmapping so that they're performed
directly during removal, the only operation kernfs_addrm_finish()
performs is kernfs_put(), which can be moved inside the removal path
too.
This patch moves the kernfs_put() of the base ref to __kernfs_remove()
and remove kernfs_addrm_cxt and kernfs_addrm_start/finish().
* kernfs_add_one() is updated to grab and release the parent's active
ref and kernfs_mutex itself. kernfs_get/put_active() and
kernfs_addrm_start/finish() invocations around it are removed from
all users.
* __kernfs_remove() puts an unlinked node directly instead of chaining
it to kernfs_addrm_cxt. Its callers are updated to grab and release
kernfs_mutex instead of calling kernfs_addrm_start/finish() around
it.
v2: Updated to fit the v2 restructuring of removal path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_unmap_bin_file() is supposed to unmap all memory mappings of
the target file before kernfs_remove() finishes; however, it currently
is being called from kernfs_addrm_finish() and has the same race
problem as the original implementation of deactivation when there are
multiple removers - only the remover which snatches the node to its
addrm_cxt->removed list is guaranteed to wait for its completion
before returning.
It can be fixed by moving kernfs_unmap_bin_file() invocation from
kernfs_addrm_finish() to __kernfs_remove(). The function may be
called multiple times but that shouldn't do any harm.
We end up dropping kernfs_mutex in the removal loop and the node may
be removed inbetween by someone else. kernfs_unlink_sibling() is
updated to test whether the node has already been removed and return
accordingly. __kernfs_remove() in turn performs post-unlinking
cleanup only if it actually unlinked the node.
KERNFS_HAS_MMAP test is moved out of the unmap function into
__kernfs_remove() so that we don't unlock kernfs_mutex unnecessarily.
While at it, drop the now meaningless "bin" qualifier from the
function name.
v2: Rewritten to fit the v2 restructuring of removal path. HAS_MMAP
test relocated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The recursive nature of kernfs_remove() means that, even if
kernfs_remove() is not allowed to be called multiple times on the same
node, there may be race conditions between removal of parent and its
descendants. While we can claim that kernfs_remove() shouldn't be
called on one of the descendants while the removal of an ancestor is
in progress, such rule is unnecessarily restrictive and very difficult
to enforce. It's better to simply allow invoking kernfs_remove() as
the caller sees fit as long as the caller ensures that the node is
accessible.
The current behavior in such situations is broken. Whoever enters
removal path first takes the node off the hierarchy and then
deactivates. Following removers either return as soon as it notices
that it's not the first one or can't even find the target node as it
has already been removed from the hierarchy. In both cases, the
following removers may finish prematurely while the nodes which should
be removed and drained are still being processed by the first one.
This patch restructures so that multiple removers, whether through
recursion or direction invocation, always follow the following rules.
* When there are multiple concurrent removers, only one puts the base
ref.
* Regardless of which one puts the base ref, all removers are blocked
until the target node is fully deactivated and removed.
To achieve the above, removal path now first deactivates the subtree,
drains it and then unlinks one-by-one. __kernfs_deactivate() is
called directly from __kernfs_removal() and drops and regrabs
kernfs_mutex for each descendant to drain active refs. As this means
that multiple removers can enter __kernfs_deactivate() for the same
node, the function is updated so that it can handle multiple
deactivators of the same node - only one actually deactivates but all
wait till drain completion.
The restructured removal path guarantees that a removed node gets
unlinked only after the node is deactivated and drained. Combined
with proper multiple deactivator handling, this guarantees that any
invocation of kernfs_remove() returns only after the node itself and
all its descendants are deactivated, drained and removed.
v2: Draining separated into a separate loop (used to be in the same
loop as unlink) and done from __kernfs_deactivate(). This is to
allow exposing deactivation as a separate interface later.
Root node removal was broken in v1 patch. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
KERNFS_REMOVED is used to mark half-initialized and dying nodes so
that they don't show up in lookups and deny adding new nodes under or
renaming it; however, its role overlaps those of deactivation and
removal from rbtree.
It's necessary to deny addition of new children while removal is in
progress; however, this role considerably intersects with deactivation
- KERNFS_REMOVED prevents new children while deactivation prevents new
file operations. There's no reason to have them separate making
things more complex than necessary.
KERNFS_REMOVED is also used to decide whether a node is still visible
to vfs layer, which is rather redundant as equivalent determination
can be made by testing whether the node is on its parent's children
rbtree or not.
This patch removes KERNFS_REMOVED.
* Instead of KERNFS_REMOVED, each node now starts its life
deactivated. This means that we now use both atomic_add() and
atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler
generates an overflow warnings when negating INT_MIN as the negation
can't be represented as a positive number. Nothing is actually
broken but let's bump BIAS by one to avoid the warnings for archs
which negates the subtrahend..
* KERNFS_REMOVED tests in add and rename paths are replaced with
kernfs_get/put_active() of the target nodes. Due to the way the add
path is structured now, active ref handling is done in the callers
of kernfs_add_one(). This will be consolidated up later.
* kernfs_remove_one() is updated to deactivate instead of setting
KERNFS_REMOVED. This removes deactivation from kernfs_deactivate(),
which is now renamed to kernfs_drain().
* kernfs_dop_revalidate() now tests RB_EMPTY_NODE(&kn->rb) instead of
KERNFS_REMOVED and KERNFS_REMOVED test in kernfs_dir_pos() is
dropped. A node which is removed from the children rbtree is not
included in the iteration in the first place. This means that a
node may be visible through vfs a bit longer - it's now also visible
after deactivation until the actual removal. This slightly enlarged
window difference doesn't make any difference to the userland.
* Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
checks on the active ref.
* Some comment style updates in the affected area.
v2: Reordered before removal path restructuring. kernfs_active()
dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE()
used in the lookup paths.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There currently are two mechanisms gating active ref lockdep
annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
The former disables lockdep annotations in kernfs_get/put_active()
while the latter disables all of kernfs_deactivate().
While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
deactivation step for non-file nodes, the benefit is marginal and it
needlessly diverges code paths. Let's drop KERNFS_ACTIVE_REF and use
KERNFS_LOCKDEP in kernfs_deactivate() too.
While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
flag so that it's more convenient and the related code can be compiled
out when not enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernfs_node->u.completion is used to notify deactivation completion
from kernfs_put_active() to kernfs_deactivate(). We now allow
multiple racing removals of the same node and the current removal
scheme is no longer correct - kernfs_remove() invocation may return
before the node is properly deactivated if it races against another
removal. The removal path will be restructured to address the issue.
To help such restructure which requires supporting multiple waiters,
this patch replaces kernfs_node->u.completion with
kernfs_root->deactivate_waitq. This makes deactivation event
notifications share a per-root waitqueue_head; however, the wait path
is quite cold and this will also allow shaving one pointer off
kernfs_node.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When kernfs_seq_start() fails to obtain an active reference, it
returns ERR_PTR(-ENODEV). kernfs_seq_stop() is then invoked with the
error pointer value; however, it still proceeds to invoke
kernfs_put_active() on the node leading to unbalanced put.
If kernfs_seq_stop() is called even after active ref failure, it
should skip invocation of @ops->seq_stop() and put_active.
Unfortunately, this is a bit complicated because active ref failure
isn't the only thing which may fail with ERR_PTR(-ENODEV).
@ops->seq_start/next() may also fail with the error value and
kernfs_seq_stop() doesn't have a way to tell apart those failures.
Work it around by factoring out the active part of kernfs_seq_stop()
into kernfs_seq_stop_active() and invoking it directly if
@ops->seq_start/next() fail with ERR_PTR(-ENODEV) and updating
kernfs_seq_stop() to skip kernfs_seq_stop_active() on
ERR_PTR(-ENODEV). This is a bit nasty but ensures that the active put
is skipped iff get_active failed in kernfs_seq_start().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to
call destroy_work_on_stack() which frees the debug object to pair
with INIT_WORK_ONSTACK().
Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 6f96b3063c)
With CRC check is enabled, if trying to set an attributes value just
equal to the maximum size of XATTR_SIZE_MAX would cause the v3 remote
attr write verification procedure failure, which would yield the back
trace like below:
<snip>
XFS (sda7): Internal error xfs_attr3_rmt_write_verify at line 191 of file fs/xfs/xfs_attr_remote.c
<snip>
Call Trace:
[<ffffffff816f0042>] dump_stack+0x45/0x56
[<ffffffffa0d99c8b>] xfs_error_report+0x3b/0x40 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d99ce5>] xfs_corruption_error+0x55/0x80 [xfs]
[<ffffffffa0dbef6b>] xfs_attr3_rmt_write_verify+0x14b/0x1a0 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d96edd>] _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffff81184cda>] ? vm_map_ram+0x31a/0x460
[<ffffffff81097230>] ? wake_up_state+0x20/0x20
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d9726b>] xfs_buf_iorequest+0x6b/0xc0 [xfs]
[<ffffffffa0d97315>] xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d97906>] xfs_bwrite+0x46/0x80 [xfs]
[<ffffffffa0dbfa94>] xfs_attr_rmtval_set+0x334/0x490 [xfs]
[<ffffffffa0db84aa>] xfs_attr_leaf_addname+0x24a/0x410 [xfs]
[<ffffffffa0db8893>] xfs_attr_set_int+0x223/0x470 [xfs]
[<ffffffffa0db8b76>] xfs_attr_set+0x96/0xb0 [xfs]
[<ffffffffa0db13b2>] xfs_xattr_set+0x42/0x70 [xfs]
[<ffffffff811df9b2>] generic_setxattr+0x62/0x80
[<ffffffff811e0213>] __vfs_setxattr_noperm+0x63/0x1b0
[<ffffffff81307afe>] ? evm_inode_setxattr+0xe/0x10
[<ffffffff811e0415>] vfs_setxattr+0xb5/0xc0
[<ffffffff811e054e>] setxattr+0x12e/0x1c0
[<ffffffff811c6e82>] ? final_putname+0x22/0x50
[<ffffffff811c708b>] ? putname+0x2b/0x40
[<ffffffff811cc4bf>] ? user_path_at_empty+0x5f/0x90
[<ffffffff811bdfd9>] ? __sb_start_write+0x49/0xe0
[<ffffffff81168589>] ? vm_mmap_pgoff+0x99/0xc0
[<ffffffff811e07df>] SyS_setxattr+0x8f/0xe0
[<ffffffff81700c2d>] system_call_fastpath+0x1a/0x1f
Tests:
setfattr -n user.longxattr -v `perl -e 'print "A"x65536'` testfile
This patch fix it to check the remote EA size is greater than the
XATTR_SIZE_MAX rather than more than or equal to it, because it's
valid if the specified EA value size is equal to the limitation as
per VFS setxattr interface.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 85dd0707f0)
- Add cache=mmap option
- Make mmap read-write while keeping it as synchronous as possible
- Build writeback fid on mmap creation if it is writable
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
A set of fixes which makes sure we are taking the ilock whenever accessing the
extent list. This was associated with "Access to block zero" messages which
may result in extent list corruption.
In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to
call destroy_work_on_stack() which frees the debug object to pair
with INIT_WORK_ONSTACK().
Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Spotted by Andy Price. This should fix the odd messages from
lockdep caused by 70d4ee94b3
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Price <anprice@redhat.com>
In btrfs_end_bio(), we increment bi_remaining if is_orig_bio. If not,
we restore the orig_bio but failed to increment bi_remaining for
orig_bio, which triggers a BUG_ON later when bio_endio is called. Fix
is to increment bi_remaining when we restore the orig bio as well.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
CC: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Muthukumar Ratty <muthur@gmail.com>
Reviewed-by: Chris Mason <clm@fb.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In bio_endio if bio doesn't have bi_end_io (should be an error case),
we set bio to NULL and continue silently without freeing the bio. It
would be good to have a WARN and free the bio to avoid memory leak.
Signed-off-by: Muthukumar Ratty <muthur@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch adds four new fields to directory leaf blocks.
The intent is not to use them in the kernel itself, although
perhaps we may be able to use them as hints at some later date,
but instead to provide more information for debug/fsck use.
One new field adds a pointer to the inode to which the leaf
belongs. This can be useful if the pointer to the leaf block
has become corrupt, as it will allow us to know which inode
this block should be associated with. This field is set when
the leaf is created and never changed over its lifetime.
The second field is a "distance from the hash table" field.
The meaning is as follows:
0 = An old leaf in which this value has not been set
1 = This leaf is pointed to directly from the hash table
2+ = This leaf is part of a chain, pointed to by another leaf
block, the value gives the position in the chain.
The third and fourth fields combine to give a time stamp of
the most recent directory insertion or deletion from this
leaf block. The time stamp is not updated when a new leaf
block is chained from the current one. The code is currently
written such that the timestamp on the dir inode will match
that of the leaf block for the most recent insertion/deletion.
For backwards compatibility, any of these new fields which is
zero should be considered to be "unknown".
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
For most cases, only a single new block is needed when we reach
the point of converting from stuffed to exhash directory. The
exception being when the file name is so long that it will not
fit within the new leaf block.
So this patch adds a simple test for that situation so that we
do not need to request the full reservation size in this case.
Potentially we could calculate more accurately the value to use
in other cases too, but that is much more complicated to do and
it is doubtful that the benefit would outweigh the extra cost
in code complexity.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Previously during SSR and GC, the maximum number of retrials to find a victim
segment was hard-coded by MAX_VICTIM_SEARCH, 4096 by default.
This number makes an effect on IO locality, when SSR mode is activated, which
results in performance fluctuation on some low-end devices.
If max_victim_search = 4, the victim will be searched like below.
("D" represents a dirty segment, and "*" indicates a selected victim segment.)
D1 D2 D3 D4 D5 D6 D7 D8 D9
[ * ]
[ * ]
[ * ]
[ ....]
This patch adds a sysfs entry to control the number dynamically through:
/sys/fs/f2fs/$dev/max_victim_search
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When considering a bunch of data writes with very frequent fsync calls, we
are able to think the following performance regression.
N: Node IO, D: Data IO, IO scheduler: cfq
Issue pending IOs
D1 D2 D3 D4
D1 D2 D3 D4 N1
D2 D3 D4 N1 N2
N1 D3 D4 N2 D1
--> N1 can be selected by cfq becase of the same priority of N and D.
Then D3 and D4 would be delayed, resuling in performance degradation.
So, when processing the fsync call, it'd better give higher priority to data IOs
than node IOs by assigning WRITE and WRITE_SYNC respectively.
This patch improves the random wirte performance with frequent fsync calls by up
to 10%.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
As a temporary fix, nfsd was breaking all leases on unlink, link,
rename, and setattr.
Now that we can distinguish between leases and delegations, we can be
nicer and break only the delegations, and not bother lease-holders with
operations they don't care about.
And we get to delete some code while we're at it.
Note that in the presence of delegations the vfs calls here all return
-EWOULDBLOCK instead of blocking, so nfsd threads will not get stuck
waiting for delegation returns.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is harmless, since ext4_walk_page_buffers only passes the handle
onto the callback function, and in this call site the function in
question, bput_one(), doesn't actually use the handle. But there's no
point passing in an invalid handle, and it creates a Coverity warning,
so let's just clean it up.
Addresses-Coverity-Id: #1091168
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
A missing cast means that when we are truncating a file which is less
than 60 bytes, we don't clear the correct area of memory, and in fact
we can end up truncating the next inode in the inode table, or worse
yet, some other kernel data structure.
Addresses-Coverity-Id: #751987
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
This patch fixed several typo in printk from various
part of kernel source.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This patch calls get_write_access in function gfs2_setattr_chown,
which merely increases inode->i_writecount for the duration of the
function. That will ensure that any file closes won't delete the
inode's multi-block reservation while the function is running.
It also ensures that a multi-block reservation exists when needed
for quota change operations during the chown.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If failed after calling alloc_session but before init_session, nfsd will call __free_session to
free se_slots in session. But, session->se_fchannel.maxreqs is not initialized (value is zero).
So that, the memory malloced for slots will be lost in free_session_slots for maxreqs is zero.
This path sets the information for channel in alloc_session after mallocing slots succeed,
instead in init_session.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
With CRC check is enabled, if trying to set an attributes value just
equal to the maximum size of XATTR_SIZE_MAX would cause the v3 remote
attr write verification procedure failure, which would yield the back
trace like below:
<snip>
XFS (sda7): Internal error xfs_attr3_rmt_write_verify at line 191 of file fs/xfs/xfs_attr_remote.c
<snip>
Call Trace:
[<ffffffff816f0042>] dump_stack+0x45/0x56
[<ffffffffa0d99c8b>] xfs_error_report+0x3b/0x40 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d99ce5>] xfs_corruption_error+0x55/0x80 [xfs]
[<ffffffffa0dbef6b>] xfs_attr3_rmt_write_verify+0x14b/0x1a0 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d96edd>] _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffff81184cda>] ? vm_map_ram+0x31a/0x460
[<ffffffff81097230>] ? wake_up_state+0x20/0x20
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d9726b>] xfs_buf_iorequest+0x6b/0xc0 [xfs]
[<ffffffffa0d97315>] xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d97906>] xfs_bwrite+0x46/0x80 [xfs]
[<ffffffffa0dbfa94>] xfs_attr_rmtval_set+0x334/0x490 [xfs]
[<ffffffffa0db84aa>] xfs_attr_leaf_addname+0x24a/0x410 [xfs]
[<ffffffffa0db8893>] xfs_attr_set_int+0x223/0x470 [xfs]
[<ffffffffa0db8b76>] xfs_attr_set+0x96/0xb0 [xfs]
[<ffffffffa0db13b2>] xfs_xattr_set+0x42/0x70 [xfs]
[<ffffffff811df9b2>] generic_setxattr+0x62/0x80
[<ffffffff811e0213>] __vfs_setxattr_noperm+0x63/0x1b0
[<ffffffff81307afe>] ? evm_inode_setxattr+0xe/0x10
[<ffffffff811e0415>] vfs_setxattr+0xb5/0xc0
[<ffffffff811e054e>] setxattr+0x12e/0x1c0
[<ffffffff811c6e82>] ? final_putname+0x22/0x50
[<ffffffff811c708b>] ? putname+0x2b/0x40
[<ffffffff811cc4bf>] ? user_path_at_empty+0x5f/0x90
[<ffffffff811bdfd9>] ? __sb_start_write+0x49/0xe0
[<ffffffff81168589>] ? vm_mmap_pgoff+0x99/0xc0
[<ffffffff811e07df>] SyS_setxattr+0x8f/0xe0
[<ffffffff81700c2d>] system_call_fastpath+0x1a/0x1f
Tests:
setfattr -n user.longxattr -v `perl -e 'print "A"x65536'` testfile
This patch fix it to check the remote EA size is greater than the
XATTR_SIZE_MAX rather than more than or equal to it, because it's
valid if the specified EA value size is equal to the limitation as
per VFS setxattr interface.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Can be reproduced by xfstests 62 with bigalloc and 128bit size inode.
Signed-off-by: Yongqiang Yang <yangyongqiang01@baidu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Use the new %pd printk() specifier in Ext4 to replace passing of
dentry name or dentry name and name length * 2 with just passing the
dentry.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
cc: Andreas Dilger <adilger.kernel@dilger.ca>
cc: linux-ext4@vger.kernel.org
The function has a bit non-standard (for ext4) error recovery in that it
used a mix of 'out' labels and testing for 'handle' being NULL. There
isn't a good reason for that in the function so clean it up a bit.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Similarly as other ->write_begin functions in ext4, also
ext4_da_write_inline_data_begin() should retry allocation if the
conversion failed because of ENOSPC. This avoids returning ENOSPC
prematurely because of uncommitted block deletions.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
After applied this commit (d23142c6), ext4 has supported punch hole for
a file system with bigalloc feature. But we forgot to enable it. This
commit fixes it.
Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit f5a44db5d2 introduced a regression on filesystems created with
the bigalloc feature (cluster size > blocksize). It causes xfstests
generic/006 and /013 to fail with an unexpected JBD2 failure and
transaction abort that leaves the test file system in a read only state.
Other xfstests run on bigalloc file systems are likely to fail as well.
The cause is the accidental use of a cluster mask where a cluster
offset was needed in ext4_ext_map_blocks().
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Without CONFIG_NFSD_V3, compile will get warning as,
fs/nfsd/nfssvc.c: In function 'nfsd_svc':
>> fs/nfsd/nfssvc.c:246:60: warning: array subscript is above array bounds [-Warray-bounds]
return (nfsd_versions[2] != NULL) || (nfsd_versions[3] != NULL);
^
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When we look to see if there is enough space to add a dir
entry without allocation, we have then been repeating the
same search later when we do the actual insertion. This
patch caches the details of the location in the gfs2_diradd
structure, so that we do not have to repeat the search.
This will provide a performance improvement which will be
greater as the size of the directory increases.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There are three cases where we need to calculate the number of
blocks to reserve in a transaction involving linking an inode
into a directory. The one in rename is a bit more complicated,
but the basis of it is the same as for link and create. So it
makes sense to move this calculation into a single function
rather than repeating it three times.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The intent is that this structure will hold the information
required when adding entries to a directory (linking). To
start with, it will contain only the number of blocks which
are required to link the new entry into the directory. The
current calculation returns either 0 or the maximim number of
blocks that can ever be requested by such a transaction.
The intent is that in a later patch, we can update the dir
code to calculate this value more accurately. In addition
further patches will also add further fields to the new
structure to increase its utility.
In addition this patch fixes a bug where the link used during
inode creation was adding requesting too many blocks in
some cases. This is harmless unless the fs is close to being
full.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Here is a case which could read inline page data not from first page.
1. write inline data
2. lseek to offset 4096
3. read 4096 bytes from offset 4096
(read_inline_data read inline data page to non-first page,
And previously VFS has add this page to page cache)
4. ftruncate offset 8192
5. read 4096 bytes from offset 4096
(we meet this updated page with inline data in cache)
So we should leave this page with inited data and uptodate flag
for this case.
Change log from v1:
o fix a deadlock bug
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Change log from v1:
o reduce unneeded memset in __f2fs_convert_inline_data
>From 58796be2bd2becbe8d52305210fb2a64e7dd80b6 Mon Sep 17 00:00:00 2001
From: Chao Yu <chao2.yu@samsung.com>
Date: Mon, 30 Dec 2013 09:21:33 +0800
Subject: [PATCH] f2fs: avoid to left uninitialized data in page when read
inline data
We left uninitialized data in the tail of page when we read an inline data
page. So let's initialize left part of the page excluding inline data region.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The truncate_partial_nodes puts pages incorrectly in the following two cases.
Note that the value for argc 'depth' can only be 2 or 3.
Please see truncate_inode_blocks() and truncate_partial_nodes().
1) An err is occurred in the first 'for' loop
When err is occurred with depth = 2, pages[0] is invalid, so this page doesn't
need to be put. There is no problem, however, when depth is 3, it doesn't put
the pages correctly where pages[0] is valid and pages[1] is invalid.
In this case, depth is set to 2 (ref to statemnt depth = i + 1), and then
'goto fail'.
In label 'fail', for (i = depth - 3; i >= 0; i--) cannot meet the condition
because i = -1, so pages[0] cann't be put.
2) An err happened in the second 'for' loop
Now we've got pages[0] with depth = 2, or we've got pages[0] and pages[1]
with depth = 3. When an err is detected, we need 'goto fail' to put such
the pages.
When depth is 2, in label 'fail', for (i = depth - 3; i >= 0; i--) cann't
meet the condition because i = -1, so pages[0] cann't be put.
When depth is 3, in label 'fail', for (i = depth - 3; i >= 0; i--) can
only put pages[0], pages[1] also cann't be put.
Note that 'depth' has been changed before first 'goto fail' (ref to statemnt
depth = i + 1), so passing this modified 'depth' to the tracepoint,
trace_f2fs_truncate_partial_nodes, is also incorrect.
Signed-off-by: Shifei Ge <shifei10.ge@samsung.com>
[Jaegeuk Kim: modify the description and fix one bug]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The get_dnode_of_data nullifies inode and node page when error is occurred.
There are two cases that passes inode page into get_dnode_of_data().
1. make_empty_dir()
-> get_new_data_page()
-> f2fs_reserve_block(ipage)
-> get_dnode_of_data()
2. f2fs_convert_inline_data()
-> __f2fs_convert_inline_data()
-> f2fs_reserve_block(ipage)
-> get_dnode_of_data()
This patch adds correct error handling codes when get_dnode_of_data() returns
an error.
At first, f2fs_reserve_block() calls f2fs_put_dnode() whenever reserve_new_block
returns an error.
So, the rule of f2fs_reserve_block() is to nullify inode page when there is any
error internally.
Finally, two callers of f2fs_reserve_block() should call f2fs_put_dnode()
appropriately if they got an error since successful f2fs_reserve_block().
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a inline_data recovery routine with the following policy.
[prev.] [next] of inline_data flag
o o -> recover inline_data
o x -> remove inline_data, and then recover data blocks
x o -> remove inline_data, and then recover inline_data
x x -> recover data blocks
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds the number of inline_data files into the status information.
Note that the number is reset whenever the filesystem is newly mounted.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Change log from v1:
o handle NULL pointer of grab_cache_page_write_begin() pointed by Chao Yu.
This patch refactors f2fs_convert_inline_data to check a couple of conditions
internally for deciding whether it needs to convert inline_data or not.
So, the new f2fs_convert_inline_data initially checks:
1) f2fs_has_inline_data(), and
2) the data size to be changed.
If the inode has inline_data but the size to fill is less than MAX_INLINE_DATA,
then we don't need to convert the inline_data with data allocation.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In f2fs_write_begin(), if f2fs_conver_inline_data() returns an error like
-ENOSPC, f2fs should call f2fs_put_page().
Otherwise, it is remained as a locked page, resulting in the following bug.
[<ffffffff8114657e>] sleep_on_page+0xe/0x20
[<ffffffff81146567>] __lock_page+0x67/0x70
[<ffffffff81157d08>] truncate_inode_pages_range+0x368/0x5d0
[<ffffffff81157ff5>] truncate_inode_pages+0x15/0x20
[<ffffffff8115804b>] truncate_pagecache+0x4b/0x70
[<ffffffff81158082>] truncate_setsize+0x12/0x20
[<ffffffffa02a1842>] f2fs_setattr+0x72/0x270 [f2fs]
[<ffffffff811cdae3>] notify_change+0x213/0x400
[<ffffffff811ab376>] do_truncate+0x66/0xa0
[<ffffffff811ab541>] vfs_truncate+0x191/0x1b0
[<ffffffff811ab5bc>] do_sys_truncate+0x5c/0xa0
[<ffffffff811ab78e>] SyS_truncate+0xe/0x10
[<ffffffff81756052>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In the punch_hole(), let's convert inline_data all the time for simplicity and
to avoid potential deadlock conditions.
It is pretty much not a big deal to do this.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
A fileid in NFS is a uint64. There are some occurrences where dprintk()
outputs a signed fileid. This leads to confusion and more difficult to
read debugging (negative fileids matching positive inode numbers).
Signed-off-by: Niels de Vos <ndevos@redhat.com>
CC: Santosh Pradhan <spradhan@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The correct way to check on IPV6_ADDR_SCOPE_LINKLOCAL is to check with
the ipv6_addr_src_scope function.
Currently this can't be work, because ipv6_addr_scope returns a int with
a mask of IPV6_ADDR_SCOPE_MASK (0x00f0U) and IPV6_ADDR_SCOPE_LINKLOCAL
is 0x02. So the condition is always false.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When starting without nfsv2 and nfsv3, nfsd does not need to start
lockd (and certainly doesn't need to fail because lockd failed to
register with the portmapper).
Reported-by: Gareth Williams <gareth@garethwilliams.me.uk>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NFSv4 clients can contact port 2049 directly instead of needing the
portmapper.
Therefore a failure to register to the portmapper when starting an
NFSv4-only server isn't really a problem.
But Gareth Williams reports that an attempt to start an NFSv4-only
server without starting portmap fails:
#rpc.nfsd -N 2 -N 3
rpc.nfsd: writing fd to kernel failed: errno 111 (Connection refused)
rpc.nfsd: unable to set any sockets for nfsd
Add a flag to svc_version to tell the rpc layer it can safely ignore an
rpcbind failure in the NFSv4-only case.
Reported-by: Gareth Williams <gareth@garethwilliams.me.uk>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
the length for backchannel checking should be multiplied by sizeof(__be32).
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
check_forechannel_attrs gets drc memory, so nfsd must put it when
check_backchannel_attrs fails.
After many requests with bad back channel attrs, nfsd will deny any
client's CREATE_SESSION forever.
A new test case named CSESS29 for pynfs will send in another mail.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
commit 5b6feee960 forgot
recording the back channel attrs in nfsd4_session.
nfsd just check the back channel attars by check_backchannel_attrs,
but do not record it in nfsd4_session in the latest kernel.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NULL return of kmem_cache_zalloc should be handled in jffs2_alloc_xattr_datum
and jff2_alloc_xattr_ref.
Signed-off-by: Zhouyi Zhou <yizhouzhou@ict.ac.cn>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Since defined in Linux-2.6.12-rc2, READTIME has not been used.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
host_err was only used for nfs4_acl_new.
This patch delete it, and return nfserr_jukebox directly.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Get rid of the extra code, using nfsd4_encode_noop for encoding destroy_session and free_stateid.
And, delete unused argument (fr_status) int nfsd4_free_stateid.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We should use XDR_LEN to calculate reserved space in case the oid is not
a multiple of 4.
RESERVE_SPACE actually rounds up for us, but it's probably better to be
careful here.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Prior to this patch, GFS2 had one address space for each rgrp,
stored in the glock. This patch changes them to use a single
address space in the super block. This therefore saves
(sizeof(struct address_space) * nr_of_rgrps) bytes of memory
and for large filesystems, that can be significant.
It would be nice to be able to do something similar and merge
the inode metadata address space into the same global
address space. However, that is rather more complicated as the
on-disk location doesn't have a 1:1 mapping with the inodes in
general. So while it could be done, it will be a more complicated
operation as it requires changing a lot more code paths.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Each rgrp header is represented as a single extent on disk, so we
can calculate the position within the address space, since we are
using address spaces mapped 1:1 to the disk. This means that it
is possible to use the range based versions of filemap_fdatawrite/wait
and for invalidating the page cache.
Our eventual intent is to then be able to merge the address spaces
used for rgrps into a single address space, rather than to have
one for each glock, saving memory and reducing complexity.
Since during umount, the rgrp structures are disposed of before
the glocks, we need to store the extent information in the glock
so that is is available for a final invalidation. This patch uses
a field which is otherwise unused in rgrp glocks to do that, so
that we do not have to expand the size of a glock.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since gfs2_inplace_reserve() is always called with a valid
alloc parms structure, there is no need to test for this
within the function itself - and in any case, after we've
all ready dereferenced it anyway.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There is only one place this is used, when reading in the quota
changes at mount time. It is not really required and much
simpler to just convert the fields from the on-disk structure
as required.
There should be no functional change as a result of this patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
For historical reasons, we drop and retake the log lock in ->releasepage()
however, since there is no reason why we cannot hold the log lock over
the whole function, this allows some simplification. In particular,
pinning a buffer is only ever done under the log lock, so it is possible
here to remove the test for pinned buffers in the second loop, since it
is impossible for that to happen (it is also tested in the first loop).
As a result, two tests made later in the second loop become constants
and can also be reduced to the only possible branch. So the net result
is to remove various bits of unreachable code and make this more
readable.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With the preceding patch, we started accepting block reservations
smaller than the ideal size, which requires a lot more parsing of the
bitmaps. To reduce the amount of bitmap searching, this patch
implements a scheme whereby each rgrp keeps track of the point
at this multi-block reservations will fail.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is just basically a resend of a patch I posted earlier.
It didn't change from its original, except in diff offsets, etc:
This patch fixes a bug in the GFS2 block allocation code. The problem
starts if a process already has a multi-block reservation, but for
some reason, another process disqualifies it from further allocations.
For example, the other process might set on the GFS2_RDF_ERROR bit.
The process holding the reservation jumps to label skip_rgrp, but
that label comes after the code that removes the reservation from the
tree. Therefore, the no longer usable reservation is not removed from
the rgrp's reservations tree; it's lost. Eventually, the lost reservation
causes the count of reserved blocks to get off, and eventually that
causes a BUG_ON(rs->rs_rbm.rgd->rd_reserved < rs->rs_free) to trigger.
This patch moves the call to after label skip_rgrp so that the
disqualified reservation is properly removed from the tree, thus keeping
the rgrp rd_reserved count sane.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Here is a second try at a patch I posted earlier, which also implements
suggestions Steve made:
Before this patch, GFS2 would keep searching through all the rgrps
until it found one that had a chunk of free blocks big enough to
satisfy the size hint, which is based on the file write size,
regardless of whether the chunk was big enough to perform the write.
However, when doing big writes there may not be a large enough
chunk of free blocks in any rgrp, due to file system fragmentation.
The largest chunk may be big enough to satisfy the write request,
but it may not meet the ideal reservation size from the "size hint".
The writes would slow to a crawl because every write would search
every rgrp, then finally give up and default to a single-block write.
In my case, performance would drop from 425MB/s to 18KB/s, or 24000
times slower.
This patch basically makes it so that if we can't find a contiguous
chunk of blocks big enough to satisfy the sizehint, we'll use the
largest chunk of blocks we found that will still contain the write.
It does so by keeping track of the largest run of blocks within the
rgrp.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
commit 557ce2646e
"nfsd41: replace page based DRC with buffer based DRC"
have remove unused nfsd4_set_statp, but miss the function definition.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
commit 58cd57bfd9
"nfsd: Fix SP4_MACH_CRED negotiation in EXCHANGE_ID"
miss calculating the length of bitmap for spo_must_enforce and spo_must_allow.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Merge patches from Andrew Morton:
"Ten fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL
sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
drivers/dma/ioat/dma.c: check DMA mapping error in ioat_dma_self_test()
mm/memory-failure.c: transfer page count from head page to tail page after split thp
MAINTAINERS: set up proper record for Xilinx Zynq
mm: remove bogus warning in copy_huge_pmd()
memcg: fix memcg_size() calculation
mm: fix use-after-free in sys_remap_file_pages
mm: munlock: fix deadlock in __munlock_pagevec()
mm: munlock: fix a bug where THP tail page is encountered
The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
epoll_ctl(b, EPOLL_CTL_DEL, a, x). The deadlock was introduced with
commmit 67347fe4e6 ("epoll: do not take global 'epmutex' for simple
topologies").
The acquistion of the ep->mtx for the destination 'ep' was added such
that a concurrent EPOLL_CTL_ADD operation would see the correct state of
the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')
However, by simply not acquiring the lock, we do not serialize behind
the ep->mtx from the add path, and thus may perform a full path check
when if we had waited a little longer it may not have been necessary.
However, this is a transient state, and performing the full loop
checking in this case is not harmful.
The important point is that we wouldn't miss doing the full loop
checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
its operating upon. The reason we don't need to do lock ordering in the
add path, is that we are already are holding the global 'epmutex'
whenever we do the double lock. Further, the original posting of this
patch, which was tested for the intended performance gains, did not
perform this additional locking.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Nelson Elhage <nelhage@nelhage.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
s_umount which is copied in from the core vfs, two patches
relate to a hard to hit "use after free" and memory leak.
Two patches related to using DIO and buffered I/O on the same
file to ensure correct operation in relation to glock state
changes. The final patch adds an RCU read lock to ensure
correct locking on an error path.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJSxVxyAAoJEMrg3m4a/8jSP1kQAKkyW6DgevgZ+IHlm5+mhTeZ
Bpdy3l6DdxZIiqoG0VqJo6DoeR4td+1q7TfyjpvFvgxjU/m/nLhKFcNd1A6TN3OK
G9Y6q0k0aWsAUPUjg3Y6gFRAlXHQaGXQ3nMDmoTCdYSYqid8gB+oqPbfwf5uHAgU
GVPgKxqSsJmzxPYTjpjx8mdpgiwCHa+iB+reoqxNSdxJnAk93GrBA7efonNoxKB1
r8VJlgkJubMjxGMu6xQYLMyt1Xed85sbiASOdE+Thw700tBA/ZAtKuB8xZ4+X1Fd
M5osKYnqodde+A3aSi6P7b+M6N+WyA/7bHhckbaQy8cwpC9xhgEqsEsIEFm0eJjB
wbdGe2tsCTUvLy37++D5e88cF9O2F6Ku0MJJtb7KsTLZPFD9XXs/6/xx4vSSNKQt
FC7BF5dkQiLDJvy1xvcHK43+PbOaS7/8WM1NuoNAS/L/3RYFrrHby3LqBo+kcUbV
L9HoL8aJd60bsX7PceXA9UzaH8yk/yTgeyOtd2+VCiRVldvNtx32ylTJLUqqxeRi
AL/tZWgxwPKb54AJMptPZ0fGP5A+pUhQgTm7fJCwrUdXQXWUW0YYK2sV3H9BZ8px
Ga0PuJtjxj8OkGFwnugEtuQNGQ9M5uCX4UiELqP3rVRNpq4e8UkOZRqtHsU7urSB
ezufwdI+b+uHUucva31D
=KSDi
-----END PGP SIGNATURE-----
Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes
Pull GFS2 fixes from Steven Whitehouse:
"Here is a set of small fixes for GFS2. There is a fix to drop
s_umount which is copied in from the core vfs, two patches relate to a
hard to hit "use after free" and memory leak. Two patches related to
using DIO and buffered I/O on the same file to ensure correct
operation in relation to glock state changes. The final patch adds an
RCU read lock to ensure correct locking on an error path"
* tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
GFS2: Fix unsafe dereference in dump_holder()
GFS2: Wait for async DIO in glock state changes
GFS2: Fix incorrect invalidation for DIO/buffered I/O
GFS2: Fix slab memory leak in gfs2_bufdata
GFS2: Fix use-after-free race when calling gfs2_remove_from_ail
GFS2: don't hold s_umount over blkdev_put
There is a potential overflow if the specified EA value size is
greater than USHRT_MAX because the size of value is limited by
the on-disk format (i.e, __le16), this issue could be reflected
via the tests below:
# touch /jfs/testfile
# setfattr -n user.comment -v `perl -e 'print "A"x65536'` /jfs/testfile
setfattr: /jfs/testfile: Invalid argument
Syslog:
... jfs_xsetattr: xattr_size = 21, new_size = 65557
This patch add pre-checkups of EA value size against USHRT_MAX to
avoid this problem, and return -E2BIG which is consistent with the
VFS setxattr interface. Moreover, fix the debug code to print the
correct function name.
With this fix:
setfattr: /jfs/testfile: Argument list too long
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
GLOCK_BUG_ON() might call this function without RCU read lock. Make sure that
RCU read lock is held when using task_struct returned from pid_task().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In preparation for ceph_features.h update, change all features fields
from unsigned int/u32 to u64. (ceph.git has ~40 feature bits at this
point.)
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Currently, if one new page allocated into fscache in readpage(), however,
with no data read into due to error encountered during reading from OSDs,
the slot in fscache is not uncached. This patch fixes this.
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Milosz Tanski <milosz@adfin.com>
Adds cap check to the page fault handler. The check prevents page
fault handler from adding new page to the page cache while Fcb caps
are being revoked. This solves Fc revoking hang in multiple clients
mmap IO workload.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
Set FILE_CREATED on O_CREAT|O_EXCL.
cifs code didn't change during commit 116cc02253
Kernel bugzilla 66251
Signed-off-by: Shirish Pargaonkar <spargaonkar@suse.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
When we obtain tcon from cifs_sb, we use cifs_sb_tlink() to first obtain
tlink which also grabs a reference to it. We do not drop this reference
to tlink once we are done with the call.
The patch fixes this issue by instead passing tcon as a parameter and
avoids having to obtain a reference to the tlink. A lookup for the tcon
is already made in the calling functions and this way we avoid having to
re-run the lookup. This is also consistent with the argument list for
other similar calls for M-F symlinks.
We should also return an ENOSYS when we do not find a protocol specific
function to lookup the MF Symlink data.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
This patch locates checking the inline_data prior to calling f2fs_lock_op()
in truncate_blocks(), since getting the lock is unnecessary.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
and a patch so that instead of BUG'ing we use the ext4_error()
framework to mark the file system is corrupted.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABCAAGBQJSuSZnAAoJENNvdpvBGATwp/8P/R/VjV1O4IhDmRxbc7mNAkoB
Mfh2Utnk9daGdMpSMhzWW6m3oyohA0ICledBusQ3ax6Ymg8jIcmGwm3rJ8gAXvR1
4g0rQ1nw3JEGROId58FnKB3fsEmOPlt4T/LKL4boY6BfER4yu1htH0zSKBuKqykt
feH9dMiaR1KMQ613eWY6GEonYaP8+nI1GxEfvrymInxznDPVuaLgR4oBMAmR8R76
9vfJfFHYjbk1wQ5UEv94tic8Hi055PGCRfsLc79QwxMr5KyKz+NydDUIjgKjP9pu
9sz8iuV79M5/hUguZY7HH9Xd0byZ+jPuNrpkrDqSNZYuArfIcsXKZM/dm7HOgFGQ
dQzf9S/kBzJvcSHuUchhS2cm6kxCsHaqo16Fxs5kP3TmB3TrVr7EV6uBS4cm53PJ
x6IdAORhbURfuJCRQOi/TDNUrb+ZHvIx7Gc1ujizczC3An7QurfYo7XY/rWfdj41
eIVy0+1gqvWJsbXGInni1hKbXMU3yTJ0MqQm05A7MW/G2G6eIgEVpz8MElm33jEE
VvC6KyZxpviRYPUmxxTbSa1vl0UG1rZdZXslgmlSyY1yItVmyTCIAG217JOTyhTX
Ae1aZEgzLYh6dQAwweme79WF4WsBPP28TOmW2xoOH7t04gMG0t+9b/RbUDtTPItc
HXNmIlFP9CULIQ1c2Cvh
=KPNa
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"A collection of bug fixes destined for stable and some printk cleanups
and a patch so that instead of BUG'ing we use the ext4_error()
framework to mark the file system is corrupted"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: add explicit casts when masking cluster sizes
ext4: fix deadlock when writing in ENOSPC conditions
jbd2: rename obsoleted msg JBD->JBD2
jbd2: revise KERN_EMERG error messages
jbd2: don't BUG but return ENOSPC if a handle runs out of space
ext4: Do not reserve clusters when fs doesn't support extents
ext4: fix del_timer() misuse for ->s_err_report
ext4: check for overlapping extents in ext4_valid_extent_entries()
ext4: fix use-after-free in ext4_mb_new_blocks
ext4: call ext4_error_inode() if jbd2_journal_dirty_metadata() fails
Hook inline data read/write, truncate, fallocate, setattr, etc.
Files need meet following 2 requirement to inline:
1) file size is not greater than MAX_INLINE_DATA;
2) file doesn't pre-allocate data blocks by fallocate().
FI_INLINE_DATA will not be set while creating a new regular inode because
most of the files are bigger than ~3.4K. Set FI_INLINE_DATA only when
data is submitted to block layer, ranther than set it while creating a new
inode, this also avoids converting data from inline to normal data block
and vice versa.
While writting inline data to inode block, the first data block should be
released if the file has a block indexed by i_addr[0].
On the other hand, when a file operation is appied to a file with inline
data, we need to test if this file can remain inline by doing this
operation, otherwise it should be convert into normal file by reserving
a new data block, copying inline data to this new block and clear
FI_INLINE_DATA flag. Because reserve a new data block here will make use
of i_addr[0], if we save inline data in i_addr[0..872], then the first
4 bytes would be overwriten. This problem can be avoided simply by
not using i_addr[0] for inline data.
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Functions to implement inline data read/write, and move inline data to
normal data block when file size exceeds inline data limitation.
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, we need to calculate the max orphan num when we try to acquire an
orphan inode, but it's a stable value since the super block was inited. So
converting it to a field of f2fs_sb_info and use it directly when needed seems
a better choose.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The f2fs supports 4KB block size. If user requests dwrite with under 4KB data,
it allocates a new 4KB data block.
However, f2fs doesn't add zero data into the untouched data area inside the
newly allocated data block.
This incurs an error during the xfstest #263 test as follow.
263 12s ... [failed, exit status 1] - output mismatch (see 263.out.bad)
--- 263.out 2013-03-09 03:37:15.043967603 +0900
+++ 263.out.bad 2013-12-27 04:20:39.230203114 +0900
@@ -1,3 +1,976 @@
QA output created by 263
fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
+fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
+truncating to largest ever: 0x12a00
+truncating to largest ever: 0x75400
+fallocating to largest ever: 0x79cbf
...
(Run 'diff -u 263.out 263.out.bad' to see the entire diff)
Ran: 263
Failures: 263
Failed 1 of 1 tests
It turns out that, when the test tries to write 2KB data with dio, the new dio
path allocates 4KB data block without filling zero data inside the remained 2KB
area. Finally, the output file contains a garbage data for that region.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When get_dnode_of_data() in get_data_block() returns a successful dnode, we
should put the dnode.
But, previously, if its data block address is equal to NEW_ADDR, we didn't do
that, resulting in a deadlock condition.
So, this patch splits original error conditions with this case, and then calls
f2fs_put_dnode before finishing the function.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces F2FS_INODE that returns struct f2fs_inode * from the inode
page.
By using this macro, we can remove unnecessary casting codes like below.
struct f2fs_inode *ri = &F2FS_NODE(inode_page)->i;
-> struct f2fs_inode *ri = F2FS_INODE(inode_page);
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In current flow, we will get Null return value of f2fs_find_entry in
recover_dentry when name.len is bigger than F2FS_NAME_LEN, and then we
still add this inode into its dir entry.
To avoid this situation, we must check filename length before we use it.
Another point is that we could remove the code of checking filename length
In f2fs_find_entry, because f2fs_lookup will be called previously to ensure of
validity of filename length.
V2:
o add WARN_ON() as Jaegeuk Kim suggested.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Pull ext2 fix from Jan Kara:
"One simple fix of oops in ext2 which was recently hit by Christoph"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: Fix oops in ext2_get_block() called from ext2_quota_write()
Lockdep is complaining about UDF:
=============================================
[ INFO: possible recursive locking detected ]
3.12.0+ #16 Not tainted
---------------------------------------------
ln/7386 is trying to acquire lock:
(&ei->i_data_sem){+.+...}, at: [<ffffffff8142f06d>] udf_get_block+0x8d/0x130
but task is already holding lock:
(&ei->i_data_sem){+.+...}, at: [<ffffffff81431a8d>] udf_symlink+0x8d/0x690
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&ei->i_data_sem);
lock(&ei->i_data_sem);
*** DEADLOCK ***
This is because we hold i_data_sem of the symlink inode while calling
udf_add_entry() for the directory. I don't think this can ever lead to
deadlocks since we never hold i_data_sem for two inodes in any other
place.
The fix is simple - move unlock of i_data_sem for symlink inode up. We
don't need it for anything when linking symlink inode to directory.
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
When we rename a dir to new name which is not exist previous,
we will set pino of parent inode with ino of child inode in f2fs_set_link.
It destroy consistency of pino, it should be fixed.
Thanks for previous work of Shu Tan.
Signed-off-by: Shu Tan <shu.tan@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Update several comments:
1. use f2fs_{un}lock_op install of mutex_{un}lock_op.
2. update comment of get_data_block().
3. update description of node offset.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When using the f2fs_io_info in the low level, we still need to merge the
rw and rw_flag, so use the rw to hold all the io flags directly,
and remove the rw_flag field.
ps.It is based on the previous patch:
f2fs: move all the bio initialization into __bio_alloc
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Move all the bio initialization into __bio_alloc, and some minor cleanups are
also added.
v3:
Use 'bool' rather than 'int' as Kim suggested.
v2:
Use 'is_read' rather than 'rw' as Yu Chao suggested.
Remove the needless initialization of bio->bi_private.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch enhances writing dirty meta pages collectively in background.
During the file data writes, it'd better avoid to write small dirty meta pages
frequently.
So let's give a chance to collect a number of dirty meta pages for a while.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs doesn't support direct IOs with high performance, which throws
every write requests via the buffered write path, resulting in highly
performance degradation due to memory opeations like copy_from_user.
This patch introduces a new direct IO path in which every write requests are
processed by generic blockdev_direct_IO() with enhanced get_block function.
The get_data_block() in f2fs handles:
1. if original data blocks are allocates, then give them to blockdev.
2. otherwise,
a. preallocate requested block addresses
b. do not use extent cache for better performance
c. give the block addresses to blockdev
This policy induces that:
- new allocated data are sequentially written to the disk
- updated data are randomly written to the disk.
- f2fs gives consistency on its file meta, not file data.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>