* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits)
ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
ext3: Fix data / filesystem corruption when write fails to copy data
ext4: Support for 64-bit quota format
ext3: Support for vfsv1 quota format
quota: Implement quota format with 64-bit space and inode limits
quota: Move definition of QFMT_OCFS2 to linux/quota.h
ext2: fix comment in ext2_find_entry about return values
ext3: Unify log messages in ext3
ext2: clear uptodate flag on super block I/O error
ext2: Unify log messages in ext2
ext3: make "norecovery" an alias for "noload"
ext3: Don't update the superblock in ext3_statfs()
ext3: journal all modifications in ext3_xattr_set_handle
ext2: Explicitly assign values to on-disk enum of filetypes
quota: Fix WARN_ON in lookup_one_len
const: struct quota_format_ops
ubifs: remove manual O_SYNC handling
afs: remove manual O_SYNC handling
kill wait_on_page_writeback_range
vfs: Implement proper O_SYNC semantics
...
By teaching sysfs_revalidate to hide a dentry for
a sysfs_dirent if the sysfs_dirent has been renamed,
and by teaching sysfs_lookup to return the original
dentry if the sysfs dirent has been renamed. I can
show the results of renames correctly without having to
update the dcache during the directory rename.
This massively simplifies the rename logic allowing a lot
of weird sysfs special cases to be removed along with
a lot of now unnecesary helper code.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
While Linux provided an O_SYNC flag basically since day 1, it took until
Linux 2.4.0-test12pre2 to actually get it implemented for filesystems,
since that day we had generic_osync_around with only minor changes and the
great "For now, when the user asks for O_SYNC, we'll actually give
O_DSYNC" comment. This patch intends to actually give us real O_SYNC
semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC
patches which are required before this patch it's actually surprisingly
simple, we just need to figure out when to set the datasync flag to
vfs_fsync_range and when not.
This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's
numerical value to keep binary compatibility, and adds a new real O_SYNC
flag. To guarantee backwards compatiblity it is defined as expanding to
both the O_DSYNC and the new additional binary flag (__O_SYNC) to make
sure we are backwards-compatible when compiled against the new headers.
This also means that all places that don't care about the differences can
just check O_DSYNC and get the right behaviour for O_SYNC, too - only
places that actuall care need to check __O_SYNC in addition. Drivers and
network filesystems have been updated in a fail safe way to always do the
full sync magic if O_DSYNC is set. The few places setting O_SYNC for
lower layers are kept that way for now to stay failsafe.
We enforce that O_DSYNC is set when __O_SYNC is set early in the open path
to make sure we always get these sane options.
Note that parisc really screwed up their headers as they already define a
O_DSYNC that has always been a no-op. We try to repair it by using it for
the new O_DSYNC and redefinining O_SYNC to send both the traditional
O_SYNC numerical value _and_ the O_DSYNC one.
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger@sun.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (57 commits)
binfmt_elf: fix PT_INTERP bss handling
TPM: Fixup boot probe timeout for tpm_tis driver
sysfs: Add labeling support for sysfs
LSM/SELinux: inode_{get,set,notify}secctx hooks to access LSM security context information.
VFS: Factor out part of vfs_setxattr so it can be called from the SELinux hook for inode_setsecctx.
KEYS: Add missing linux/tracehook.h #inclusions
KEYS: Fix default security_session_to_parent()
Security/SELinux: includecheck fix kernel/sysctl.c
KEYS: security_cred_alloc_blank() should return int under all circumstances
IMA: open new file for read
KEYS: Add a keyctl to install a process's session keyring on its parent [try #6]
KEYS: Extend TIF_NOTIFY_RESUME to (almost) all architectures [try #6]
KEYS: Do some whitespace cleanups [try #6]
KEYS: Make /proc/keys use keyid not numread as file position [try #6]
KEYS: Add garbage collection for dead, revoked and expired keys. [try #6]
KEYS: Flag dead keys to induce EKEYREVOKED [try #6]
KEYS: Allow keyctl_revoke() on keys that have SETATTR but not WRITE perm [try #6]
KEYS: Deal with dead-type keys appropriately [try #6]
CRED: Add some configurable debugging [try #6]
selinux: Support for the new TUN LSM hooks
...
This is stage one in flattening out the callchains for the common
permission testing. Rather than have most filesystem implement their
own inode->i_op->permission function that just calls back down to the
VFS layers 'generic_permission()' with the per-filesystem ACL checking
function, the filesystem can just expose its 'check_acl' function
directly, and let the VFS layer do everything for it.
This is all just preparatory - no filesystem actually enables this yet.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't call down to the generic inode_permission() function just to
call the inode-specific permission function - just do it directly.
The generic inode_permission() code does things like checking MAY_WRITE
and devcgroup_inode_permission(), neither of which are relevant for the
light pathname walk permission checks (we always do just MAY_EXEC, and
the inode is never a special device).
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function is only called for path components that are already known
to be directories (they have a '->lookup' method). So don't bother
doing that whole S_ISDIR() testing, the whole point of the 'lite()'
version is that we know that we are looking at a directory component,
and that we're only checking name lookup permission.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of returning EAGAIN and having the caller do something
special for that case, just do the special case directly.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Not only is that a supremely timing-critical path, but it's hopefully
some day going to be lockless for the common case, and ima can't do
that.
Plus the integrity code doesn't even care about non-regular files, so it
was always a total waste of time and effort.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- As ima_counts_put() may be called after the inode has been freed,
verify that the inode is not NULL, before dereferencing it.
- Maintain the IMA file counters in may_open() properly, decrementing
any counter increments on subsequent errors.
Reported-by: Ciprian Docan <docan@eden.rutgers.edu>
Reported-by: J.R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Eric Paris <eparis@redhat.com
Signed-off-by: James Morris <jmorris@namei.org>
When creating a new file, ima_path_check() assumed the new file
was being opened for write. Call ima_path_check() with the
appropriate acc_mode so that the read/write counters are
incremented correctly.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
This patch fixes an imbalance message as reported by J.R. Okajima.
The IMA file counters are incremented in ima_path_check. If the
actual open fails, such as ETXTBSY, decrement the counters to
prevent unnecessary imbalance messages.
Reported-by: J.R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
commit 2a73787110 "Cache root in nameidata"
introduced a new member nd->root, but forgot to put it in do_filp_open().
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New field: nd->root. When pathname resolution wants to know the root,
check if nd->root.mnt is non-NULL; use nd->root if it is, otherwise
copy current->fs->root there. After path_walk() is finished, we check
if we'd got a cached value in nd->root and drop it. Before calling
path_walk() we should either set nd->root.mnt to NULL *or* copy (and
pin down) some path to nd->root. In the latter case we won't be
looking at current->fs->root at all.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split do_path_lookup(), opencode the call from do_filp_open()
do_filp_open() is the only caller of do_path_lookup() that
cares about root afterwards (it keeps resolving symlinks on
O_CREAT path after it'd done LOOKUP_PARENT walk). So when
we start caching fs->root in path_walk(), it'll need a different
treatment.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Add support in ima_path_check() for integrity checking without
incrementing the counts. (Required for nfsd.)
- rename and export opencount_get to ima_counts_get
- replace ima_shm_check calls with ima_counts_get
- export ima_path_check
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Commit 14f7dd63 ("Copy XFS readdir hack into nfsd code") introduced a
bug to generic code which had been extant for a long time in the XFS
version -- it started to call through into lookup_one_len() and hence
into the file systems' ->lookup() methods without i_mutex held on the
directory.
This patch fixes it by locking the directory's i_mutex again before
calling the filldir functions. The original deadlocks which commit
14f7dd63 was designed to avoid are still avoided, because they were due
to fs-internal locking, not i_mutex.
While we're at it, fix the return type of nfsd_buffered_readdir() which
should be a __be32 not an int -- it's an NFS errno, not a Linux errno.
And return nfserrno(-ENOMEM) when allocation fails, not just -ENOMEM.
Sparse would have caught that, if it wasn't so busy bitching about
__cold__.
Commit 05f4f678 ("nfsd4: don't do lookup within readdir in recovery
code") introduced a similar problem with calling lookup_one_len()
without i_mutex, which this patch also addresses. To fix that, it was
necessary to fix the called functions so that they expect i_mutex to be
held; that part was done by J. Bruce Fields.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Umm-I-can-live-with-that-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Tested-by: J. Bruce Fields <bfields@citi.umich.edu>
LKML-Reference: <8036.1237474444@jrobl>
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't pull it in sched.h; very few files actually need it and those
can include directly. sched.h itself only needs forward declaration
of struct fs_struct;
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pure code move; two new helper functions for nfsd and daemonize
(unshare_fs_struct() and daemonize_fs_struct() resp.; for now -
the same code as used to be in callers). unshare_fs_struct()
exported (for nfsd, as copy_fs_struct()/exit_fs() used to be),
copy_fs_struct() and exit_fs() don't need exports anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits)
ext2: Zero our b_size in ext2_quota_read()
trivial: fix typos/grammar errors in fs/Kconfig
quota: Coding style fixes
quota: Remove superfluous inlines
quota: Remove uppercase aliases for quota functions.
nfsd: Use lowercase names of quota functions
jfs: Use lowercase names of quota functions
udf: Use lowercase names of quota functions
ufs: Use lowercase names of quota functions
reiserfs: Use lowercase names of quota functions
ext4: Use lowercase names of quota functions
ext3: Use lowercase names of quota functions
ext2: Use lowercase names of quota functions
ramfs: Remove quota call
vfs: Use lowercase names of quota functions
quota: Remove dqbuf_t and other cleanups
quota: Remove NODQUOT macro
quota: Make global quota locks cacheline aligned
quota: Move quota files into separate directory
ext4: quota reservation for delayed allocation
...
Add a switch for the various i_mode fmt cases, and remove the comment
about writeability of devices nodes - that part is handled in
inode_permission and comment on (briefly) there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
This patch replaces the generic integrity hooks, for which IMA registered
itself, with IMA integrity hooks in the appropriate places directly
in the fs directory.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
inotify: fix type errors in interfaces
fix breakage in reiserfs_new_inode()
fix the treatment of jfs special inodes
vfs: remove duplicate code in get_fs_type()
add a vfs_fsync helper
sys_execve and sys_uselib do not call into fsnotify
zero i_uid/i_gid on inode allocation
inode->i_op is never NULL
ntfs: don't NULL i_op
isofs check for NULL ->i_op in root directory is dead code
affs: do not zero ->i_op
kill suid bit only for regular files
vfs: lseek(fd, 0, SEEK_CUR) race condition
We used to have rather schizophrenic set of checks for NULL ->i_op even
though it had been eliminated years ago. You'd need to go out of your
way to set it to NULL explicitly _and_ a bunch of code would die on
such inodes anyway. After killing two remaining places that still
did that bogosity, all that crap can go away.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[AV: rediffed on top of unification of init_fs]
Initialization of init_fs still uses the deprecated RW_LOCK_UNLOCKED macro.
This patch updates it to use the __RW_LOCK_UNLOCKED(lock) macro.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With all the nameidata removal there's no point anymore for this helper.
Of the three callers left two will go away with the next lookup series
anyway.
Also add proper kerneldoc to inode_permission as this is the main
permission check routine now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No need for the nameidata in may_open - a struct path is enough.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
walk_init_root is a tiny helper that is marked __always_inline, has just
one caller and an unused argument. Just merge it into the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We now pass on all MAY_ flags to the filesystems permission routines,
so remove the comment stating the contrary.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On-disk data corruption could cause a page link to have its i_size set
to PAGE_SIZE (or a multiple thereof) and its contents all non-NUL.
NUL-terminate the link name to ensure this doesn't cause further
problems for the kernel.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add new LSM hooks for path-based checks. Call them on directory-modifying
operations at the points where we still know the vfsmount involved.
Signed-off-by: Kentaro Takeda <takedakn@nttdata.co.jp>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Toshiharu Harada <haradats@nttdata.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
fs/nfsd/nfs4recover.c
Manually fixed above to use new creds API functions, e.g.
nfs4_save_creds().
Signed-off-by: James Morris <jmorris@namei.org>
Peter Cordes is sorry that he rm'ed his swapfiles while they were in use,
he then had no pathname to swapoff. It's a curious little oversight, but
not one worth a lot of hackery. Kudos to Willy Tarreau for turning this
around from a discussion of synthetic pathnames to how to prevent unlink.
Mimic immutable: prohibit unlinking an active swapfile in may_delete()
(and don't worry my little head over the tiny race window).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Willy Tarreau <w@1wt.eu>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Peter Cordes <peter@cordes.ca>
Cc: Bodo Eggert <7eggert@gmx.de>
Cc: David Newall <davidn@davidnewall.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
For execute permission on a regular files we need to check if file has
any execute bits at all, regardless of capabilites.
This check is normally performed by generic_permission() but was also
added to the case when the filesystem defines its own ->permission()
method. In the latter case the filesystem should be responsible for
performing this check.
Move the check from inode_permission() inside filesystems which are
not calling generic_permission().
Create a helper function execute_ok() that returns true if the inode
is a directory or if any execute bits are present in i_mode.
Also fix up the following code:
- coda control file is never executable
- sysctl files are never executable
- hfs_permission seems broken on MAY_EXEC, remove
- hfsplus_permission is eqivalent to generic_permission(), remove
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This adds LOOKUP_RENAME_TARGET intent for lookup of rename destination.
LOOKUP_RENAME_TARGET is going to be used like LOOKUP_CREATE. But since
the destination of rename() can be existing directory entry, so it has a
difference. Although that difference doesn't matter in my usage, this
tells it to user of this intent.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
lookup_hash() with LOOKUP_PARENT is bogus. And this prepares to add
new intent on those path.
The user of LOOKUP_PARENT intent is nfs only, and it checks whether
nd->flags has LOOKUP_CREATE or LOOKUP_OPEN, so the result is same.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
This adds d_ancestor() instead of d_isparent(), then use it.
If new_dentry == old_dentry, is_subdir() returns 1, looks strange.
"new_dentry == old_dentry" is not subdir obviously. But I'm not
checking callers for now, so this keeps current behavior.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
New flag: LOOKUP_EXCL. Set before doing the final step of pathname
resolution on the paths that have LOOKUP_CREATE and O_EXCL.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and don't pass bogus flags when we are just looking for parent.
Fold __path_lookup_intent_open() into path_lookup_open() while we
are at it; that's the only remaining caller.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
for July 17: early crash on x86-64)
SELinux needs MAY_APPEND to be passed down to the security hook.
Otherwise, we get permission denials when only append permission is
granted by policy even if the opening process specified O_APPEND.
Shows up as a regression in the ltp selinux testsuite, fixed by
this patch.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* do not pass nameidata; struct path is all the callers want.
* switch to new helpers:
user_path_at(dfd, pathname, flags, &path)
user_path(pathname, &path)
user_lpath(pathname, &path)
user_path_dir(pathname, &path) (fail if not a directory)
The last 3 are trivial macro wrappers for the first one.
* remove nameidata in callers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* MAY_CHDIR is redundant - it's an equivalent of MAY_ACCESS
* MAY_ACCESS on fuse should affect only the last step of pathname resolution
* fchdir() and chroot() should pass MAY_ACCESS, for the same reason why
chdir() needs that.
* now that we pass MAY_ACCESS explicitly in all cases, LOOKUP_ACCESS can be
removed; it has no business being in nameidata.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... so we ought to pass MAY_CHDIR to vfs_permission() instead of having
it triggered on every step of preceding pathname resolution. LOOKUP_CHDIR
is killed by that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the unused mode parameter from vfs_symlink and callers.
Thanks to Tetsuo Handa for noticing.
CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Why not reuse "inode" which is assigned as
struct inode *inode = old_dentry->d_inode;
in the beginning of vfs_link() ?
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* kill nameidata * argument; map the 3 bits in ->flags anybody cares
about to new MAY_... ones and pass with the mask.
* kill redundant gfs2_iop_permission()
* sanitize ecryptfs_permission()
* fix remaining places where ->permission() instances might barf on new
MAY_... found in mask.
The obvious next target in that direction is permission(9)
folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Lookup can install a child dentry for a deleted directory. This keeps
the directory dentry alive, and the inode pinned in the cache and on
disk, even after all external references have gone away.
This isn't a big problem normally, since memory pressure or umount
will clear out the directory dentry and its children, releasing the
inode. But for UBIFS this causes problems because its orphan area can
overflow.
Fix this by returning ENOENT for all lookups on a S_DEAD directory
before creating a child dentry.
Thanks to Zoltan Sogor for noticing this while testing UBIFS, and
Artem for the excellent analysis of the problem and testing.
Reported-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Tested-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
generic_readlink calls ERR_PTR for negative and positive values
(vfs_readlink returns length of "link"), but it should not
(not an errno) and does not need to.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Here are some more places where path_{get,put}() can be used instead of
dput()/mntput() pair.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In case when both EEXIST and EROFS would apply we used to
return the former in mkdir(2) and friends. Lest anyone suspects
us of being consistent, in the same situation knfsd gave clients
nfs_erofs...
ro-bind series had switched the syscall side of things to
returning -EROFS and immediately broke an application - namely,
mkdir -p. Patch restores the original behaviour...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Implement a cgroup to track and enforce open and mknod restrictions on device
files. A device cgroup associates a device access whitelist with each cgroup.
A whitelist entry has 4 fields. 'type' is a (all), c (char), or b (block).
'all' means it applies to all types and all major and minor numbers. Major
and minor are either an integer or * for all. Access is a composition of r
(read), w (write), and m (mknod).
The root device cgroup starts with rwm to 'all'. A child devcg gets a copy of
the parent. Admins can then remove devices from the whitelist or add new
entries. A child cgroup can never receive a device access which is denied its
parent. However when a device access is removed from a parent it will not
also be removed from the child(ren).
An entry is added using devices.allow, and removed using
devices.deny. For instance
echo 'c 1:3 mr' > /cgroups/1/devices.allow
allows cgroup 1 to read and mknod the device usually known as
/dev/null. Doing
echo a > /cgroups/1/devices.deny
will remove the default 'a *:* mrw' entry.
CAP_SYS_ADMIN is needed to change permissions or move another task to a new
cgroup. A cgroup may not be granted more permissions than the cgroup's parent
has. Any task can move itself between cgroups. This won't be sufficient, but
we can decide the best way to adequately restrict movement later.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix may-be-used-uninitialized warning]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: James Morris <jmorris@namei.org>
Looks-good-to: Pavel Emelyanov <xemul@openvz.org>
Cc: Daniel Hokka Zakrisson <daniel@hozac.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the first really tricky patch in the series. It elevates the writer
count on a mount each time a non-special file is opened for write.
We used to do this in may_open(), but Miklos pointed out that __dentry_open()
is used as well to create filps. This will cover even those cases, while a
call in may_open() would not have.
There is also an elevated count around the vfs_create() call in open_namei().
See the comments for more details, but we need this to fix a 'create, remount,
fail r/w open()' race.
Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This also uses the little helper in the NFS code to make an if() a little bit
less ugly. We introduced the helper at the beginning of the series.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().
So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch and into a helper function
suggested by Christoph.
This also encapsulates a fix for mknod(S_IFREG) that Miklos
found.
[AV: merged mkdir handling, added missing nfsd pieces]
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Elevate the write count during the vfs_rmdir() and vfs_unlink().
[AV: merged rmdir and unlink parts, added missing pieces in nfsd]
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
open_namei() will, in the future, need to take mount write counts
over its creation and truncation (via may_open()) operations. It
needs to keep these write counts until any potential filp that is
created gets __fput()'d.
This gets complicated in the error handling and becomes very murky
as to how far open_namei() actually got, and whether or not that
mount write count was taken. That makes it a bad interface.
All that the current do_filp_open() really does is allocate the
nameidata on the stack, then call open_namei().
So, this merges those two functions and moves filp_open() over
to namei.c so it can be close to its buddy: do_filp_open(). It
also gets a kerneldoc comment in the process.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
My end goal here is to make sure all users of may_open()
return filps. This will ensure that we properly release
mount write counts which were taken for the filp in
may_open().
This patch moves the sys_open flags to namei flags
calculation into fs/namei.c. We'll shortly be moving
the nameidata_to_filp() calls into namei.c, and this
gets the sys_open flags to a place where we can get
at them when we need them.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
[PATCH] get stack footprint of pathname resolution back to relative sanity
[PATCH] double iput() on failure exit in hugetlb
[PATCH] double dput() on failure exit in tiny-shmem
[PATCH] fix up new filp allocators
[PATCH] check for null vfsmount in dentry_open()
[PATCH] reiserfs: eliminate private use of struct file in xattr
[PATCH] sanitize hppfs
hppfs pass vfsmount to dentry_open()
[PATCH] restore export of do_kern_mount()
Fix kernel-doc notation warnings in fs/.
Warning(mmotm-2008-0314-1449//fs/super.c:560): missing initial short description on line:
* mark_files_ro
Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
* lease_get_mtime
Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
* lease_get_mtime
Warning(mmotm-2008-0314-1449//fs/namei.c:1368): missing initial short description on line:
* lookup_one_len: filesystem helper to lookup single pathname component
Warning(mmotm-2008-0314-1449//fs/buffer.c:3221): missing initial short description on line:
* bh_uptodate_or_lock: Test whether the buffer is uptodate
Warning(mmotm-2008-0314-1449//fs/buffer.c:3240): missing initial short description on line:
* bh_submit_read: Submit a locked buffer for reading
Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:30): missing initial short description on line:
* writeback_acquire: attempt to get exclusive writeback access to a device
Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:47): missing initial short description on line:
* writeback_in_progress: determine whether there is writeback in progress
Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:58): missing initial short description on line:
* writeback_release: relinquish exclusive writeback access against a device.
Warning(mmotm-2008-0314-1449//include/linux/jbd.h:351): contents before sections
Warning(mmotm-2008-0314-1449//include/linux/jbd.h:561): contents before sections
Warning(mmotm-2008-0314-1449//fs/jbd/transaction.c:1935): missing initial short description on line:
* void journal_invalidatepage()
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Somebody had put struct nameidata in stack frame of link_path_walk().
Unfortunately, there are certain realities to deal with:
* It's in the middle of recursion. Depth is equal to the nesting
depth of symlinks, i.e. up to 8.
* struct namiedata is, even if one discards the intent junk,
at least 12 pointers + 5 ints.
* moreover, adding a stack frame is not free in that situation.
* there are fs methods called on top of that, and they also have
stack footprint.
* kernel stack is not infinite.
The thing is, even if one chooses to deal with -ESTALE that way (and it's
one hell of an overkill), the only thing that needs to be preserved is
vfsmount + dentry, not the entire struct nameidata.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Use struct path in fs_struct.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This introduces the symmetric function to path_put() for getting a reference
to the dentry and vfsmount of a struct path in the right order.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use path_put() in a few places instead of {mnt,d}put()
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Add path_put() functions for releasing a reference to the dentry and
vfsmount of a struct path in the right order
* Switch from path_release(nd) to path_put(&nd->path)
* Rename dput_path() to path_put_conditional()
[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.
Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
<dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
struct path in every place where the stack can be traversed
- it reduces the overall code size:
without patch series:
text data bss dec hex filename
5321639 858418 715768 6895825 6938d1 vmlinux
with patch series:
text data bss dec hex filename
5320026 858418 715768 6894212 693284 vmlinux
This patch:
Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
path_release_on_umount() should only be called from sys_umount(). I merged the
function into sys_umount() instead of having in in namei.c.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>