'write_op' was still used, even though it was always WRITE_SYNC now.
Add plugging around the cases where it submits IO, and flush them
before we end up waiting for that IO.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
'write_op' was still used, even though it was always WRITE_SYNC now.
Add plugging around the cases where it submits IO, and flush them
before we end up waiting for that IO.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
It used WRITE_SYNC_PLUG before and potentially submits a batch
of IO, so lets enable plugging for this case.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Not all block drivers clear events immediately after reporting. Some
do so in ->revalidate_disk() or other steps during ->open(). There is
a slim chance event poll may happen between the clearing event check
from check_disk_change() and the actual clearing of the events which
would result in spurious events.
Block event checks while block device open is in progress. There is
no need to kick explicit event check afterwards as events are always
checked during open.
-v2: The original patch could have called disk_unblock_events() with
an already released or %NULL @disk causing oops. Fixed by making
sure references are put after disk_unblock_events() is called.
It also makes the error path of __blkdev_get() a bit simpler.
This problem was reported by Jens.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
The block event mechanism currently always checks events when the
device is being closed regardless of the open mode. The intention was
to allow detection of EJECT_REQUEST when a device is closed whether
disk event polling is enabled or not.
This is unnecessary as, for devices of interest, events are checked
from either userland or kernel and in the former case ->check_events()
is performed on open of each poll attempt anyway. Furthermore, this
unconditional event check on close makes the code susceptible to event
loop if the block driver doesn't clear reported events correctly - an
event triggers userland to open and close the device which in turn
causes another event, rinse and repeat.
Check events on close only if it was blocked by excl write open.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Currently, disk_unblock_events() implicitly kick event check if the
block count reaches zero. This behavior is not described in the
comment and hinders with future changes. Make the unblocker
explicitly check events by calling disk_check_events() as necessary.
This patch doesn't cause any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
The block integrity subsystem no longer uses the bio_vec slabs so this
code can safely be compiled in.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Fix new kernel-doc warning in fs/block_dev.c:
Warning(fs/block_dev.c:937): No description found for parameter 'kill_dirty'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix truncate after open
fuse: fix hang of single threaded fuseblk filesystem
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Check heartbeat mode for kernel stacks only
Ocfs2/refcounttree: Fix a bug for refcounttree to writeback clusters in a right number.
ocfs2: Fix estimate of necessary credits for mkdir
A race can occur when io_submit() races with io_destroy():
CPU1 CPU2
io_submit()
do_io_submit()
...
ctx = lookup_ioctx(ctx_id);
io_destroy()
Now do_io_submit() holds the last reference to ctx.
...
queue new AIO
put_ioctx(ctx) - frees ctx with active AIOs
We solve this issue by checking whether ctx is being destroyed in AIO
submission path after adding new AIO to ctx. Then we are guaranteed that
either io_destroy() waits for new AIO or we see that ctx is being
destroyed and bail out.
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
lookup_ioctx doesn't implement the rcu lookup pattern properly.
rcu_read_lock does not prevent refcount going to zero, so we might take
a refcount on a zero count ioctx.
Fix the bug by atomically testing for zero refcount before incrementing.
[jack@suse.cz: added comment into the code]
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel automatically evaluates partition tables of storage devices.
The code for evaluating LDM partitions (in fs/partitions/ldm.c) contains
a bug that causes a kernel oops on certain corrupted LDM partitions. A
kernel subsystem seems to crash, because, after the oops, the kernel no
longer recognizes newly connected storage devices.
The patch changes ldm_parse_vmdb() to Validate the value of vblk_size.
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Acked-by: Richard Russon <ldm@flatcap.org>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In several places, an epoll fd can call another file's ->f_op->poll()
method with ep->mtx held. This is in general unsafe, because that other
file could itself be an epoll fd that contains the original epoll fd.
The code defends against this possibility in its own ->poll() method using
ep_call_nested, but there are several other unsafe calls to ->poll
elsewhere that can be made to deadlock. For example, the following simple
program causes the call in ep_insert recursively call the original fd's
->poll, leading to deadlock:
#include <unistd.h>
#include <sys/epoll.h>
int main(void) {
int e1, e2, p[2];
struct epoll_event evt = {
.events = EPOLLIN
};
e1 = epoll_create(1);
e2 = epoll_create(2);
pipe(p);
epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
write(p[1], p, sizeof p);
epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
return 0;
}
On insertion, check whether the inserted file is itself a struct epoll,
and if so, do a recursive walk to detect whether inserting this file would
create a loop of epoll structures, which could lead to deadlock.
[nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Reported-by: Nelson Elhage <nelhage@ksplice.com>
Tested-by: Nelson Elhage <nelhage@ksplice.com>
Cc: <stable@kernel.org> [2.6.34+, possibly earlier]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix fiemap bugs with delalloc
Btrfs: set FMODE_EXCL in btrfs_device->mode
Btrfs: make btrfs_rm_device() fail gracefully
Btrfs: Avoid accessing unmapped kernel address
Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
Btrfs: allow balance to explicitly allocate chunks as it relocates
Btrfs: put ENOSPC debugging under a mount option
* 'for-linus' of git://neil.brown.name/md:
md: Fix - again - partition detection when array becomes active
Fix over-zealous flush_disk when changing device size.
md: avoid spinlock problem in blk_throtl_exit
md: correctly handle probe of an 'mdp' device.
md: don't set_capacity before array is active.
md: Fix raid1->raid0 takeover
I'm seeing the following oops when testing afs:
Unable to handle kernel paging request for data at address 0x00000008
...
NIP [c0000000003393b0] .afs_unlink_writeback+0x38/0xc0
LR [c00000000033987c] .afs_put_writeback+0x98/0xec
Call Trace:
[c00000000345f600] [c00000000033987c] .afs_put_writeback+0x98/0xec
[c00000000345f690] [c00000000033ae80] .afs_write_begin+0x6a4/0x75c
[c00000000345f790] [c00000000012b77c] .generic_file_buffered_write+0x148/0x320
[c00000000345f8d0] [c00000000012e1b8] .__generic_file_aio_write+0x37c/0x3e4
[c00000000345f9d0] [c00000000012e2a8] .generic_file_aio_write+0x88/0xfc
[c00000000345fa90] [c0000000003390a8] .afs_file_write+0x10c/0x178
[c00000000345fb40] [c000000000188788] .do_sync_write+0xc4/0x128
[c00000000345fcc0] [c000000000189658] .vfs_write+0xe8/0x1d8
[c00000000345fd70] [c000000000189884] .SyS_write+0x68/0xb0
[c00000000345fe30] [c000000000008564] syscall_exit+0x0/0x40
afs_write_begin hits an error and calls afs_unlink_writeback. In there
we do list_del_init on an uninitialised list.
The patch below initialises ->link when creating the afs_writeback struct.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e1181ee6 "vfs: pass struct file to do_truncate on O_TRUNC
opens" broke the behavior of open(O_TRUNC|O_RDONLY) in fuse. Fuse
assumed that when called from open, a truncate() will be done, not an
ftruncate().
Fix by restoring the old behavior, based on the ATTR_OPEN flag.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Single threaded NTFS-3G could get stuck if a delayed RELEASE reply
triggered a DESTROY request via path_put().
Fix this by
a) making RELEASE requests synchronous, whenever possible, on fuseblk
filesystems
b) if not possible (triggered by an asynchronous read/write) then do
the path_put() in a separate thread with schedule_work().
Reported-by: Oliver Neukum <oneukum@suse.de>
Cc: stable@kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The new implementation of bd_link_disk_holder() added by 49731baa41
(block: restore multiple bd_link_disk_holder() support) didn't get an
extra reference for the holder_dir kobject of the slave bdev; however,
bdev kills holder_dir on removal, not release, so if the slave bdev is
removed while there are holder links, the holder_dir will be destroyed
while there still are holder links, which leads to oops later when
bd_unlink_disk_order() tries to remove those links.
Make bd_link_disk_holder() grab an extra reference for the slave's
holder_dir and put it in bd_unlink_disk_holder().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
Tested-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By the commit
b3e19d9 2011-01-07 fs: scale mntget/mntput
vfsmount_lock was introduced around testing mnt_count.
Fix the mis-typed 'unlock'
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are two cases when we call flush_disk.
In one, the device has disappeared (check_disk_change) so any
data will hold becomes irrelevant.
In the oter, the device has changed size (check_disk_size_change)
so data we hold may be irrelevant.
In both cases it makes sense to discard any 'clean' buffers,
so they will be read back from the device if needed.
In the former case it makes sense to discard 'dirty' buffers
as there will never be anywhere safe to write the data. In the
second case it *does*not* make sense to discard dirty buffers
as that will lead to file system corruption when you simply enlarge
the containing devices.
flush_disk calls __invalidate_devices.
__invalidate_device calls both invalidate_inodes and invalidate_bdev.
invalidate_inodes *does* discard I_DIRTY inodes and this does lead
to fs corruption.
invalidate_bev *does*not* discard dirty pages, but I don't really care
about that at present.
So this patch adds a flag to __invalidate_device (calling it
__invalidate_device2) to indicate whether dirty buffers should be
killed, and this is passed to invalidate_inodes which can choose to
skip dirty inodes.
flusk_disk then passes true from check_disk_change and false from
check_disk_size_change.
dm avoids tripping over this problem by calling i_size_write directly
rathher than using check_disk_size_change.
md does use check_disk_size_change and so is affected.
This regression was introduced by commit 608aeef17a which causes
check_disk_size_change to call flush_disk, so it is suitable for any
kernel since 2.6.27.
Cc: stable@kernel.org
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Andrew Patterson <andrew.patterson@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
Michael Leun reported that running parallel opens on a fuse filesystem
can trigger a "kernel BUG at mm/truncate.c:475"
Gurudas Pai reported the same bug on NFS.
The reason is, unmap_mapping_range() is not prepared for more than
one concurrent invocation per inode. For example:
thread1: going through a big range, stops in the middle of a vma and
stores the restart address in vm_truncate_count.
thread2: comes in with a small (e.g. single page) unmap request on
the same vma, somewhere before restart_address, finds that the
vma was already unmapped up to the restart address and happily
returns without doing anything.
Another scenario would be two big unmap requests, both having to
restart the unmapping and each one setting vm_truncate_count to its
own value. This could go on forever without any of them being able to
finish.
Truncate and hole punching already serialize with i_mutex. Other
callers of unmap_mapping_range() do not, and it's difficult to get
i_mutex protection for all callers. In particular ->d_revalidate(),
which calls invalidate_inode_pages2_range() in fuse, may be called
with or without i_mutex.
This patch adds a new mutex to 'struct address_space' to prevent
running multiple concurrent unmap_mapping_range() on the same mapping.
[ We'll hopefully get rid of all this with the upcoming mm
preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
lockbreak" patch in particular. But that is for 2.6.39 ]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Michael Leun <lkml20101129@newton.leun.net>
Reported-by: Gurudas Pai <gurudas.pai@oracle.com>
Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The Btrfs fiemap code wasn't properly returning delalloc extents,
so applications that trust fiemap to decide if there are holes in the
file see holes instead of delalloc.
This reworks the btrfs fiemap code, adding a get_extent helper that
searches for delalloc ranges and also adding a helper for extent_fiemap
that skips past holes in the file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Right now we, are relying on the fact that when we attempt to
actually do the discard, blkdev_issue_discar() returns -EOPNOTSUPP
and the user is informed that the device does not support discard.
However, in the case where the we do not hit any suitable free
extent to trim in FITRIM code, it will finish without any error.
This is very confusing, because it seems that FITRIM was successful
even though the device does not actually supports discard.
Solution: Check for the discard support before attempt to search for
free extents.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The FSGEOMETRY_V1 ioctl (and its compat equivalent) calls out to
xfs_fs_geometry() with a version number of 3. This code path does not
fill in the logsunit member of the passed xfs_fsop_geom_t, leading to
the leaking of four bytes of uninitialized stack data to potentially
unprivileged callers.
v2 switches to memset() to avoid future issues if structure members
change, on suggestion of Dave Chinner.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Reviewed-by: Eugene Teo <eugeneteo@kernel.org>
Signed-off-by: Alex Elder <aelder@sgi.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
eCryptfs: Copy up lower inode attrs in getattr
ecryptfs: read on a directory should return EISDIR if not supported
eCryptfs: Handle NULL nameidata pointers
eCryptfs: Revert "dont call lookup_one_len to avoid NULL nameidata"
LANMAN response length was changed to 16 bytes instead of 24 bytes.
Revert it back to 24 bytes.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
The lower filesystem may do some type of inode revalidation during a
getattr call. eCryptfs should take advantage of that by copying the
lower inode attributes to the eCryptfs inode after a call to
vfs_getattr() on the lower inode.
I originally wrote this fix while working on eCryptfs on nfsv3 support,
but discovered it also fixed an eCryptfs on ext4 nanosecond timestamp
bug that was reported.
https://bugs.launchpad.net/bugs/613873
Cc: <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
read() calls against a file descriptor connected to a directory are
incorrectly returning EINVAL rather than EISDIR:
[EISDIR]
[XSI] [Option Start] The fildes argument refers to a directory and the
implementation does not allow the directory to be read using read()
or pread(). The readdir() function should be used instead. [Option End]
This occurs because we do not have a .read operation defined for
ecryptfs directories. Connect this up to generic_read_dir().
BugLink: http://bugs.launchpad.net/bugs/719691
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Commit 2c442719e9 added some checks for proper
heartbeat mode when the o2cb stack is running. Unfortunately, it didn't
take into account that a userpsace stack could be running. Fix this by only
doing the check if o2cb is in use. This patch allows userspace stacks to
mount the fs again.
Cc: stable@kernel.org
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Current refcounttree codes actually didn't writeback the new pages out in
write-back mode, due to a bug of always passing a ZERO number of clusters
to 'ocfs2_cow_sync_writeback', the patch tries to pass a proper one in.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <jlbec@evilplan.org>
In the rare case that INLINE_DATA, INDEX_DIR, QUOTA, XATTR features are
disabled and both the allocation of the directory inode and the allocation
of the first directory block need to relink allocation group, there need
not be enough credits reserved in a transaction. Fix the estimate.
CC: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
When creating a new dentry we now hold a reference to the parent
inode in the ceph_dentry. This is required due to the new RCU
changes from 949854d0, which set dentry->d_parent to NULL in d_kill before
calling the ->release() callback. If/when that behavior is changed, we can
revert this hack.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
* 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: make sure MAYDAY_INITIAL_TIMEOUT is at least 2 jiffies long
workqueue, freezer: unify spelling of 'freeze' + 'able' to 'freezable'
workqueue: wake up a worker when a rescuer is leaving a gcwq
This reverts commit 21edad3220 and commit
93c3fe40c2, which fixed a regression by
the former.
Al Viro pointed out bypassed dcache lookups in
ecryptfs_new_lower_dentry(), misuse of vfs_path_lookup() in
ecryptfs_lookup_one_lower() and a dislike of passing nameidata to the
lower filesystem.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Validate number of blocks in map and remove redundant variable.
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code finds, the '%' sign in an ipv6 address and copies that to a
buffer allocated on the stack. It then ignores that buffer, and passes
'pct' to simple_strtoul(), which doesn't work right because we're
comparing 'endp' against a completely different string.
Fix it by passing the correct pointer. While we're at it, this is a
good candidate for conversion to strict_strtoul as well.
Cc: stable@kernel.org
Cc: David Howells <dhowells@redhat.com>
Reported-by: Björn JACKE <bj@sernet.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>