Граф коммитов

8932 Коммитов

Автор SHA1 Сообщение Дата
Linus Torvalds 4f9faaace2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
  rose: Wrong list_lock argument in rose_node seqops
  netns: Fix reassembly timer to use the right namespace
  netns: Fix device renaming for sysfs
  bnx2: Update version to 1.7.5.
  bnx2: Update RV2P firmware for 5709.
  bnx2: Zero out context memory for 5709.
  bnx2: Fix register test on 5709.
  bnx2: Fix remote PHY initial link state.
  bnx2: Refine remote PHY locking.
  bridge: forwarding table information for >256 devices
  tg3: Update version to 3.92
  tg3: Add link state reporting to UMP firmware
  tg3: Fix ethtool loopback test for 5761 BX devices
  tg3: Fix 5761 NVRAM sizes
  tg3: Use constant 500KHz MI clock on adapters with a CPMU
  hci_usb.h: fix hard-to-trigger race
  dccp: ccid2.c, ccid3.c use clamp(), clamp_t()
  net: remove NR_CPUS arrays in net/core/dev.c
  net: use get/put_unaligned_* helpers
  bluetooth: use get/put_unaligned_* helpers
  ...
2008-05-03 10:18:21 -07:00
Linus Torvalds be2e88011b Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  ocfs2: Use GFP_NOFS in kmalloc during localalloc window move
  ocfs2: Allow uid/gid/perm changes of symlinks
  ocfs2/dlm: dlmdebug.c: make 2 functions static
  ocfs2: make struct o2cb_stack_ops static
  ocfs2: make struct ocfs2_control_device static
  ocfs2: Correct merge of 52f7c21 (Move /sys/o2cb to /sys/fs/o2cb)
2008-05-02 13:53:07 -07:00
Linus Torvalds b66e1f11eb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  [PATCH] fix sysctl_nr_open bugs
  [PATCH] sanitize anon_inode_getfd()
  [PATCH] split linux/file.h
  [PATCH] make osf_select() use core_sys_select()
  [PATCH] remove horrors with irix tty ioctls handling
  [PATCH] fix file and descriptor handling in perfmon
2008-05-02 11:23:14 -07:00
Denis V. Lunev 78e92b99ec netns: assign PDE->data before gluing entry into /proc tree
In this unfortunate case, proc_mkdir_mode wrapper can't be used anymore and
this is no way to reuse proc_create_data due to nlinks assignment. So,
copy the code from proc_mkdir and assign PDE->data at the appropriate
moment.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-02 04:12:41 -07:00
Linus Torvalds 2c4aabcca8 Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6:
  [MTD][NOR] Add physical address to point() method
  [JFFS2] Track parent inode for directories (for NFS export)
  [JFFS2] Invert last argument of jffs2_gc_fetch_inode(), make it boolean.
  [JFFS2] Quiet lockdep false positive.
  [JFFS2] Clean up jffs2_alloc_inode() and jffs2_i_init_once()
  [MTD] Delete long-unused jedec.h header file.
  [MTD] [NAND] at91_nand: use at91_nand_{en,dis}able consistently.
2008-05-01 11:15:28 -07:00
Jared Hulbert a98889f3d8 [MTD][NOR] Add physical address to point() method
Adding the ability to get a physical address from point() in addition
to virtual address.  This physical address is required for XIP of
userspace code from flash.

Signed-off-by: Jared Hulbert <jaredeh@gmail.com>
Reviewed-by: Jörn Engel <joern@logfs.org>
Acked-by: Nicolas Pitre <nico@cam.org>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-05-01 18:59:11 +01:00
David Woodhouse 27c72b040c [JFFS2] Track parent inode for directories (for NFS export)
To support NFS export, we need to know the parent inode of directories.
Rather than growing the jffs2_inode_cache structure, share space with
the nlink field -- which was always set to 1 for directories anyway.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-05-01 18:47:17 +01:00
Al Viro 5c598b3428 [PATCH] fix sysctl_nr_open bugs
* if luser with root sets it to something that is not a multiple of
  BITS_PER_LONG, the system is screwed.
* if it gets decreased at the wrong time, we can get expand_files()
  returning success and _not_ increasing the size of table as asked.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-01 13:08:57 -04:00
Al Viro 2030a42cec [PATCH] sanitize anon_inode_getfd()
a) none of the callers even looks at inode or file returned by anon_inode_getfd()
b) any caller that would try to look at those would be racy, since by the time
it returns we might have raced with close() from another thread and that
file would be pining for fjords.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-01 13:08:50 -04:00
Al Viro 9f3acc3140 [PATCH] split linux/file.h
Initial splitoff of the low-level stuff; taken to fdtable.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-01 13:08:16 -04:00
Al Viro a2dcb44c3c [PATCH] make osf_select() use core_sys_select()
... instead of open-coding it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-01 13:07:28 -04:00
David Woodhouse 1b690b4878 [JFFS2] Invert last argument of jffs2_gc_fetch_inode(), make it boolean.
We don't actually care about nlink; we only care whether the inode in
question is unlinked or not.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-05-01 17:24:28 +01:00
Harvey Harrison bd7309677c fuse: use clamp() rather than nested min/max
clamp() exists for this use.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:04:02 -07:00
Jan Blunck 868eb7a853 autofs: path_{get,put}() cleanups
Here are some more places where path_{get,put}() can be used instead of
dput()/mntput() pair.  Besides that it fixes a bug in autofs4_mount_busy()
where mntput() was called before dput().

Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:04:01 -07:00
Jeff Moyer 9d2de6ad2a autofs4: fix incorrect return from root.c:try_to_fill_dentry()
Jeff Moyer has identified a case where the autofs4 function
root.c:try_to_fill_dentry() can return -EBUSY when it should return 0.

Jeff's description of the way this happens is:

"automount starts an expire for directory d.  after the callout to the daemon,
but before the rmdir, another process tries to walk into the same directory.
It puts itself onto the waitq, pending the expiration.

When the expire finishes, the second process is woken up.  In
try_to_fill_dentry, it does this check:

                status = d_invalidate(dentry);
                if (status != -EBUSY)
                        return -EAGAIN;

And status is EBUSY.  The dentry still has a non-zero d_inode, and the
flags do not contain LOOKUP_CONTINUE or LOOKUP_DIRECTORY

So, we fall through and return -EBUSY to the caller."

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:04:01 -07:00
Jeff Moyer 033790449b autofs4: fix execution order race in mount request code
Jeff Moyer has identified a race in due to an execution order dependency
in the autofs4 function root.c:try_to_fill_dentry().

Jeff's description of this race is:

"P1 does a lookup of /mount/submount/foo.  Since the VFS can't find an entry
for "foo" under /mount/submount, it calls into the autofs4 kernel module to
allocate a new dentry, D1.  The kernel creates a new waitq for this lookup and
calls the daemon to perform the mount.

The daemon performs a mkdir of the "foo" directory under /mount/submount,
which ends up creating a *new* dentry, D2.

Then, P2 does a lookup of /mount/submount/foo.  The VFS path walking logic
finds a dentry in the dcache, D2, and calls the revalidate function with this.
 In the autofs4 revalidate code, we then trigger a mount, since the dentry is
an empty directory that isn't a mountpoint, and so set DCACHE_AUTOFS_PENDING
and call into the wait code to trigger the mount.

The wait code finds our existing waitq entry (since it is keyed off of the
directory name) and adds itself to the list of waiters.

After the daemon finishes the mount, it calls back into the kernel to release
the waiters.  When this happens, P1 is woken up and goes about clearing the
DCACHE_AUTOFS_PENDING flag, but it does this in D1!  So, given that P1 in our
case is a program that will immediately try to access a file under
/mount/submount/foo, we end up finding the dentry D2 which still has the
pending flag set, and we set out to wait for a mount *again*!

So, one way to address this is to re-do the lookup at the end of
try_to_fill_dentry, and to clear the pending flag on the hashed dentry.  This
seems a sane approach to me."

And Jeff's patch does this.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:04:01 -07:00
Ian Kent cab0936aac autofs4: check for invalid dentry in getpath
Catch invalid dentry when calculating its path.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:04:01 -07:00
Ian Kent afec570c32 autofs4: fix sparse warning in waitq.c:autofs4_expire_indirect()
Re-order some code in expire.c:autofs4_expire_indirect() to avoid compile
warning, reported by Harvey Harrison:

 CHECK   fs/autofs4/expire.c
fs/autofs4/expire.c:383:2: warning: context imbalance in
'autofs4_expire_indirect' - unexpected unlock

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:04:01 -07:00
Miklos Szeredi 02c6be615f vfs: fix permission checking in sys_utimensat
If utimensat() is called with both times set to UTIME_NOW or one of them to
UTIME_NOW and the other to UTIME_OMIT, then it will update the file time
without any permission checking.

I don't think this can be used for anything other than a local DoS, but could
be quite bewildering at that (e.g.  "Why was that large source tree rebuilt
when I didn't modify anything???")

This affects all kernels from 2.6.22, when the utimensat() syscall was
introduced.

Fix by doing the same permission checking as for the "times == NULL" case.

Thanks to Michael Kerrisk, whose utimensat-non-conformances-and-fixes.patch in
-mm also fixes this (and breaks other stuff), only he didn't realize the
security implications of this bug.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:03:59 -07:00
David Woodhouse 590fe34c47 [JFFS2] Quiet lockdep false positive.
Don't hold f->sem while calling into jffs2_do_create(). It makes lockdep
unhappy, and we don't really need it -- the _reason_ it's a false
positive is because nobody else can see this inode yet and so nobody
will be trying to lock it anyway.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-05-01 15:53:28 +01:00
David Woodhouse 4e571aba7b [JFFS2] Clean up jffs2_alloc_inode() and jffs2_i_init_once()
Ditch a couple of pointless casts from void *, and use the normal
variable name 'f' for jffs2_inode_info pointers -- especially since
it actually shows up in lockdep reports.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-05-01 12:29:37 +01:00
Al Viro 214b7049a7 Fix dnotify/close race
We have a race between fcntl() and close() that can lead to
dnotify_struct inserted into inode's list *after* the last descriptor
had been gone from current->files.

Since that's the only point where dnotify_struct gets evicted, we are
screwed - it will stick around indefinitely.  Even after struct file in
question is gone and freed.  Worse, we can trigger send_sigio() on it at
any later point, which allows to send an arbitrary signal to arbitrary
process if we manage to apply enough memory pressure to get the page
that used to host that struct file and fill it with the right pattern...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 20:09:00 -07:00
Sunil Mushran 4ba1c5bfd2 ocfs2: Use GFP_NOFS in kmalloc during localalloc window move
kmalloc() during a localalloc window move can trigger the mm to prune
the dcache which inturn can trigger the fs to delete an inode causing
it start a recursive transaction.

The fix also makes the change in kmalloc during localalloc shutdown
just to be safe.

Fixes oss bugzilla#901
http://oss.oracle.com/bugzilla/show_bug.cgi?id=901

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-30 17:09:58 -07:00
Sunil Mushran bc535809c0 ocfs2: Allow uid/gid/perm changes of symlinks
This patch adds the ability to change attributes of a symlink.
Fixes oss bugzilla#963
http://oss.oracle.com/bugzilla/show_bug.cgi?id=963

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-30 17:09:54 -07:00
Adrian Bunk 95642e5664 ocfs2/dlm: dlmdebug.c: make 2 functions static
This patch makes the following needlessly global functions static:
- stringify_lockname()
- dlm_debug_put()

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-30 17:09:40 -07:00
Adrian Bunk 4af694e672 ocfs2: make struct o2cb_stack_ops static
This patch makes the needlessly global struct o2cb_stack_ops static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-30 17:09:25 -07:00
Adrian Bunk 4d8755b5e6 ocfs2: make struct ocfs2_control_device static
This patch makes the needlessly global struct ocfs2_control_device
static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-30 17:09:08 -07:00
Joel Becker 9d80f7539a ocfs2: Correct merge of 52f7c21 (Move /sys/o2cb to /sys/fs/o2cb)
Commit 52f7c21b61 was intended to move
/sys/o2cb to /sys/fs/o2cb, providing /sys/o2cb as a symlink for
backwards compatibility.  However, the merge apparently added the
symlink but failed to move the directory, resulting in a duplicate
filename error.  It's a one-line change that was missing.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-30 17:07:59 -07:00
Robert P. J. Day 883ce42ec4 DEBUGFS: Correct location of debugfs API documentation.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-30 16:52:47 -07:00
Ben Hutchings 40a2159abf sysfs: Disallow truncation of files in sysfs
sysfs allows attribute files to be truncated, e.g. using ftruncate(), with the
expected effect on their inode.   For most attributes, this doesn't change the
"real" size of the file i.e. how much can be read from it.  However, the
parameter validation for reading and writing binary attribute files is based
on the inode size and not the size specified in the file's bin_attribute, so it
can be broken by this. For example, if we try using dd to write to such a file:

# pwd
/sys/bus/pci/devices/0000:08:00.0
# ls -l config
-rw-r--r--  1 root root 4096 Feb  1 17:35 config
# dd if=/dev/zero of=config bs=4 count=1
1+0 records in
1+0 records out
# ls -l config
-rw-r--r--  1 root root 0 Feb  1 17:50 config
# dd if=/dev/zero of=config bs=4 count=1 seek=128
dd: writing `config': No space left on device
1+0 records in
0+0 records out

Also, after truncation to 0, parameter validation for read and write is
disabled.  Most bin_attribute read and write methods also validate the size and
offset, but for some this will allow out-of-range access.  This may be a
security issue, though access to such files is often limited to root.  In any
case, the validation should remain for safety's sake!)

This was previously reported in Bugzilla as bug 9867.

sysfs should ignore size changes or else refuse them (by returning -EINVAL).
This patch makes it ignore them.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-30 16:52:46 -07:00
Linus Torvalds d67c6f869c Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
  [S390] Update default configuration.
  [S390] use generic sys_ptrace
  [S390] Remove self ptrace IEEE_IP hack.
  [S390] Convert to SPARSEMEM & SPARSEMEM_VMEMMAP
  [S390] System z large page support.
  [S390] Convert machine feature detection code to C.
  [S390] vmemmap: use clear_table to initialise page tables.
  [S390] Move stfl to system.h and delete duplicated version.
  [S390] uaccess_mvcos: #ifdef config dependent code.
  [S390] cpu topology: Fix possible deadlock.
  [S390] Add topology_core_siblings to topology.h
  [S390] cio: Make isc handling more robust.
  [S390] remove -traditional
  [S390] Automatically detect added cpus.
  [S390] smp: Fix locking order.
  [S390] Add missing ifndef/define to include/asm-s390/sysinfo.h.
  [S390] Move show_regs to traps.c.
  [S390] cio: Use strict_strtoul() for attributes.
2008-04-30 08:38:30 -07:00
Harvey Harrison 8e24eea728 fs: replace remaining __FUNCTION__ occurrences
__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:54 -07:00
Harvey Harrison 530b641278 afs: replace remaining __FUNCTION__ occurrences
__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:54 -07:00
Thomas Gleixner c6f3a97f86 debugobjects: add timer specific object debugging code
Add calls to the generic object debugging infrastructure and provide fixup
functions which allow to keep the system alive when recoverable problems have
been detected by the object debugging core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:53 -07:00
Andrew Morton 487798df6d hfsplus: fix warning with 64k PAGE_SIZE
fs/hfsplus/btree.c: In function 'hfsplus_bmap_alloc':
fs/hfsplus/btree.c:239: warning: comparison is always false due to limited range of data type

But this might hide a real bug?

Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:52 -07:00
Andrew Morton 3e5a509730 hfs: fix warning with 64k PAGE_SIZE
fs/hfs/btree.c: In function 'hfs_bmap_alloc':
fs/hfs/btree.c:263: warning: comparison is always false due to limited range of data type

The patch makes the warning go away, but the code might actually be buggy?

Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:52 -07:00
Marcin Slusarz 07132922aa sysv: [bl]e*_add_cpu conversion
replace all:
big/little_endian_variable = cpu_to_[bl]eX([bl]eX_to_cpu(big/little_endian_variable) +
					expression_in_cpu_byteorder);
with:
	[bl]eX_add_cpu(&big/little_endian_variable, expression_in_cpu_byteorder);
generated with semantic patch

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:52 -07:00
Marcin Slusarz e3592b12f5 quota: le*_add_cpu conversion
replace all:
little_endian_variable = cpu_to_leX(leX_to_cpu(little_endian_variable) +
					expression_in_cpu_byteorder);
with:
	leX_add_cpu(&little_endian_variable, expression_in_cpu_byteorder);
generated with semantic patch

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:51 -07:00
Marcin Slusarz 20c79e785a hfs/hfsplus: be*_add_cpu conversion
replace all:
big_endian_variable = cpu_to_beX(beX_to_cpu(big_endian_variable) +
					expression_in_cpu_byteorder);
with:
	beX_add_cpu(&big_endian_variable, expression_in_cpu_byteorder);
generated with semantic patch

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:51 -07:00
Marcin Slusarz 6369a4abb4 affs: be*_add_cpu conversion
replace all:
big_endian_variable = cpu_to_beX(beX_to_cpu(big_endian_variable) +
					expression_in_cpu_byteorder);
with:
	beX_add_cpu(&big_endian_variable, expression_in_cpu_byteorder);
generated with semantic patch

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:51 -07:00
Christoph Hellwig 86098fa011 reiserfs: use open_bdev_excl
Use the proper helper to open a blockdevice by name for filesystem use,
this makes sure it's properly claimed (also added for open-by-number) and
gets rid of the struct file abuse.

Tested by mounting a reiserfs filesystem with external journal.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Acked-by: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:51 -07:00
Miklos Szeredi 4dbf930ed6 fuse: fix sparse warnings
fs/fuse/dev.c:306:2: warning: context imbalance in 'wait_answer_interruptible' - unexpected unlock
fs/fuse/dev.c:361:2: warning: context imbalance in 'request_wait_answer' - unexpected unlock
fs/fuse/dev.c:1002:4: warning: context imbalance in 'end_io_requests' - unexpected unlock

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:51 -07:00
Miklos Szeredi 5559b8f4d1 fuse: fix race in llseek
Fuse doesn't use i_mutex to protect setting i_size, and so
generic_file_llseek() can be racy: it doesn't use i_size_read().

So do a fuse specific llseek method, which does use i_size_read().

[akpm@linux-foundation.org: make `retval' loff_t]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:51 -07:00
Miklos Szeredi b48badf013 fuse: fix node ID type
Node ID is 64bit but it is passed as unsigned long to some functions.  This
breakage wasn't noticed, because libfuse uses unsigned long too.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:51 -07:00
Miklos Szeredi e5d9a0df07 fuse: fix max i/o size calculation
Fix a bug that Werner Baumann reported: fuse can send a bigger write request
than the maximum specified.  This only affected direct_io operation.

In addition set a sane minimum for the max_read and max_write tunables, so I/O
always makes some progress.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:51 -07:00
Miklos Szeredi 5c5c5e51b2 fuse: update file size on short read
If the READ request returned a short count, then either

  - cached size is incorrect
  - filesystem is buggy, as short reads are only allowed on EOF

So assume that the size is wrong and refresh it, so that cached read() doesn't
zero fill the missing chunk.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:50 -07:00
Nick Piggin ea9b9907b8 fuse: implement perform_write
Introduce fuse_perform_write.  With fusexmp (a passthrough filesystem), large
(1MB) writes into a backing tmpfs filesystem are sped up by almost 4 times
(256MB/s vs 71MB/s).

[mszeredi@suse.cz]:

 - split into smaller functions
 - testing
 - duplicate generic_file_aio_write(), so that there's no need to add a
   new ->perform_write() a_op.  Comment from hch.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:50 -07:00
Miklos Szeredi 854512ec35 fuse: clean up setting i_size in write
Extract common code for setting i_size in write functions into a common
helper.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:50 -07:00
Miklos Szeredi 3be5a52b30 fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):

  "User-space filesystems are hard to get right. I'd claim that they
   are almost impossible, unless you limit them somehow (shared
   writable mappings are the nastiest part - if you don't have those,
   you can reasonably limit your problems by limiting the number of
   dirty pages you accept through normal "write()" calls)."

Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others).  This nicely solved the biggest problem: limiting the number of pages
used for write caching.

Some small details remained, however, which this largish patch attempts to
address.  It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks.  Performance may not be very
good for certain usage patterns, but generally it should be acceptable.

It has been tested extensively with fsx-linux and bash-shared-mapping.

Fuse page writeback design
--------------------------

fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.

The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.

For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented.  The per-bdi writeback count is not decremented until the actual
write completes.

On dirtying the page, fuse waits for a previous write to finish before
proceeding.  This makes sure, there can only be one temporary page used at a
time for one cached page.

This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?

The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write.  It may be buggy or even
malicious, and fail to complete WRITE requests.  We don't want unrelated parts
of the system to grind to a halt in such cases.

Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request.  There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.

Currently there are several cases where the kernel can block on page
writeback:

  - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
  - page migration
  - throttle_vm_writeout (through NR_WRITEBACK)
  - sync(2)

Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.

As an extra safetly measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default.  This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.

With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:50 -07:00
Miklos Szeredi fc3ba692a4 mm: Add NR_WRITEBACK_TEMP counter
Fuse will use temporary buffers to write back dirty data from memory mappings
(normal writes are done synchronously).  This is needed, because there cannot
be any guarantee about the time in which a write will complete.

By using temporary buffers, from the MM's point if view the page is written
back immediately.  If the writeout was due to memory pressure, this
effectively migrates data from a full zone to a less full zone.

This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
as temporary buffers.

[Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:50 -07:00