Граф коммитов

13218 Коммитов

Автор SHA1 Сообщение Дата
Trond Myklebust b1e4adf4ea NFS: Fix the notifications when renaming onto an existing file
NFS appears to be returning an unnecessary "delete" notification when
we're doing an atomic rename. See

  http://bugzilla.gnome.org/show_bug.cgi?id=575684

The fix is to get rid of the redundant call to d_delete().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-19 15:35:49 -04:00
Trond Myklebust 47c6256420 NFS: Fix up a mismerged patch
Move the definition of nfs_need_commit() into the #ifdef CONFIG_NFS_V3
section as originally intended in the patch "NFS: cleanup - remove
struct nfs_inode->ncommit"

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-19 15:17:40 -04:00
Linus Torvalds a8e7d49aa7 Fix race in create_empty_buffers() vs __set_page_dirty_buffers()
Nick Piggin noticed this (very unlikely) race between setting a page
dirty and creating the buffers for it - we need to hold the mapping
private_lock until we've set the page dirty bit in order to make sure
that create_empty_buffers() might not build up a set of buffers without
the dirty bits set when the page is dirty.

I doubt anybody has ever hit this race (and it didn't solve the issue
Nick was looking at), but as Nick says: "Still, it does appear to solve
a real race, which we should close."

Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-19 11:32:05 -07:00
Linus Torvalds 85bff8857c Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.29' of git://linux-nfs.org/~bfields/linux:
  nfsd: nfsd should drop CAP_MKNOD for non-root
  NFSD: provide encode routine for OP_OPENATTR
2009-03-18 09:27:20 -07:00
Steve French b363b3304b [CIFS] Fix memory overwrite when saving nativeFileSystem field during mount
CIFS can allocate a few bytes to little for the nativeFileSystem field
during tree connect response processing during mount.  This can result
in a "Redzone overwritten" message to be logged.

Signed-off-by: Sridhar Vinay <vinaysridhar@in.ibm.com>
Acked-by: Shirish Pargaonkar <shirishp@us.ibm.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-18 05:57:22 +00:00
Steve French c6c00919ab [CIFS] Rename compose_mount_options to cifs_compose_mount_options.
Make it available to others for reuse.

Signed-off-by: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-18 05:50:07 +00:00
Linus Torvalds 58cefd2b1e Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix bb_prealloc_list corruption due to wrong group locking
  ext4: fix bogus BUG_ONs in in mballoc code
  ext4: Print the find_group_flex() warning only once
  ext4: fix header check in ext4_ext_search_right() for deep extent trees.
2009-03-17 20:55:40 -07:00
Benny Halevy 84f09f46b4 NFSD: provide encode routine for OP_OPENATTR
Although this operation is unsupported by our implementation
we still need to provide an encode routine for it to
merely encode its (error) status back in the compound reply.

Thanks for Bill Baker at sun.com for testing with the Sun
OpenSolaris' client, finding, and reporting this bug at
Connectathon 2009.

This bug was introduced in 2.6.27

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-17 14:54:45 -04:00
Linus Torvalds ee568b25ee Avoid 64-bit "switch()" statements on 32-bit architectures
Commit ee6f779b9e ("filp->f_pos not
correctly updated in proc_task_readdir") changed the proc code to use
filp->f_pos directly, rather than through a temporary variable.  In the
process, that caused the operations to be done on the full 64 bits, even
though the offset is never that big.

That's all fine and dandy per se, but for some unfathomable reason gcc
generates absolutely horrid code when using 64-bit values in switch()
statements.  To the point of actually calling out to gcc helper
functions like __cmpdi2 rather than just doing the trivial comparisons
directly the way gcc does for normal compares.  At which point we get
link failures, because we really don't want to support that kind of
crazy code.

Fix this by just casting the f_pos value to "unsigned long", which
is plenty big enough for /proc, and avoids the gcc code generation issue.

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Zhang Le <r0bertz@gentoo.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-17 10:02:35 -07:00
Eric Sandeen d33a1976fb ext4: fix bb_prealloc_list corruption due to wrong group locking
This is for Red Hat bug 490026: EXT4 panic, list corruption in
ext4_mb_new_inode_pa

ext4_lock_group(sb, group) is supposed to protect this list for
each group, and a common code flow to remove an album is like
this:

    ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
    ext4_lock_group(sb, grp);
    list_del(&pa->pa_group_list);
    ext4_unlock_group(sb, grp);

so it's critical that we get the right group number back for
this prealloc context, to lock the right group (the one 
associated with this pa) and prevent concurrent list manipulation.

however, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a 
comment, "-1 is to protect from crossing allocation group".

This makes sense for the group_pa, where pa_pstart is advanced
by the length which has been used (in ext4_mb_release_context()),
and when the entire length has been used, pa_pstart has been
advanced to the first block of the next group.

However, for inode_pa, pa_pstart is never advanced; it's just
set once to the first block in the group and not moved after
that.  So in this case, if we subtract one in ext4_mb_put_pa(),
we are actually locking the *previous* group, and opening the
race with the other threads which do not subtract off the extra
block.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-16 23:25:40 -04:00
Theodore Ts'o afd4672dc7 ext4: Add auto_da_alloc mount option
Add a mount option which allows the user to disable automatic
allocation of blocks whose allocation by delayed allocation when the
file was originally truncated or when the file is renamed over an
existing file.  This feature is intended to save users from the
effects of naive application writers, but it reduces the effectiveness
of the delayed allocation code.  This mount option disables this
safety feature, which may be desirable for prodcutions systems where
the risk of unclean shutdowns or unexpected system crashes is low.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-16 23:12:23 -04:00
Zhang Le ee6f779b9e filp->f_pos not correctly updated in proc_task_readdir
filp->f_pos only get updated at the end of the function. Thus d_off of those
dirents who are in the middle will be 0, and this will cause a problem in
glibc's readdir implementation, specifically endless loop. Because when overflow
occurs, f_pos will be set to next dirent to read, however it will be 0, unless
the next one is the last one. So it will start over again and again.

There is a sample program in man 2 gendents. This is the output of the program
running on a multithread program's task dir before this patch is applied:

  $ ./a.out /proc/3807/task
  --------------- nread=128 ---------------
  i-node#  file type  d_reclen  d_off   d_name
    506442  directory    16          1  .
    506441  directory    16          0  ..
    506443  directory    16          0  3807
    506444  directory    16          0  3809
    506445  directory    16          0  3812
    506446  directory    16          0  3861
    506447  directory    16          0  3862
    506448  directory    16          8  3863

This is the output after this patch is applied

  $ ./a.out /proc/3807/task
  --------------- nread=128 ---------------
  i-node#  file type  d_reclen  d_off   d_name
    506442  directory    16          1  .
    506441  directory    16          2  ..
    506443  directory    16          3  3807
    506444  directory    16          4  3809
    506445  directory    16          5  3812
    506446  directory    16          6  3861
    506447  directory    16          7  3862
    506448  directory    16          8  3863

Signed-off-by: Zhang Le <r0bertz@gentoo.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-16 07:51:33 -07:00
Jonathan Corbet 60aa49243d Rationalize fasync return values
Most fasync implementations do something like:

     return fasync_helper(...);

But fasync_helper() will return a positive value at times - a feature used
in at least one place.  Thus, a number of other drivers do:

     err = fasync_helper(...);
     if (err < 0)
             return err;
     return 0;

In the interests of consistency and more concise code, it makes sense to
map positive return values onto zero where ->fasync() is called.

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2009-03-16 08:34:35 -06:00
Jonathan Corbet 76398425bb Move FASYNC bit handling to f_op->fasync()
Removing the BKL from FASYNC handling ran into the challenge of keeping the
setting of the FASYNC bit in filp->f_flags atomic with regard to calls to
the underlying fasync() function.  Andi Kleen suggested moving the handling
of that bit into fasync(); this patch does exactly that.  As a result, we
have a couple of internal API changes: fasync() must now manage the FASYNC
bit, and it will be called without the BKL held.

As it happens, every fasync() implementation in the kernel with one
exception calls fasync_helper().  So, if we make fasync_helper() set the
FASYNC bit, we can avoid making any changes to the other fasync()
functions - as long as those functions, themselves, have proper locking.
Most fasync() implementations do nothing but call fasync_helper() - which
has its own lock - so they are easily verified as correct.  The BKL had
already been pushed down into the rest.

The networking code has its own version of fasync_helper(), so that code
has been augmented with explicit FASYNC bit handling.

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Miller <davem@davemloft.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2009-03-16 08:32:27 -06:00
Jonathan Corbet db1dd4d376 Use f_lock to protect f_flags
Traditionally, changes to struct file->f_flags have been done under BKL
protection, or with no protection at all.  This patch causes all f_flags
changes after file open/creation time to be done under protection of
f_lock.  This allows the removal of some BKL usage and fixes a number of
longstanding (if microscopic) races.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2009-03-16 08:32:27 -06:00
Jonathan Corbet 6849991490 Rename struct file->f_ep_lock
This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock.  For now,
epoll remains the only user, but a future patch will use it to protect
f_flags as well.

Cc: Davide Libenzi <davidel@xmailserver.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2009-03-16 08:32:27 -06:00
Linus Torvalds 18553c38bc Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  Fix Xilinx SystemACE driver to handle empty CF slot
  block: fix memory leak in bio_clone()
  block: Add gfp_mask parameter to bio_integrity_clone()
2009-03-14 13:43:18 -07:00
Li Zefan 059ea3318c block: fix memory leak in bio_clone()
If bio_integrity_clone() fails, bio_clone() returns NULL without freeing
the newly allocated bio.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-03-14 21:06:52 +01:00
un'ichi Nomura 87092698c6 block: Add gfp_mask parameter to bio_integrity_clone()
Stricter gfp_mask might be required for clone allocation.
For example, request-based dm may clone bio in interrupt context
so it has to use GFP_ATOMIC.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-03-14 21:06:51 +01:00
Linus Torvalds f1823acfbc Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Fix the fix to Bugzilla #11061, when IPv6 isn't defined...
  SUNRPC: xprt_connect() don't abort the task if the transport isn't bound
  SUNRPC: Fix an Oops due to socket not set up yet...
  Bug 11061, NFS mounts dropped
  NFS: Handle -ESTALE error in access()
  NLM: Fix GRANT callback address comparison when IPv6 is enabled
  NLM: Shrink the IPv4-only version of nlm_cmp_addr()
  NFSv3: Fix posix ACL code
  NFS: Fix misparsing of nfsv4 fs_locations attribute (take 2)
  SUNRPC: Tighten up the task locking rules in __rpc_execute()
2009-03-14 12:00:18 -07:00
Linus Torvalds ff9cb43ce0 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  ocfs2: Use xs->bucket to set xattr value outside
  ocfs2: Fix a bug found by sparse check.
  ocfs2: tweak to get the maximum inline data size with xattr
  ocfs2: reserve xattr block for new directory with inline data
2009-03-14 11:59:22 -07:00
Tyler Hicks 84814d642a eCryptfs: don't encrypt file key with filename key
eCryptfs has file encryption keys (FEK), file encryption key encryption
keys (FEKEK), and filename encryption keys (FNEK).  The per-file FEK is
encrypted with one or more FEKEKs and stored in the header of the
encrypted file.  I noticed that the FEK is also being encrypted by the
FNEK.  This is a problem if a user wants to use a different FNEK than
their FEKEK, as their file contents will still be accessible with the
FNEK.

This is a minimalistic patch which prevents the FNEKs signatures from
being copied to the inode signatures list.  Ultimately, it keeps the FEK
from being encrypted with a FNEK.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Acked-by: Dustin Kirkland <kirkland@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-14 11:57:22 -07:00
Johannes Weiner 15e7b87676 nommu: ramfs: don't leak pages when adding to page cache fails
When a ramfs nommu mapping is expanded, contiguous pages are allocated
and added to the pagecache.  The caller's reference is then passed on
by moving whole pagevecs to the file lru list.

If the page cache adding fails, make sure that the error path also
moves the pagevec contents which might still contain up to PAGEVEC_SIZE
successfully added pages, of which we would leak references otherwise.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Enrik Berkhan <Enrik.Berkhan@ge.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-14 11:57:22 -07:00
Enrik Berkhan 020fe22ff1 nommu: ramfs: pages allocated to an inode's pagecache may get wrongly discarded
The pages attached to a ramfs inode's pagecache by truncation from nothing
- as done by SYSV SHM for example - may get discarded under memory
pressure.

The problem is that the pages are not marked dirty.  Anything that creates
data in an MMU-based ramfs will cause the pages holding that data will
cause the set_page_dirty() aop to be called.

For the NOMMU-based mmap, set_page_dirty() may be called by write(), but
it won't be called by page-writing faults on writable mmaps, and it isn't
called by ramfs_nommu_expand_for_mapping() when a file is being truncated
from nothing to allocate a contiguous run.

The solution is to mark the pages dirty at the point of allocation by the
truncation code.

Signed-off-by: Enrik Berkhan <Enrik.Berkhan@ge.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-14 11:57:22 -07:00
Eric Sandeen 8d03c7a0c5 ext4: fix bogus BUG_ONs in in mballoc code
Thiemo Nagel reported that:

# dd if=/dev/zero of=image.ext4 bs=1M count=2
# mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
  -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
# mount -o loop image.ext4 mnt/
# dd if=/dev/zero of=mnt/file

oopsed, with a BUG_ON in ext4_mb_normalize_request because
size == EXT4_BLOCKS_PER_GROUP

It appears to me (esp. after talking to Andreas) that the BUG_ON
is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
be allowed, though larger sizes do indicate a problem.

Fix that an another (apparently rare) codepath with a similar check.

Reported-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-14 11:51:46 -04:00
Tao Ma 712e53e46a ocfs2: Use xs->bucket to set xattr value outside
A long time ago, xs->base is allocated a 4K size and all the contents
in the bucket are copied to the it. Now we use ocfs2_xattr_bucket to
abstract xattr bucket and xs->base is initialized to the start of the
bu_bhs[0]. So xs->base + offset will overflow when the value root is
stored outside the first block.

Then why we can survive the xattr test by now? It is because we always
read the bucket contiguously now and kernel mm allocate continguous
memory for us. We are lucky, but we should fix it. So just get the
right value root as other callers do.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-03-12 16:46:09 -07:00
Tao Ma 74e77eb30d ocfs2: Fix a bug found by sparse check.
We need to use le32_to_cpu to test rec->e_cpos in
ocfs2_dinode_insert_check.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-03-12 16:46:01 -07:00
Tiger Yang d9ae49d6e2 ocfs2: tweak to get the maximum inline data size with xattr
Replace max_inline_data with max_inline_data_with_xattr
to ensure it correct when xattr inlined.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-03-12 16:45:46 -07:00
Tiger Yang 6c9fd1dc0a ocfs2: reserve xattr block for new directory with inline data
If this is a new directory with inline data, we choose to
reserve the entire inline area for directory contents and
force an external xattr block.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-03-12 16:45:40 -07:00
Linus Torvalds 188de5ec56 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  Squashfs: Valid filesystems are flagged as bad by the corrupted fs patch
2009-03-12 16:32:36 -07:00
Nick Piggin 7ef0d7377c fs: new inode i_state corruption fix
There was a report of a data corruption
http://lkml.org/lkml/2008/11/14/121.  There is a script included to
reproduce the problem.

During testing, I encountered a number of strange things with ext3, so I
tried ext2 to attempt to reduce complexity of the problem.  I found that
fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be
cleared, even though instrumentation showed that unlock_new_inode had
already been called for that inode.  This points to memory scribble, or
synchronisation problme.

i_state of I_NEW inodes is not protected by inode_lock because other
processes are not supposed to touch them until I_LOCK (and I_NEW) is
cleared.  Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
i_state revealed that generic_sync_sb_inodes is picking up new inodes from
the inode lists and passing them to __writeback_single_inode without
waiting for I_NEW.  Subsequently modifying i_state causes corruption.  In
my case it would look like this:

CPU0                            CPU1
unlock_new_inode()              __sync_single_inode()
 reg <- inode->i_state
 reg -> reg & ~(I_LOCK|I_NEW)   reg <- inode->i_state
 reg -> inode->i_state          reg -> reg | I_SYNC
                                reg -> inode->i_state

Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.

Fix for this is rather than wait for I_NEW inodes, just skip over them:
inodes concurrently being created are not subject to data integrity
operations, and should not significantly contribute to dirty memory
either.

After this change, I'm unable to reproduce any of the added warnings or
hangs after ~1hour of running.  Previously, the new warnings would start
immediately and hang would happen in under 5 minutes.

I'm also testing on ext3 now, and so far no problems there either.  I
don't know whether this fixes the problem reported above, but it fixes a
real problem for me.

Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Reported-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-12 16:20:24 -07:00
Li Zefan a3cfbb53b1 vfs: add missing unlock in sget()
In sget(), destroy_super(s) is called with s->s_umount held, which makes
lockdep unhappy.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-12 16:20:23 -07:00
Oleg Nesterov e5bc49ba74 pipe_rdwr_fasync: fix the error handling to prevent the leak/crash
If the second fasync_helper() fails, pipe_rdwr_fasync() returns the error
but leaves the file on ->fasync_readers.

This was always wrong, but since 233e70f422
"saner FASYNC handling on file close" we have the new problem.  Because in
this case setfl() doesn't set FASYNC bit, __fput() will not do
->fasync(0), and we leak fasync_struct with ->fa_file pointing to the
freed file.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-12 16:20:23 -07:00
Trond Myklebust 9f4c899c0d NFS: Fix the fix to Bugzilla #11061, when IPv6 isn't defined...
Stephen Rothwell reports:

Today's linux-next build (powerpc ppc64_defconfig) failed like this:

fs/built-in.o: In function `.nfs_get_client':
client.c:(.text+0x115010): undefined reference to `.__ipv6_addr_type'

Fix by moving the IPV6 specific parts of commit
d7371c41b0 ("Bug 11061, NFS mounts dropped")
into the '#ifdef IPV6..." section.

Also fix up a couple of formatting issues.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-12 14:51:32 -04:00
Theodore Ts'o 2842c3b544 ext4: Print the find_group_flex() warning only once
This is a short-term warning, and even printk_ratelimit() can result
in too much noise in system logs.  So only print it once as a warning.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-12 12:20:01 -04:00
Phillip Lougher 363911d027 Squashfs: Valid filesystems are flagged as bad by the corrupted fs patch
The corrupted filesystem patch added a check against zlib trying to
output too much data in the presence of data corruption.  This check
triggered if zlib_inflate asked to be called again (Z_OK) with
avail_out == 0 and no more output buffers available.  This check proves
to be rather dumb, as it incorrectly catches the case where zlib has
generated all the output, but there are still input bytes to be processed.

This patch does a number of things.  It removes the original check and
replaces it with code to not move to the next output buffer if there
are no more output buffers available, relying on zlib to error if it
wants an extra output buffer in the case of data corruption.  It
also replaces the Z_NO_FLUSH flag with the more correct Z_SYNC_FLUSH
flag, and makes the error messages more understandable to
non-technical users.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Reported-by: Stefan Lippers-Hollmann <s.L-H@gmx.de>
2009-03-12 03:23:48 +00:00
Steve French 64cc2c6369 [CIFS] work around bug in Samba server handling for posix open
Samba server (version 3.3.1 and earlier, and 3.2.8 and earlier) incorrectly
required the O_CREAT flag on posix open (even when a file was not being
created).  This disables posix open (create is still ok) after the first
attempt returns EINVAL (and logs an error, once, recommending that they
update their server).

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-12 01:36:21 +00:00
Steve French 276a74a483 [CIFS] Use posix open on file open when server supports it
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-12 01:36:21 +00:00
Jeff Layton fcc7c09d94 cifs: fix buffer format byte on NT Rename/hardlink
Discovered at Connnectathon 2009...

The buffer format byte and the pad are transposed in NT_RENAME calls
(which are used to set hardlinks). Most servers seem to ignore this
fact, but NetApp filers throw back an error due to this problem. This
patch fixes it.

CC: Stable <stable@kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-12 01:36:21 +00:00
Steve French 0382457744 [CIFS] Add definitions for remoteably fsctl calls
There are about 60 fsctl calls which Windows claims would be able
to be sent remotely and handled by the server. This adds the #defines
for them.  A few of them look immediately useful, but need to also
add the structure definitions for them so they can be sent as SMBs.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-12 01:36:21 +00:00
Steve French 1adcb71092 [CIFS] add extra null attr check
Although attr == NULL can not happen, this makes cifs_set_file_info safer
in the future since it may not be obvious that the caller can not set
attr to NULL.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-12 01:36:21 +00:00
Steve French 4717bed680 [CIFS] fix build error
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-12 01:36:20 +00:00
Steve French 7fc8f4e95b [CIFS] reopen file via newer posix open protocol operation if available
If the network connection crashes, and we have to reopen files, preferentially
use the newer cifs posix open protocol operation if the server supports it.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-12 01:36:20 +00:00
Steve French be652445fd [CIFS] Add new nostrictsync cifs mount option to avoid slow SMB flush
If this mount option is set, when an application does an
fsync call then the cifs client does not send an SMB Flush
to the server (to force the server to write all dirty data
for this file immediately to disk), although cifs still sends
all dirty (cached) file data to the server and waits for the
server to respond to the write write.  Since SMB Flush can be
very slow, and some servers may be reliable enough (to risk
delaying slightly flushing the data to disk on the server),
turning on this option may be useful to improve performance for
applications that fsync too much, at a small risk of server
crash.  If this mount option is not set, by default cifs will
send an SMB flush request (and wait for a response) on every
fsync call.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-12 01:36:20 +00:00
Steve French 10e70afa75 [CIFS] DFS no longer experimental
Also updates some DFS flag definitions

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-12 01:36:20 +00:00
Steve French b298f22355 [CIFS] Send SMB flush in cifs_fsync
In contrast to the now-obsolete smbfs, cifs does not send SMB_COM_FLUSH
in response to an explicit fsync(2) to guarantee that all volatile data
is written to stable storage on the server side, provided the server
honors the request (which, to my knowledge, is true for Windows and
Samba with 'strict sync' enabled).
This patch modifies the cifs_fsync implementation to restore the
fsync-behavior of smbfs by triggering SMB_COM_FLUSH after sending
outstanding data on the client side to the server.

Signed-off-by: Horst Reiterer <horst.reiterer@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-03-12 01:36:20 +00:00
Linus Torvalds 0789d8fccb Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: only issues a cache flush on unmount if barriers are enabled
  xfs: prevent lockdep false positive in xfs_iget_cache_miss
  xfs: prevent kernel crash due to corrupted inode log format
2009-03-11 14:29:03 -07:00
OGAWA Hirofumi 3a95ea1155 Fix _fat_bmap() locking
On swapon() path, it has already i_mutex. So, this uses i_alloc_sem
instead of it.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Laurent GUERBY <laurent@guerby.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-11 12:04:18 -07:00
Tom Talpey a67d18f89f NFS: load the rpc/rdma transport module automatically
When mounting an NFS/RDMA server with the "-o proto=rdma" or
"-o rdma" options, attempt to dynamically load the necessary
"xprtrdma" client transport module. Doing so improves usability,
while avoiding a static module dependency and any unnecesary
resources.

Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:37:56 -04:00
Trond Myklebust e1ebfd33be NFS: Kill the "defined but not used" compile error on nommu machines
Bryan Wu reports that when compiling NFS on nommu machines he gets a
"defined but not used" error on nfs_file_mmap().

The easiest fix is simply to get rid of the special casing in NFS, and
just always call generic_file_mmap() to set up the file.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:37:54 -04:00
Trond Myklebust 72cb77f4a5 NFS: Throttle page dirtying while we're flushing to disk
The following patch is a combination of a patch by myself and Peter
Staubach.

Trond: If we allow other processes to dirty pages while a process is doing
a consistency sync to disk, we can end up never making progress.

Peter: Attached is a patch which addresses a continuing problem with
the NFS client generating out of order WRITE requests.  While
this is compliant with all of the current protocol
specifications, there are servers in the market which can not
handle out of order WRITE requests very well.  Also, this may
lead to sub-optimal block allocations in the underlying file
system on the server.  This may cause the read throughputs to
be reduced when reading the file from the server.

Peter: There has been a lot of work recently done to address out of
order issues on a systemic level.  However, the NFS client is
still susceptible to the problem.  Out of order WRITE
requests can occur when pdflush is in the middle of writing
out pages while the process dirtying the pages calls
generic_file_buffered_write which calls
generic_perform_write which calls
balance_dirty_pages_rate_limited which ends up calling
writeback_inodes which ends up calling back into the NFS
client to writes out dirty pages for the same file that
pdflush happens to be working with.

Signed-off-by: Peter Staubach <staubach@redhat.com>
[modification by Trond to merge the two similar patches]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:30 -04:00
Trond Myklebust fb8a1f11b6 NFS: cleanup - remove struct nfs_inode->ncommit
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:29 -04:00
Trond Myklebust a65318bf3a NFSv4: Simplify some cache consistency post-op GETATTRs
Certain asynchronous operations such as write() do not expect
(or care) that other metadata such as the file owner, mode, acls, ...
change. All they want to do is update and/or check the change attribute,
ctime, and mtime.
By skipping the file owner and group update, we also avoid having to do a
potential idmapper upcall for these asynchronous RPC calls.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:28 -04:00
Trond Myklebust 69aaaae18f NFSv4: A referral is assumed to always point to a directory.
Fix a bug whereby we would fail to create a mount point for a referral.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:28 -04:00
Trond Myklebust 409924e4c9 NFSv4: Make decode_getfattr() set fattr->valid to reflect what was decoded
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:27 -04:00
Trond Myklebust f26c7a7887 NFSv4: Clean up decode_getfattr()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:26 -04:00
Trond Myklebust bca794785c NFS: Fix the type of struct nfs_fattr->mode
There is no point in using anything other than umode_t, since we copy the
content pretty much directly into inode->i_mode.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:26 -04:00
Trond Myklebust 1ca277d88d NFS: Shrink the struct nfs_fattr
We don't need the bitmap[] field anymore, since the 'valid' field tells us
all we need to know about which attributes were filled in...
Also move the pre-op attributes in order to improve the structure packing.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:25 -04:00
Trond Myklebust 9e6e70f8d8 NFSv4: Support NFSv4 optional attributes in the struct nfs_fattr
Currently, filling struct nfs_fattr is more or less an all or nothing
operation, since NFSv2 and NFSv3 have only mandatory attributes.
In NFSv4, some attributes are optional, and so we may simply not be able to
fill in those fields. Furthermore, NFSv4 allows you to specify which
attributes you are interested in retrieving, thus permitting you to
optimise away retrieval of attributes that you know will no change...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:24 -04:00
Trond Myklebust 78f945f88e NFSv4: Ignore errors on the post-op attributes in SETATTR calls
There is no need to fail or retry a SETATTR call just because the post-op
GETATTR failed.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:23 -04:00
NeilBrown 37d9d76d8b NFS: flush cached directory information slightly more readily.
If cached directory contents becomes incorrect, there is no way to
flush the contents.  This contrasts with files where file locking is
the recommended way to ensure cache consistency between multiple
applications (a read-lock always flushes the cache).

Also while changes to files often change the size of the file (thus
triggering a cache flush), changes to directories often do not change
the apparent size (as the size is often rounded to a block size).

So it is particularly important with directories to avoid the
possibility of an incorrect cache wherever possible.

When the link count on a directory changes it implies a change in the
number of child directories, and so a change in the contents of this
directory.  So use that as a trigger to flush cached contents.

When the ctime changes but the mtime does not, there are two possible
reasons.
 1/ The owner/mode information has been changed.
 2/ utimes has been used to set the mtime backwards.

In the first case, a data-cache flush is not required.
In the second case it is.

So on the basis that correctness trumps performance, flush the
directory contents cache in this case also.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:23 -04:00
Suresh Jayaraman 2b57dc6cf9 NFS: Minor __nfs_revalidate_inode cleanup
Remove redundant NFS_STALE() check, a leftover due to the commit
691beb13cd

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11 14:10:22 -04:00
David Teigland 1fecb1c4b6 dlm: fix length calculation in compat code
Using offsetof() to calculate name length does not work because
it does not produce consistent results with with structure packing.
This caused memcpy to corrupt memory by copying 4 extra bytes off
the end of the buffer on 64 bit kernels with 32 bit userspace
(the only case where this 32/64 compat code is used).

The fix is to calculate name length directly from the start instead
of trying to derive it later using count and offsetof.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-03-11 12:23:59 -05:00
David Teigland a536e38125 dlm: ignore cancel on granted lock
Return immediately from dlm_unlock(CANCEL) if the lock is
granted and not being converted; there's nothing to cancel.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-03-11 12:23:58 -05:00
David Teigland 43279e5376 dlm: clear defunct cancel state
When a conversion completes successfully and finds that a cancel
of the convert is still in progress (which is now a moot point),
preemptively clear the state associated with outstanding cancel.
That state could cause a subsequent conversion to be ignored.

Also, improve the consistency and content of error and debug
messages in this area.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-03-11 12:23:39 -05:00
Christine Caulfield 5e9ccc372d dlm: replace idr with hash table for connections
Integer nodeids can be too large for the idr code; use a hash
table instead.

Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-03-11 12:20:58 -05:00
Wu Fengguang ad3bdefe87 proc: fix kflags to uflags copying in /proc/kpageflags
Fix kpf_copy_bit(src,dst) to be kpf_copy_bit(dst,src) to match the
actual call patterns, e.g. kpf_copy_bit(kflags, KPF_LOCKED, PG_locked).

This misplacement of src/dst only affected reporting of PG_writeback,
PG_reclaim and PG_buddy. For others kflags==uflags so not affected.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-11 07:43:33 -07:00
Ian Dall d7371c41b0 Bug 11061, NFS mounts dropped
Addresses: http://bugzilla.kernel.org/show_bug.cgi?id=11061

sockaddr structures can't be reliably compared using memcmp() because
there are padding bytes in the structure which can't be guaranteed to
be the same even when the sockaddr structures refer to the same
socket. Instead compare all the relevant fields. In the case of IPv6
sin6_flowinfo is not compared because it only affects QoS and
sin6_scope_id is only compared if the address is "link local" because
"link local" addresses need only be unique to a specific link.

Signed-off-by: Ian Dall <ian@beware.dropbear.id.au>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-10 20:33:22 -04:00
Suresh Jayaraman a71ee337b3 NFS: Handle -ESTALE error in access()
Hi Trond,

I have been looking at a bugreport where trying to open applications on KDE
on a NFS mounted home fails temporarily. There have been multiple reports on
different kernel versions pointing to this common issue:
http://bugzilla.kernel.org/show_bug.cgi?id=12557
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/269954
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=508866.html

This issue can be reproducible consistently by doing this on a NFS mounted
home (KDE):
1. Open 2 xterm sessions
2. From one of the xterm session, do "ssh -X <remote host>"
3. "stat ~/.Xauthority" on the remote SSH session
4. Close the two xterm sessions
5. On the server do a "stat ~/.Xauthority"
6. Now on the client, try to open xterm
This will fail.

Even if the filehandle had become stale, the NFS client should invalidate
the cache/inode and should repeat LOOKUP. Looking at the packet capture when
the failure occurs shows that there were two subsequent ACCESS() calls with
the same filehandle and both fails with -ESTALE error.

I have tested the fix below. Now the client issue a LOOKUP after the
ACCESS() call fails with -ESTALE. If all this makes sense to you, can you
consider this for inclusion?

Thanks,


If the server returns an -ESTALE error due to stale filehandle in response to
an ACCESS() call, we need to invalidate the cache and inode so that LOOKUP()
can be retried. Without this change, the nfs client retries ACCESS() with the
same filehandle, fails again and could lead to temporary failure of
applications running on nfs mounted home.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-10 20:33:21 -04:00
Chuck Lever 57df675c60 NLM: Fix GRANT callback address comparison when IPv6 is enabled
The NFS mount command may pass an AF_INET server address to lockd.  If
lockd happens to be using a PF_INET6 listener, the nlm_cmp_addr() in
nlmclnt_grant() will fail to match requests from that host because they
will all have a mapped IPv4 AF_INET6 address.

Adopt the same solution used in nfs_sockaddr_match_ipaddr() for NFSv4
callbacks: if either address is AF_INET, map it to an AF_INET6 address
before doing the comparison.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-10 20:33:20 -04:00
Trond Myklebust ae46141ff0 NFSv3: Fix posix ACL code
Fix a memory leak due to allocation in the XDR layer. In cases where the
RPC call needs to be retransmitted, we end up allocating new pages without
clearing the old ones. Fix this by moving the allocation into
nfs3_proc_setacls().

Also fix an issue discovered by Kevin Rudd, whereby the amount of memory
reserved for the acls in the xdr_buf->head was miscalculated, and causing
corruption.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-10 20:33:18 -04:00
Trond Myklebust ef95d31e6d NFS: Fix misparsing of nfsv4 fs_locations attribute (take 2)
The changeset ea31a4437c (nfs: Fix
misparsing of nfsv4 fs_locations attribute) causes the mountpath that is
calculated at the beginning of try_location() to be clobbered when we
later strncpy a non-nul terminated hostname using an incorrect buffer
length.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-10 20:33:17 -04:00
Alexey Dobriyan 260219cc48 devpts: remove graffiti
Very annoying when working with containters.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-10 15:55:11 -07:00
Eric Sandeen 395a87bfef ext4: fix header check in ext4_ext_search_right() for deep extent trees.
The ext4_ext_search_right() function is confusing; it uses a
"depth" variable which is 0 at the root and maximum at the leaves, 
but the on-disk metadata uses a "depth" (actually eh_depth) which
is opposite: maximum at the root, and 0 at the leaves.

The ext4_ext_check_header() function is given a depth and checks
the header agaisnt that depth; it expects the on-disk semantics,
but we are giving it the opposite in the while loop in this 
function.  We should be giving it the on-disk notion of "depth"
which we can get from (p_depth - depth) - and if you look, the last
(more commonly hit) call to ext4_ext_check_header() does just this.

Sending in the wrong depth results in (incorrect) messages
about corruption:

EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2621457: unexpected eh_depth - magic f30a, entries 340,
max 340(0), depth 1(2)

http://bugzilla.kernel.org/show_bug.cgi?id=12821

Reported-by: David Dindorp <ddi@dubex.dk>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-10 18:18:47 -04:00
Chris Mason 913d952eb5 Btrfs: Clear space_info full when adding new devices
The full flag on the space info structs tells the allocator not to try
and allocate more chunks because the devices in the FS are fully allocated.

When more devices are added, we need to clear the full flag so the allocator
knows it has more space available.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-10 13:17:18 -04:00
Chris Mason 4184ea7f90 Btrfs: Fix locking around adding new space_info
Storage allocated to different raid levels in btrfs is tracked by
a btrfs_space_info structure, and all of the current space_infos are
collected into a list_head.

Most filesystems have 3 or 4 of these structs total, and the list is
only changed when new raid levels are added or at unmount time.

This commit adds rcu locking on the list head, and properly frees
things at unmount time.  It also clears the space_info->full flag
whenever new space is added to the FS.

The locking for the space info list goes like this:

reads: protected by rcu_read_lock()
writes: protected by the chunk_mutex

At unmount time we don't need special locking because all the readers
are gone.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-10 12:39:20 -04:00
Linus Torvalds 1c91ffc896 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix spinlock assertions on UP systems
2009-03-09 09:13:16 -07:00
Chris Mason b9447ef80b Btrfs: fix spinlock assertions on UP systems
btrfs_tree_locked was being used to make sure a given extent_buffer was
properly locked in a few places.  But, it wasn't correct for UP compiled
kernels.

This switches it to using assert_spin_locked instead, and renames it to
btrfs_assert_tree_locked to better reflect how it was really being used.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-09 11:45:38 -04:00
Linus Torvalds 5b61f6accf Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix ext4_free_inode() vs. ext4_claim_inode() race
2009-03-08 10:24:57 -07:00
Christoph Hellwig c141b2928f xfs: only issues a cache flush on unmount if barriers are enabled
Currently we unconditionally issue a flush from xfs_free_buftarg, but
since 2.6.29-rc1 this gives a warning in the style of

	end_request: I/O error, dev vdb, sector 0

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-03-06 17:35:12 -06:00
Christoph Hellwig 7d46be4a25 xfs: prevent lockdep false positive in xfs_iget_cache_miss
The inode can't be locked by anyone else as we just created it a few
lines above and it's not been added to any lookup data structure yet.

So use a trylock that must succeed to get around the lockdep warnings.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-03-06 17:34:59 -06:00
Christoph Hellwig ff392c497b xfs: prevent kernel crash due to corrupted inode log format
Andras Korn reported an oops on log replay causes by a corrupted
xfs_inode_log_format_t passing a 0 size to kmem_zalloc.  This patch handles
to small or too large numbers of log regions gracefully by rejecting the
log replay with a useful error message.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Andras Korn <korn-sgi.com@chardonnay.math.bme.hu>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-03-06 17:34:45 -06:00
David S. Miller 508827ff0a Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/tokenring/tmspci.c
	drivers/net/ucc_geth_mii.c
2009-03-05 02:06:47 -08:00
Roel Kluin f4f8056a86 Squashfs: frag_size should be signed, as it can hold an error result
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-03-05 00:55:31 +00:00
Theodore Ts'o 7d39db14a4 ext4: Use struct flex_groups to calculate get_orlov_stats()
Instead of looping over all of the block groups in a flex group
summing their summary statistics, start tracking used_dirs in struct
flex_groups, and use struct flex_groups instead.  This should save a
bit of CPU for mkdir-heavy workloads.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-04 19:31:53 -05:00
Phillip Lougher 118e1ef6fa Squashfs: Fix oops when reading fsfuzzer corrupted filesystems
This fixes a code regression caused by the recent mainlining changes.
The recent code changes call zlib_inflate repeatedly, decompressing into
separate 4K buffers, this code didn't check for the possibility that
zlib_inflate might ask for too many buffers when decompressing corrupted
data.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-03-05 00:31:12 +00:00
Theodore Ts'o 9f24e4208f ext4: Use atomic_t's in struct flex_groups
Reduce pressure on the sb_bgl_lock family of locks by using atomic_t's
to track the number of free blocks and inodes in each flex_group.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-04 19:09:10 -05:00
Theodore Ts'o b713a5ec55 ext4: remove /proc tuning knobs
Remove tuning knobs in /proc/fs/ext4/<dev/* since they have been
replaced by knobs in sysfs at /sys/fs/ext4/<dev>/*.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-31 09:11:14 -04:00
Theodore Ts'o 3197ebdb13 ext4: Add sysfs support
Add basic sysfs support so that information about the mounted
filesystem and various tuning parameters can be accessed via
/sys/fs/ext4/<dev>/*.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-31 09:10:09 -04:00
Eric Sandeen 7ce9d5d1f3 ext4: fix ext4_free_inode() vs. ext4_claim_inode() race
I was seeing fsck errors on inode bitmaps after a 4 thread
dbench run on a 4 cpu machine:

Inode bitmap differences: -50736 -(50752--50753) etc...

I believe that this is because ext4_free_inode() uses atomic
bitops, and although ext4_new_inode() *used* to also use atomic 
bitops for synchronization, commit 
393418676a changed this to use
the sb_bgl_lock, so that we could also synchronize against
read_inode_bitmap and initialization of uninit inode tables.

However, that change left ext4_free_inode using atomic bitops,
which I think leaves no synchronization between setting & 
unsetting bits in the inode table.

The below patch fixes it for me, although I wonder if we're 
getting at all heavy-handed with this spinlock...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-04 18:38:18 -05:00
Linus Torvalds 36b31106b7 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: don't call jbd2_journal_force_commit_nested without journal
  ext4: Reorder fs/Makefile so that ext2 root fs's are mounted using ext2
  ext4: Remove duplicate call to ext4_commit_super() in ext4_freeze()
2009-03-02 15:42:26 -08:00
David S. Miller aa4abc9bcc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-tx.c
	net/8021q/vlan_core.c
	net/core/dev.c
2009-03-01 21:35:16 -08:00
Theodore Ts'o afc32f7ee9 ext4: Track lifetime disk writes
Add a new superblock value which tracks the lifetime amount of writes
to the filesystem.  This is useful in estimating the amount of wear on
solid state drives (SSD's) caused by writes to the filesystem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-28 19:39:58 -05:00
Aneesh Kumar K.V d6014301b5 ext4: Fix discard of inode prealloc space with delayed allocation.
With delayed allocation we should not/cannot discard inode prealloc
space during file close. We would still have dirty pages for which we
haven't allocated blocks yet. With this fix after each get_blocks
request we check whether we have zero reserved blocks and if yes and
we don't have any writers on the file we discard inode prealloc space.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-27 22:36:43 -04:00
Christoph Hellwig 5cf8cf4146 Fix FREEZE/THAW compat_ioctl regression
Commit 8e961870bb removed the FREEZE/THAW
handling in xfs_compat_ioctl but never added any compat handler back, so
now any freeze/thaw request from a 32-bit binary ond 64-bit userspace
will fail.

As these ioctls are 32/64-bit compatible two simple COMPATIBLE_IOCTL
entries in fs/compat_ioctl.c will do the job.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-27 16:27:45 -08:00
Benny Halevy adc487204a EXPORT_SYMBOL(d_obtain_alias) rather than EXPORT_SYMBOL_GPL
Commit 4ea3ada295 declares d_obtain_alias()
as EXPORT_SYMBOL_GPL where it's supposed to replace d_alloc_anon which was
previously declared as EXPORT_SYMBOL and thus available to any loadable
module.

This patch reverts that.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-27 16:26:20 -08:00
Linus Torvalds 221be177e6 Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6:
  [MTD] [MAPS] Remove MODULE_DEVICE_TABLE() from ck804rom driver.
  [JFFS2] fix mount crash caused by removed nodes
  [JFFS2] force the jffs2 GC daemon to behave a bit better
  [MTD] [MAPS] blackfin async requires complex mappings
  [MTD] [MAPS] blackfin: fix memory leak in error path
  [MTD] [MAPS] physmap: fix wrong free and del_mtd_{partition,device}
  [MTD] slram: Handle negative devlength correctly
  [MTD] map_rom has NULL erase pointer
  [MTD] [LPDDR] qinfo_probe depends on lpddr
2009-02-26 14:45:57 -08:00
wengang wang 28d57d4377 ocfs2: add IO error check in ocfs2_get_sector()
Check for IO error in ocfs2_get_sector().

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-26 11:51:12 -08:00
Tiger Yang 4442f51826 ocfs2: set gap to seperate entry and value when xattr in bucket
This patch set a gap (4 bytes) between xattr entry and
name/value when xattr in bucket. This gap use to seperate
entry and name/value when a bucket is full. It had already
been set when xattr in inode/block.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-26 11:51:11 -08:00
Tao Ma c8b9cf9a7c ocfs2: lock the metaecc process for xattr bucket
For other metadata in ocfs2, metaecc is checked in ocfs2_read_blocks
with io_mutex held. While for xattr bucket, it is calculated by
the whole buckets. So we have to add a spin_lock to prevent multiple
processes calculating metaecc.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Tested-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-26 11:51:11 -08:00
Tao Ma 89a907afe0 ocfs2: Use the right access_* method in ctime update of xattr.
In ctime updating of xattr, it use the wrong type of access for
inode, so use ocfs2_journal_access_di instead.

Reported-and-Tested-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-26 11:51:11 -08:00
Sunil Mushran 53ecd25e14 ocfs2/dlm: Make dlm_assert_master_handler() kill itself instead of the asserter
In dlm_assert_master_handler(), if we get an incorrect assert master from a node
that, we reply with EINVAL asking the asserter to die. The problem is that an
assert is sent after so many hoops, it is invariably the node that thinks the
asserter is wrong, is actually wrong. So instead of killing the asserter, this
patch kills the assertee.

This patch papers over a race that is still being addressed.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-26 11:51:11 -08:00
Sunil Mushran dabc47de7a ocfs2/dlm: Use ast_lock to protect ast_list
The code was using dlm->spinlock instead of dlm->ast_lock to protect the
ast_list. This patch fixes the issue.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-26 11:51:09 -08:00
Sunil Mushran c74ff8bb22 ocfs2: Cleanup the lockname print in dlmglue.c
The dentry lock has a different format than other locks. This patch fixes
ocfs2_log_dlm_error() macro to make it print the dentry lock correctly.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-26 11:51:09 -08:00
Sunil Mushran 7dc102b737 ocfs2/dlm: Retract fix for race between purge and migrate
Mainline commit d4f7e650e5 attempts to delay
the dlm_thread from sending the drop ref message if the lockres is being
migrated. The problem is that we make the dlm_thread wait for the migration
to complete. This causes a deadlock as dlm_thread also participates in the
lockres migration process.

A better fix for the original oss bugzilla#1012 is in testing.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-26 11:51:09 -08:00
Tao Ma 47be12e4ee ocfs2: Access and dirty the buffer_head in mark_written.
In __ocfs2_mark_extent_written, when we meet with the situation
of c_split_covers_rec, the old solution just replace the extent
record and forget to access and dirty the buffer_head. This will
cause a problem when the unwritten extent is in an extent block.
So access and dirty it.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-26 11:51:09 -08:00
Linus Torvalds 64e71303e4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: try committing transaction before returning ENOSPC
  Btrfs: add better -ENOSPC handling
2009-02-26 10:37:00 -08:00
Jens Axboe b2bf96833c block: fix bogus gcc warning for uninitialized var usage
Newer gcc throw this warning:

        fs/bio.c: In function ?bio_alloc_bioset?:
        fs/bio.c:305: warning: ?p? may be used uninitialized in this function

since it cannot figure out that 'p' is only ever used if 'bs' is non-NULL.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-26 10:45:48 +01:00
Eric Sandeen 8f64b32eb7 ext4: don't call jbd2_journal_force_commit_nested without journal
Running without a journal, I oopsed when I ran out of space,
because we called jbd2_journal_force_commit_nested() from
ext4_should_retry_alloc() without a journal.

This should take care of it, I think.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-26 00:57:35 -05:00
Theodore Ts'o d8ae4601a4 ext4: Reorder fs/Makefile so that ext2 root fs's are mounted using ext2
In fs/Makefile, ext3 was placed before ext2 so that a root filesystem
that possessed a journal, it would be mounted as ext3 instead of ext2.
This was necessary because a cleanly unmounted ext3 filesystem was
fully backwards compatible with ext2, and could be mounted by ext2 ---
but it was desirable that it be mounted with ext3 so that the
journaling would be enabled.

The ext4 filesystem supports new incompatible features, so there is no
danger of an ext4 filesystem being mistaken for an ext2 filesystem.
At that point, the relative ordering of ext4 with respect to ext2
didn't matter until ext4 gained the ability to mount filesystems
without a journal starting in 2.6.29-rc1.  Now that this is the case,
given that ext4 is before ext2, it means that root filesystems that
were using the plain-jane ext2 format are getting mounted using the
ext4 filesystem driver, which is a change in behavior which could be
surprising to users.

It's doubtful that there are that many ext2-only root filesystem users
that would also have ext4 compiled into the kernel, but to adhere to
the principle of least surprise, the correct ordering in fs/Makefile
is ext3, followed by ext2, and finally ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-28 09:50:01 -05:00
Theodore Ts'o 8b1a8ff8b3 ext4: Remove duplicate call to ext4_commit_super() in ext4_freeze()
Commit c4be0c1d added error checking to ext4_freeze() when calling
ext4_commit_super().  Unfortunately the patch failed to remove the
original call to ext4_commit_super(), with the net result that when
freezing the filesystem, the superblock gets written twice, the first
time without error checking.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-28 00:08:53 -05:00
Linus Torvalds 694593e337 Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc
* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc:
  proc: fix PG_locked reporting in /proc/kpageflags
2009-02-24 15:42:08 -08:00
Linus Torvalds 4daa0682af Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix deadlock in ext4_write_begin() and ext4_da_write_begin()
  ext4: Add fallback for find_group_flex
2009-02-24 15:39:34 -08:00
Helge Bahmann e07a4b9217 proc: fix PG_locked reporting in /proc/kpageflags
Expr always evaluates to zero.

Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-02-24 21:17:58 +03:00
David S. Miller e70049b9e7 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2009-02-24 03:50:29 -08:00
Theodore Ts'o 8750c6d5fc ext4: Automatically allocate delay allocated blocks on rename
When renaming a file such that a link to another inode is overwritten,
force any delay allocated blocks that to be allocated so that if the
filesystem is mounted with data=ordered, the data blocks will be
pushed out to disk along with the journal commit.  Many application
programs expect this, so we do this to avoid zero length files if the
system crashes unexpectedly.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-23 23:05:27 -05:00
Theodore Ts'o 7d8f9f7d15 ext4: Automatically allocate delay allocated blocks on close
When closing a file that had been previously truncated, force any
delay allocated blocks that to be allocated so that if the filesystem
is mounted with data=ordered, the data blocks will be pushed out to
disk along with the journal commit.  Many application programs expect
this, so we do this to avoid zero length files if the system crashes
unexpectedly.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-24 08:21:14 -05:00
Theodore Ts'o ccd2506bd4 ext4: add EXT4_IOC_ALLOC_DA_BLKS ioctl
Add an ioctl which forces all of the delay allocated blocks to be
allocated.  This also provides a function ext4_alloc_da_blocks() which
will be used by the following commits to force files to be fully
allocated to preserve application-expected ext3 behaviour.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-26 01:04:07 -05:00
Krzysztof Sachanowicz cac711211a proc: proc_get_inode should de_put when inode already initialized
de_get is called before every proc_get_inode, but corresponding de_put is
called only when dropping last reference to an inode. This might cause
something like
remove_proc_entry: /proc/stats busy, count=14496
to be printed to the syslog.

The fix is to call de_put in case of an already initialized inode in
proc_get_inode.

Signed-off-by: Krzysztof Sachanowicz <analyzer1@gmail.com>
Tested-by: Marcin Pilipczuk <marcin.pilipczuk@gmail.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-23 18:25:32 -08:00
Theodore Ts'o f63e6005bc ext4: Simplify delalloc code by removing mpage_da_writepages()
The mpage_da_writepages() function is only used in one place, so
inline it to simplify the call stack and make the code easier to
understand.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-23 16:42:39 -05:00
Theodore Ts'o 8dc207c0e7 ext4: Save stack space by removing fake buffer heads
Struct mpage_da_data and mpage_add_bh_to_extent() use a fake struct
buffer_head which is 104 bytes on an x86_64 system, but only use 24
bytes of the structure.  On systems that use a spinlock for atomic_t,
the stack savings will be even greater.

It turns out that using a fake struct buffer_head doesn't even save
that much code, and it makes the code more confusing since it's not
used as a "real" buffer head.  So just store pass b_size and b_state
in mpage_add_bh_to_extent(), and store b_size, b_state, and b_block_nr
in the mpage_da_data structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-23 06:46:01 -05:00
Theodore Ts'o ed5bde0bf8 ext4: Simplify delalloc implementation by removing mpd.get_block
This parameter was always set to ext4_da_get_block_write().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-23 10:48:07 -05:00
Aneesh Kumar K.V 7a262f7c69 ext4: Validate extent details only when read from the disk
Make sure we validate extent details only when read from the disk.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-27 16:39:58 -04:00
Aneesh Kumar K.V 56b19868ac ext4: Add checks to validate extent entries.
This patch adds checks to validate the extent entries along with extent
headers, to avoid crashes caused by corrupt filesystems.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-12 09:51:20 -04:00
Bryan Donlan e6f009b0b4 ext4: return -EIO not -ESTALE on directory traversal through deleted inode
ext4_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly.  However, in ext4_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted
such that a directory entry references a deleted inode.  This leads to
a misleading error message - "Stale NFS file handle" - and confusion
on the part of the admin.

The bug can be easily reproduced by creating a new filesystem, making
a link to an unused inode using debugfs, then mounting and attempting
to ls -l said link.

This patch thus changes ext4_lookup to return -EIO if it receives
-ESTALE from ext4_iget(), as ext4 does for other filesystem metadata
corruption; and also invokes the appropriate ext*_error functions when
this case is detected.

Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-22 21:20:25 -05:00
Theodore Ts'o a4912123b6 ext4: New inode/block allocation algorithms for flex_bg filesystems
The find_group_flex() inode allocator is now only used if the
filesystem is mounted using the "oldalloc" mount option.  It is
replaced with the original Orlov allocator that has been updated for
flex_bg filesystems (it should behave the same way if flex_bg is
disabled).  The inode allocator now functions by taking into account
each flex_bg group, instead of each block group, when deciding whether
or not it's time to allocate a new directory into a fresh flex_bg.

The block allocator has also been changed so that the first block
group in each flex_bg is preferred for use for storing directory
blocks.  This keeps directory blocks close together, which is good for
speeding up e2fsck since large directories are more likely to look
like this:

debugfs:  stat /home/tytso/Maildir/cur
Inode: 1844562   Type: directory    Mode:  0700   Flags: 0x81000
Generation: 1132745781    Version: 0x00000000:0000ad71
User: 15806   Group: 15806   Size: 1060864
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 2072
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009
 atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009
 mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009
crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009
Size of extra inode fields: 28
BLOCKS:
(0):7348651, (1-258):7348654-7348911
TOTAL: 259

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-12 12:18:34 -04:00
Jan Kara ebd3610b11 ext4: Fix deadlock in ext4_write_begin() and ext4_da_write_begin()
Functions ext4_write_begin() and ext4_da_write_begin() call
grab_cache_page_write_begin() without AOP_FLAG_NOFS. Thus it
can happen that page reclaim is triggered in that function
and it recurses back into the filesystem (or some other filesystem).
But this can lead to various problems as a transaction is already
started at that point. Add the necessary flag.

http://bugzilla.kernel.org/show_bug.cgi?id=11688

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-22 21:09:59 -05:00
Theodore Ts'o 05bf9e839d ext4: Add fallback for find_group_flex
This is a workaround for find_group_flex() which badly needs to be
replaced.  One of its problems (besides ignoring the Orlov algorithm)
is that it is a bit hyperactive about returning failure under
suspicious circumstances.  This can lead to spurious ENOSPC failures
even when there are inodes still available.

Work around this for now by retrying the search using
find_group_other() if find_group_flex() returns -1.  If
find_group_other() succeeds when find_group_flex() has failed, log a
warning message.

A better block/inode allocator that will fix this problem for real has
been queued up for the next merge window.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-21 12:13:24 -05:00
Linus Torvalds 710320d579 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] Fix multiuser mounts so server does not invalidate earlier security contexts
  [CIFS] improve posix semantics of file create
  [CIFS] Fix oops in cifs_strfromUCS_le mounting to servers which do not specify their OS
  cifs: posix fill in inode needed by posix open
  cifs: properly handle case where CIFSGetSrvInodeNumber fails
  cifs: refactor new_inode() calls and inode initialization
  [CIFS] Prevent OOPs when mounting with remote prefixpath.
  [CIFS] ipv6_addr_equal for address comparison
2009-02-21 09:11:28 -08:00
Thomas Gleixner 4c41bd0ec9 [JFFS2] fix mount crash caused by removed nodes
At scan time we observed following scenario:

   node A inserted
   node B inserted
   node C inserted -> sets overlapped flag on node B

   node A is removed due to CRC failure -> overlapped flag on node B remains

   while (tn->overlapped)
   	 tn = tn_prev(tn);

   ==> crash, when tn_prev(B) is referenced.

When the ultimate node is removed at scan time and the overlapped flag
is set on the penultimate node, then nothing updates the overlapped
flag of that node. The overlapped iterators blindly expect that the
ultimate node does not have the overlapped flag set, which causes the
scan code to crash.

It would be a huge overhead to go through the node chain on node
removal and fix up the overlapped flags, so detecting such a case on
the fly in the overlapped iterators is a simpler and reliable
solution.

Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-02-21 11:09:29 +01:00
Steve French eca6acf915 [CIFS] Fix multiuser mounts so server does not invalidate earlier security contexts
When two different users mount the same Windows 2003 Server share using CIFS,
the first session mounted can be invalidated.  Some servers invalidate the first
smb session when a second similar user (e.g. two users who get mapped by server to "guest")
authenticates an smb session from the same client.

By making sure that we set the 2nd and subsequent vc numbers to nonzero values,
this ensures that we will not have this problem.

Fixes Samba bug 6004, problem description follows:
How to reproduce:

- configure an "open share" (full permissions to Guest user) on Windows 2003
Server (I couldn't reproduce the problem with Samba server or Windows older
than 2003)
- mount the share twice with different users who will be authenticated as guest.

 noacl,noperm,user=john,dir_mode=0700,domain=DOMAIN,rw
 noacl,noperm,user=jeff,dir_mode=0700,domain=DOMAIN,rw

Result:

- just the mount point mounted last is accessible:

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-02-21 03:37:10 +00:00
Steve French c3b2a0c640 [CIFS] improve posix semantics of file create
Samba server added support for a new posix open/create/mkdir operation
a year or so ago, and we added support to cifs for mkdir to use it,
but had not added the corresponding code to file create.

The following patch helps improve the performance of the cifs create
path (to Samba and servers which support the cifs posix protocol
extensions).  Using Connectathon basic test1, with 2000 files, the
performance improved about 15%, and also helped reduce network traffic
(17% fewer SMBs sent over the wire) due to saving a network round trip
for the SetPathInfo on every file create.

It should also help the semantics (and probably the performance) of
write (e.g. when posix byte range locks are on the file) on file
handles opened with posix create, and adds support for a few flags
which would have to be ignored otherwise.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-02-21 03:37:09 +00:00
Steve French 69765529d7 [CIFS] Fix oops in cifs_strfromUCS_le mounting to servers which do not specify their OS
Fixes kernel bug #10451 http://bugzilla.kernel.org/show_bug.cgi?id=10451

Certain NAS appliances do not set the operating system or network operating system
fields in the session setup response on the wire.  cifs was oopsing on the unexpected
zero length response fields (when trying to null terminate a zero length field).

This fixes the oops.

Acked-by: Jeff Layton <jlayton@redhat.com>
CC: stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-02-21 03:37:09 +00:00
Jeff Layton 44f68fadd8 cifs: posix fill in inode needed by posix open
function needed to prepare for posix open

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-02-21 03:37:08 +00:00
Jeff Layton 950ec52880 cifs: properly handle case where CIFSGetSrvInodeNumber fails
...if it does then we pass a pointer to an unintialized variable for
the inode number to cifs_new_inode. Have it pass a NULL pointer instead.

Also tweak the function prototypes to reduce the amount of casting.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-02-21 03:37:08 +00:00
Jeff Layton 132ac7b77c cifs: refactor new_inode() calls and inode initialization
Move new inode creation into a separate routine and refactor the
callers to take advantage of it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-02-21 03:37:07 +00:00
Igor Mammedov e4cce94c9c [CIFS] Prevent OOPs when mounting with remote prefixpath.
Fixes OOPs with message 'kernel BUG at fs/cifs/cifs_dfs_ref.c:274!'.
Checks if the prefixpath in an accesible while we are still in cifs_mount
and fails with reporting a error if we can't access the prefixpath

Should fix Samba bugs 6086 and 5861 and kernel bug 12192

Signed-off-by: Igor Mammedov <niallain@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-02-21 03:36:21 +00:00
Linus Torvalds 264b299006 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: check file pointer in btrfs_sync_file
2009-02-20 17:59:14 -08:00
Josef Bacik 4e06bdd6cb Btrfs: try committing transaction before returning ENOSPC
This fixes a problem where we could return -ENOSPC when we may actually have
plenty of space, the space is just pinned.  Instead of returning -ENOSPC
immediately, commit the transaction first and then try and do the allocation
again.

This patch also does chunk allocation for metadata if we pass the 80%
threshold for metadata space.  This will help with stack usage since the chunk
allocation will happen early on, instead of when the allocation is happening.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-20 10:59:53 -05:00
Josef Bacik 6a63209fc0 Btrfs: add better -ENOSPC handling
This is a step in the direction of better -ENOSPC handling.  Instead of
checking the global bytes counter we check the space_info bytes counters to
make sure we have enough space.

If we don't we go ahead and try to allocate a new chunk, and then if that fails
we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
bytes_delalloc and bytes_may_use.

bytes_delalloc account for extents we've actually setup for delalloc and will
be allocated at some point down the line. 

bytes_may_use is to keep track of how many bytes we may use for delalloc at
some point.  When we actually set the extent_bit for the delalloc bytes we
subtract the reserved bytes from the bytes_may_use counter.  This keeps us from
not actually being able to allocate space for any delalloc bytes.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-20 11:00:09 -05:00
Chris Mason 2cfbd50b53 Btrfs: check file pointer in btrfs_sync_file
fsync can be called by NFS with a null file pointer, and btrfs was
oopsing in this case.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-20 10:55:10 -05:00
Linus Torvalds 620565ef5f Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  Revert "[XFS] remove old vmap cache"
  Revert "[XFS] use scalable vmap API"
2009-02-19 13:09:32 -08:00
Felix Blyakher 27e88bf6af Revert "[XFS] remove old vmap cache"
This reverts commit d2859751cd.

This commit caused regression. We'll try to fix use of new
vmap API for next release.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-02-19 13:15:55 -06:00
Felix Blyakher 7fdf582447 Revert "[XFS] use scalable vmap API"
This reverts commit 95f8e302c0.

This commit caused regression. We'll try to fix use of new
vmap API for next release.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-02-19 13:15:44 -06:00
Linus Torvalds ba95fd47d1 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list
  block: fix booting from partitioned md array
  block: revert part of 18ce3751cc
  cciss: PCI power management reset for kexec
  paride/pg.c: xs(): &&/|| confusion
  fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
  block: fix bad definition of BIO_RW_SYNC
  bsg: Fix sense buffer bug in SG_IO
2009-02-18 18:33:04 -08:00
Ingo Molnar f04b30de3c inotify: fix GFP_KERNEL related deadlock
Enhanced lockdep coverage of __GFP_NOFS turned up this new lockdep
assert:

[ 1093.677775]
[ 1093.677781] =================================
[ 1093.680031] [ INFO: inconsistent lock state ]
[ 1093.680031] 2.6.29-rc5-tip-01504-gb49eca1-dirty #1
[ 1093.680031] ---------------------------------
[ 1093.680031] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 1093.680031] kswapd0/308 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1093.680031]  (&inode->inotify_mutex){+.+.?.}, at: [<c0205942>] inotify_inode_is_dead+0x20/0x80
[ 1093.680031] {RECLAIM_FS-ON-W} state was registered at:
[ 1093.680031]   [<c01696b9>] mark_held_locks+0x43/0x5b
[ 1093.680031]   [<c016baa4>] lockdep_trace_alloc+0x6c/0x6e
[ 1093.680031]   [<c01cf8b0>] kmem_cache_alloc+0x20/0x150
[ 1093.680031]   [<c040d0ec>] idr_pre_get+0x27/0x6c
[ 1093.680031]   [<c02056e3>] inotify_handle_get_wd+0x25/0xad
[ 1093.680031]   [<c0205f43>] inotify_add_watch+0x7a/0x129
[ 1093.680031]   [<c020679e>] sys_inotify_add_watch+0x20f/0x250
[ 1093.680031]   [<c010389e>] sysenter_do_call+0x12/0x35
[ 1093.680031]   [<ffffffff>] 0xffffffff
[ 1093.680031] irq event stamp: 60417
[ 1093.680031] hardirqs last  enabled at (60417): [<c018d5f5>] call_rcu+0x53/0x59
[ 1093.680031] hardirqs last disabled at (60416): [<c018d5b9>] call_rcu+0x17/0x59
[ 1093.680031] softirqs last  enabled at (59656): [<c0146229>] __do_softirq+0x157/0x16b
[ 1093.680031] softirqs last disabled at (59651): [<c0106293>] do_softirq+0x74/0x15d
[ 1093.680031]
[ 1093.680031] other info that might help us debug this:
[ 1093.680031] 2 locks held by kswapd0/308:
[ 1093.680031]  #0:  (shrinker_rwsem){++++..}, at: [<c01b0502>] shrink_slab+0x36/0x189
[ 1093.680031]  #1:  (&type->s_umount_key#4){+++++.}, at: [<c01e6d77>] shrink_dcache_memory+0x110/0x1fb
[ 1093.680031]
[ 1093.680031] stack backtrace:
[ 1093.680031] Pid: 308, comm: kswapd0 Not tainted 2.6.29-rc5-tip-01504-gb49eca1-dirty #1
[ 1093.680031] Call Trace:
[ 1093.680031]  [<c016947a>] valid_state+0x12a/0x13d
[ 1093.680031]  [<c016954e>] mark_lock+0xc1/0x1e9
[ 1093.680031]  [<c016a5b4>] ? check_usage_forwards+0x0/0x3f
[ 1093.680031]  [<c016ab74>] __lock_acquire+0x2c6/0xac8
[ 1093.680031]  [<c01688d9>] ? register_lock_class+0x17/0x228
[ 1093.680031]  [<c016b3d3>] lock_acquire+0x5d/0x7a
[ 1093.680031]  [<c0205942>] ? inotify_inode_is_dead+0x20/0x80
[ 1093.680031]  [<c08824c4>] __mutex_lock_common+0x3a/0x4cb
[ 1093.680031]  [<c0205942>] ? inotify_inode_is_dead+0x20/0x80
[ 1093.680031]  [<c08829ed>] mutex_lock_nested+0x2e/0x36
[ 1093.680031]  [<c0205942>] ? inotify_inode_is_dead+0x20/0x80
[ 1093.680031]  [<c0205942>] inotify_inode_is_dead+0x20/0x80
[ 1093.680031]  [<c01e6672>] dentry_iput+0x90/0xc2
[ 1093.680031]  [<c01e67a3>] d_kill+0x21/0x45
[ 1093.680031]  [<c01e6a46>] __shrink_dcache_sb+0x27f/0x355
[ 1093.680031]  [<c01e6dc5>] shrink_dcache_memory+0x15e/0x1fb
[ 1093.680031]  [<c01b05ed>] shrink_slab+0x121/0x189
[ 1093.680031]  [<c01b0d12>] kswapd+0x39f/0x561
[ 1093.680031]  [<c01ae499>] ? isolate_pages_global+0x0/0x233
[ 1093.680031]  [<c0157eae>] ? autoremove_wake_function+0x0/0x43
[ 1093.680031]  [<c01b0973>] ? kswapd+0x0/0x561
[ 1093.680031]  [<c0157daf>] kthread+0x41/0x82
[ 1093.680031]  [<c0157d6e>] ? kthread+0x0/0x82
[ 1093.680031]  [<c01043ab>] kernel_thread_helper+0x7/0x10

inotify_handle_get_wd() does idr_pre_get() which does a
kmem_cache_alloc() without __GFP_FS - and is hence deadlockable under
extreme MM pressure.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: MinChan Kim <minchan.kim@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:56 -08:00
Bill Nottingham 2db69a9340 vt: Declare PIO_CMAP/GIO_CMAP as compatbile ioctls.
Otherwise, these don't work when called from 32-bit userspace on 64-bit
kernels.

Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:56 -08:00
Peter Zijlstra ada723dcd6 fs/super.c: add lockdep annotation to s_umount
Li Zefan said:

Thread 1:
  for ((; ;))
  {
      mount -t cpuset xxx /mnt > /dev/null 2>&1
      cat /mnt/cpus > /dev/null 2>&1
      umount /mnt > /dev/null 2>&1
  }

Thread 2:
  for ((; ;))
  {
      mount -t cpuset xxx /mnt > /dev/null 2>&1
      umount /mnt > /dev/null 2>&1
  }

(Note: It is irrelevant which cgroup subsys is used.)

After a while a lockdep warning showed up:

=============================================
[ INFO: possible recursive locking detected ]
2.6.28 #479
---------------------------------------------
mount/13554 is trying to acquire lock:
 (&type->s_umount_key#19){--..}, at: [<c049d888>] sget+0x5e/0x321

but task is already holding lock:
 (&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321

other info that might help us debug this:
1 lock held by mount/13554:
 #0:  (&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321

stack backtrace:
Pid: 13554, comm: mount Not tainted 2.6.28-mc #479
Call Trace:
 [<c044ad2e>] validate_chain+0x4c6/0xbbd
 [<c044ba9b>] __lock_acquire+0x676/0x700
 [<c044bb82>] lock_acquire+0x5d/0x7a
 [<c049d888>] ? sget+0x5e/0x321
 [<c061b9b8>] down_write+0x34/0x50
 [<c049d888>] ? sget+0x5e/0x321
 [<c049d888>] sget+0x5e/0x321
 [<c045a2e7>] ? cgroup_set_super+0x0/0x3e
 [<c045959f>] ? cgroup_test_super+0x0/0x2f
 [<c045bcea>] cgroup_get_sb+0x98/0x2e7
 [<c045cfb6>] cpuset_get_sb+0x4a/0x5f
 [<c049dfa4>] vfs_kern_mount+0x40/0x7b
 [<c049e02d>] do_kern_mount+0x37/0xbf
 [<c04af4a0>] do_mount+0x5c3/0x61a
 [<c04addd2>] ? copy_mount_options+0x2c/0x111
 [<c04af560>] sys_mount+0x69/0xa0
 [<c0403251>] sysenter_do_call+0x12/0x31

The cause is after alloc_super() and then retry, an old entry in list
fs_supers is found, so grab_super(old) is called, but both functions hold
s_umount lock:

struct super_block *sget(...)
{
	...
retry:
	spin_lock(&sb_lock);
	if (test) {
		list_for_each_entry(old, &type->fs_supers, s_instances) {
			if (!test(old, data))
				continue;
			if (!grab_super(old))  <--- 2nd: down_write(&old->s_umount);
				goto retry;
			if (s)
				destroy_super(s);
			return old;
		}
	}
	if (!s) {
		spin_unlock(&sb_lock);
		s = alloc_super(type);   <--- 1th: down_write(&s->s_umount)
		if (!s)
			return ERR_PTR(-ENOMEM);
		goto retry;
	}
	...
}

It seems like a false positive, and seems like VFS but not cgroup needs to
be fixed.

Peter said:

We can simply put the new s_umount instance in a but lockdep doesn't
particularly cares about subclass order.

If there's any issue with the callers of sget() assuming the s_umount lock
being of sublcass 0, then there is another annotation we can use to fix
that, but lets not bother with that if this is sufficient.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12673

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Menage <menage@google.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:55 -08:00
Nick Piggin 1cf6e7d83b mm: task dirty accounting fix
YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).

Additionally, there is some inconsistency about when task_dirty_inc is
called.  It is used for dirty balancing, however it even gets called for
__set_page_dirty_no_writeback.

So rather than increment it in a set_page_dirty wrapper, move it down to
exactly where the dirty page accounting stats are incremented.

Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:54 -08:00
Davide Libenzi 610d18f412 timerfd: add flags check
As requested by Michael, add a missing check for valid flags in
timerfd_settime(), and make it return EINVAL in case some extra bits are
set.

Michael said:
If this is to be any use to userland apps that want to check flag
support (perhaps it is too late already), then the sooner we get it
into the kernel the better: 2.6.29 would be good; earlier stables as
well would be even better.

[akpm@linux-foundation.org: remove unused TFD_FLAGS_SET]
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>		[2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:53 -08:00
Eric Biederman 8f19d47293 seq_file: properly cope with pread
Currently seq_read assumes that the offset passed to it is always the
offset it passed to user space.  In the case pread this assumption is
broken and we do the wrong thing when presented with pread.

To solve this I introduce an offset cache inside of struct seq_file so we
know where our logical file position is.  Then in seq_read if we try to
read from another offset we reset our data structures and attempt to go to
the offset user space wanted.

[akpm@linux-foundation.org: restore FMODE_PWRITE]
[pjt@google.com: seq_open needs its fmode opened up to take advantage of this]
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Turner <pjt@google.com>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:53 -08:00
Jens Axboe 78f707bfc7 block: revert part of 18ce3751cc
The above commit added WRITE_SYNC and switched various places to using
that for committing writes that will be waited upon immediately after
submission. However, this causes a performance regression with AS and CFQ
for ext3 at least, since sync_dirty_buffer() will submit some writes with
WRITE_SYNC while ext3 has sumitted others dependent writes without the sync
flag set. This causes excessive anticipation/idling in the IO scheduler
because sync and async writes get interleaved, causing a big performance
regression for the below test case (which is meant to simulate sqlite
like behaviour).

---- test case ----

int main(int argc, char **argv)
{

	int fdes, i;
	FILE *fp;
	struct timeval start;
	struct timeval end;
	struct timeval res;

	gettimeofday(&start, NULL);
	for (i=0; i<ROWS; i++) {
		fp = fopen("test_file", "a");
		fprintf(fp, "Some Text Data\n");
		fdes = fileno(fp);
		fsync(fdes);
		fclose(fp);
	}
	gettimeofday(&end, NULL);

	timersub(&end, &start, &res);
	fprintf(stdout, "time to write %d lines is %ld(msec)\n", ROWS,
			(res.tv_sec*1000000 + res.tv_usec)/1000);

	return 0;
}

-------------------

Thanks to Sean.White@APCC.com for tracking down this performance
regression and providing a test case.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-18 10:32:01 +01:00
Subhash Peddamallu a60e78e57a fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
When freeing from bio pool use right ptr to account for bs->front_pad,
instead of bio ptr,

Signed-off-by: Subhash Peddamallu <subhash.peddamallu@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-18 10:32:01 +01:00
Linus Torvalds 48c0d9ece3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: hold trans_mutex when using btrfs_record_root_in_trans
  Btrfs: make a lockdep class for the extent buffer locks
  Btrfs: fs/btrfs/volumes.c: remove useless kzalloc
  Btrfs: remove unused code in split_state()
  Btrfs: remove btrfs_init_path
  Btrfs: balance_level checks !child after access
  Btrfs: Avoid using __GFP_HIGHMEM with slab allocator
  Btrfs: don't clean old snapshots on sync(1)
  Btrfs: use larger metadata clusters in ssd mode
  Btrfs: process mount options on mount -o remount,
  Btrfs: make sure all pending extent operations are complete
2009-02-17 14:19:14 -08:00
Linus Torvalds 3512a79dbc Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling
  ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages
  ext4: Initialize preallocation list_head's properly
  ext4: Fix lockdep warning
  ext4: Fix to read empty directory blocks correctly in 64k
  jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()
  Revert "ext4: wait on all pending commits in ext4_sync_fs()"
  jbd2: Fix return value of jbd2_journal_start_commit()
2009-02-17 14:05:05 -08:00
Al Viro 1a88b5364b Fix incomplete __mntput locking
Getting this wrong caused

	WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()

due to optimistically checking cpu_writer->mnt outside the spinlock.

Here's what we really want:
 * we know that nobody will set cpu_writer->mnt to mnt from now on
 * all changes to that sucker are done under cpu_writer->lock
 * we want the laziest equivalent of
	spin_lock(&cpu_writer->lock);
	if (likely(cpu_writer->mnt != mnt)) {
		spin_unlock(&cpu_writer->lock);
		continue;
	}
	/* do stuff */
  that would make sure we won't miss earlier setting of ->mnt done by
  another CPU.

Anyway, for now we just move the spin_lock() earlier and move the test
into the properly locked region.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-and-tested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-17 14:02:08 -08:00
Patrick Ohly d24fff22d8 net: pass new SIOCSHWTSTAMP through to device drivers
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-15 22:43:38 -08:00
Dan Carpenter 090542641d ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling
This was found through a code checker (http://repo.or.cz/w/smatch.git/). 
It looks like you might be able to trigger the error by trying to migrate 
a readonly file system.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-15 20:02:19 -05:00
Duane Griffin 2dc6b0d48c ext4: tighten restrictions on inode flags
At the moment there are few restrictions on which flags may be set on
which inodes.  Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links.  Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.

Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-15 18:09:20 -05:00
Duane Griffin 8fa43a81b9 ext4: don't inherit inappropriate inode flags from parent
At present INDEX and EXTENTS are the only flags that new ext4 inodes do
NOT inherit from their parent.  In addition prevent the flags DIRTY,
ECOMPR, IMAGIC, TOPDIR, HUGE_FILE and EXT_MIGRATE from being inherited. 
List inheritable flags explicitly to prevent future flags from
accidentally being inherited.

This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-15 18:57:26 -05:00
Pekka Enberg 705895b611 ext4: allocate ->s_blockgroup_lock separately
As spotted by kmemtrace, struct ext4_sb_info is 17664 bytes on 64-bit
which makes it a very bad fit for SLAB allocators.  The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.

To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately.  This shinks down struct ext4_sb_info
enough to fit a 2 KB slab cache so now we allocate 16 KB + 2 KB instead of
32 KB saving 14 KB of memory.

Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-15 18:07:52 -05:00
David S. Miller 5e30589521 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn.c
	drivers/net/wireless/iwlwifi/iwl3945-base.c
2009-02-14 23:12:00 -08:00
Wei Yongjun 3d0518f475 ext4: New rec_len encoding for very large blocksizes
The rec_len field in the directory entry is 16 bits, so to encode
blocksizes larger than 64k becomes problematic.  This patch allows us
to supprot block sizes up to 256k, by using the low 2 bits to extend
the range of rec_len to 2**18-1 (since valid rec_len sizes must be a
multiple of 4).  We use the convention that a rec_len of 0 or 65535
means the filesystem block size, for compatibility with older kernels.

It's unlikely we'll see VM pages of up to 256k, but at some point we
might find that the Linux VM has been enhanced to support filesystem
block sizes > than the VM page size, at which point it might be useful
for some applications to allow very large filesystem block sizes.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-14 23:01:36 -05:00
Theodore Ts'o 8bad4597c2 ext4: Use unsigned int for blocksize in dx_make_map() and dx_pack_dirents()
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-14 21:46:54 -05:00
Aneesh Kumar K.V 2acf2c261b ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages
With delayed allocation we lock the page in write_cache_pages() and
try to build an in memory extent of contiguous blocks.  This is needed
so that we can get large contiguous blocks request.  If range_cyclic
mode is enabled, write_cache_pages() will loop back to the 0 index if
no I/O has been done yet, and try to start writing from the beginning
of the range.  That causes an attempt to take the page lock of lower
index page while holding the page lock of higher index page, which can
cause a dead lock with another writeback thread.

The solution is to implement the range_cyclic behavior in
ext4_da_writepages() instead.

http://bugzilla.kernel.org/show_bug.cgi?id=12579

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-14 10:42:58 -05:00
Aneesh Kumar K.V d794bf8e09 ext4: Initialize preallocation list_head's properly
When creating a new ext4_prealloc_space structure, we have to
initialize its list_head pointers before we add them to any prealloc
lists.  Otherwise, with list debug enabled, we will get list
corruption warnings.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-14 10:31:16 -05:00
Andres Salomon efab0b5d3e [JFFS2] force the jffs2 GC daemon to behave a bit better
I've noticed some pretty poor behavior on OLPC machines after bootup, when
gdm/X are starting.  The GCD monopolizes the scheduler (which in turns
means it gets to do more nand i/o), which results in processes taking much
much longer than they should to start.

As an example, on an OLPC machine going from OFW to a usable X (via
auto-login gdm) takes 2m 30s.  The majority of this time is consumed by
the switch into graphical mode.  With this patch, we cut a full 60s off of
bootup time.  After bootup, things are much snappier as well.

Note that we have seen a CRC node error with this patch that causes the machine
to fail to boot, but we've also seen that problem without this patch.

Signed-off-by: Andres Salomon <dilinger@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-02-14 08:59:04 +00:00
Yan Zheng 2456242530 Btrfs: hold trans_mutex when using btrfs_record_root_in_trans
btrfs_record_root_in_trans needs the trans_mutex held to make sure two
callers don't race to setup the root in a given transaction.  This adds
it to all the places that were missing it.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-02-12 14:14:53 -05:00
Chris Mason 4008c04a07 Btrfs: make a lockdep class for the extent buffer locks
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block.  But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.

The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.

btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
gets upset about bad lock orderin.

The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:09:45 -05:00
Julia Lawall 3f3420df50 Btrfs: fs/btrfs/volumes.c: remove useless kzalloc
The call to kzalloc is followed by a kmalloc whose result is stored in the
same variable.

The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,l;
position p1,p2;
expression *ptr != NULL;
@@

(
if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
|
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
)
<... when != x
     when != if (...) { <+...x...+> }
x->f = E
...>
(
 return \(0\|<+...x...+>\|ptr\);
|
 return@p2 ...;
)

@script:python@
p1 << r.p1;
p2 << r.p2;
@@

print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 10:16:03 -05:00
Qinghuang Feng a48ddf08ba Btrfs: remove unused code in split_state()
These two lines are not used, remove them.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:25:23 -05:00
Jeff Mahoney e00f730865 Btrfs: remove btrfs_init_path
btrfs_init_path was initially used when the path objects were on the
stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.

This patch removes it, and just uses kmem_cache_zalloc to zero out the object.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:11:25 -05:00
Jeff Mahoney 7951f3cefb Btrfs: balance_level checks !child after access
The BUG_ON() is in the wrong spot.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 10:06:15 -05:00
Yan Zheng b335b0034e Btrfs: Avoid using __GFP_HIGHMEM with slab allocator
btrfs_releasepage may call kmem_cache_alloc indirectly,
and provide same GFP flags it gets to kmem_cache_alloc.
So it's possible to use __GFP_HIGHMEM with the slab
allocator.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-02-12 10:06:04 -05:00
Chris Mason e1df36d2f1 Btrfs: don't clean old snapshots on sync(1)
Cleaning old snapshots can make sync(1) somewhat slow, and some users
and applications still use it in a global fsync kind of workload.

This patch changes btrfs not to clean old snapshots during sync, which is
safe from a FS consistency point of view.  The major downside is that it
makes it difficult to tell when old snapshots have been reaped and
the space they were using has been reclaimed.  A new ioctl will be added
for this purpose instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:45:08 -05:00
Chris Mason 536ac8ae86 Btrfs: use larger metadata clusters in ssd mode
Larger metadata clusters can significantly improve writeback performance
on ssd drives with large erasure blocks.  The larger clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle.

On spinning media, lager metadata clusters end up spreading out the
metadata more over time, which makes fsck slower, so we don't want this
to be the default.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:41:38 -05:00
Chris Mason b288052e17 Btrfs: process mount options on mount -o remount,
Btrfs wasn't parsing any new mount options during remount, making it
difficult to set mount options on a root drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:37:35 -05:00
Josef Bacik eb09967089 Btrfs: make sure all pending extent operations are complete
Theres a slight problem with finish_current_insert, if we set all to 1 and then
go through and don't actually skip any of the extents on the pending list, we
could exit right after we've added new extents.

This is a problem because by inserting the new extents we could have gotten new
COW's to happen and such, so we may have some pending updates to do or even
more inserts to do after that.

So this patch will only exit if we have never skipped any of the extents in the
pending list, and we have no extents to insert, this will make sure that all of
the pending work is truly done before we return.  I've been running with this
patch for a few days with all of my other testing and have not seen issues.
Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-12 09:27:38 -05:00
Kentaro Takeda f9ce1f1cda Add in_execve flag into task_struct.
This patch allows LSM modules to determine whether current process is in an
execve operation or not so that they can behave differently while an execve
operation is in progress.

This patch is needed by TOMOYO. Please see another patch titled "LSM adapter
functions." for backgrounds.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-12 15:15:03 +11:00
Carsten Otte 0e4a9b5928 ext2/xip: refuse to change xip flag during remount with busy inodes
For a reason that I was unable to understand in three months of debugging,
mount ext2 -o remount stopped working properly when remounting from
regular operation to xip, or the other way around.  According to a git
bisect search, the problem was introduced with the VM_MIXEDMAP/PTE_SPECIAL
rework in the vm:

commit 70688e4dd1
Author: Nick Piggin <npiggin@suse.de>
Date:   Mon Apr 28 02:13:02 2008 -0700

    xip: support non-struct page backed memory

In the failing scenario, the filesystem is mounted read only via root=
kernel parameter on s390x.  During remount (in rc.sysinit), the inodes of
the bash binary and its libraries are busy and cannot be invalidated (the
bash which is running rc.sysinit resides on subject filesystem).
Afterwards, another bash process (running ifup-eth) recurses into a
subshell, runs dup_mm (via fork).  Some of the mappings in this bash
process were created from inodes that could not be invalidated during
remount.

Both parent and child process crash some time later due to inconsistencies
in their address spaces.  The issue seems to be timing sensitive, various
attempts to recreate it have failed.

This patch refuses to change the xip flag during remount in case some
inodes cannot be invalidated.  This patch keeps users from running into
that issue.

[akpm@linux-foundation.org: cleanup]
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:36 -08:00
Jan Kara 02ac597c9b ext3: revert "ext3: wait on all pending commits in ext3_sync_fs"
This reverts commit c87591b719.

Since journal_start_commit() is now fixed to return 1 when we started a
transaction commit, there's some transaction waiting to be committed or
there's a transaction already committing, we don't need to call
ext3_force_commit() in ext3_sync_fs().  Furthermore ext3_force_commit()
can unnecessarily create sync transaction which is expensive so it's
worthwhile to remove it when we can.

Cc: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:35 -08:00
Jan Kara 8fe4cd0dc5 jbd: fix return value of journal_start_commit()
journal_start_commit() returns 1 if either a transaction is committing or
the function has queued a transaction commit.  But it returns 0 if we
raced with somebody queueing the transaction commit as well.  This
resulted in ext3_sync_fs() not functioning correctly (description from
Arthur Jones): In the case of a data=ordered umount with pending long
symlinks which are delayed due to a long list of other I/O on the backing
block device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of fsync_super.
Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT
flag and the dirty pages are never written to the backing block device,
causing long symlink corruption and exposing new or previously freed block
data to userspace.

This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:

        #!/bin/bash

        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        rm -f /mnt/test2/*
        dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
        touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        /mnt/test2/link
        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        ls /mnt/test2/

This patch fixes journal_start_commit() to always return 1 when there's
a transaction committing or queued for commit.

Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mike Snitzer <snitzer@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:35 -08:00
Mel Gorman 5a6fe12595 Do not account for the address space used by hugetlbfs using VM_ACCOUNT
When overcommit is disabled, the core VM accounts for pages used by anonymous
shared, private mappings and special mappings. It keeps track of VMAs that
should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
with VM_NORESERVE.

Overcommit for hugetlbfs is much riskier than overcommit for base pages
due to contiguity requirements. It avoids overcommiting on both shared and
private mappings using reservation counters that are checked and updated
during mmap(). This ensures (within limits) that hugepages exist in the
future when faults occurs or it is too easy to applications to be SIGKILLed.

As hugetlbfs makes its own reservations of a different unit to the base page
size, VM_ACCOUNT should never be set. Even if the units were correct, we would
double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
be set because an application can request no reserves be made for hugetlbfs
at the risk of getting killed later.

With commit fc8744adc8, VM_NORESERVE and
VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
breaks the accounting for both the core VM and hugetlbfs, can trigger an
OOM storm when hugepage pools are too small lockups and corrupted counters
otherwise are used. This patch brings hugetlbfs more in line with how the
core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-10 10:48:42 -08:00
Aneesh Kumar K.V ba4439165f ext4: Fix lockdep warning
We should not call ext4_mb_add_n_trim while holding alloc_semp.

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.29-rc4-git1-dirty #124
    ---------------------------------------------
    ffsb/3116 is trying to acquire lock:
     (&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>]
     ext4_mb_load_buddy+0xd2/0x343

    but task is already holding lock:
     (&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>]
     ext4_mb_load_buddy+0xd2/0x343

http://bugzilla.kernel.org/show_bug.cgi?id=12672

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-10 11:14:34 -05:00
Wei Yongjun 7be2baaa03 ext4: Fix to read empty directory blocks correctly in 64k
The rec_len field in the directory entry is 16 bits, so there was a
problem representing rec_len for filesystems with a 64k block size in
the case where the directory entry takes the entire 64k block.
Unfortunately, there were two schemes that were proposed; one where
all zeros meant 65536 and one where all ones (65535) meant 65536.
E2fsprogs used 0, whereas the kernel used 65535.  Oops.  Fortunately
this case happens extremely rarely, with the most common case being
the lost+found directory, created by mke2fs.

So we will be liberal in what we accept, and accept both encodings,
but we will continue to encode 65536 as 65535.  This will require a
change in e2fsprogs, but with fortunately ext4 filesystems normally
have the dir_index feature enabled, which precludes having a
completely empty directory block.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-10 09:53:42 -05:00
Jan Kara 7f5aa21508 jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()
If we race with commit code setting i_transaction to NULL, we could
possibly dereference it.  Proper locking requires the journal pointer
(to access journal->j_list_lock), which we don't have.  So we have to
change the prototype of the function so that filesystem passes us the
journal pointer.  Also add a more detailed comment about why the
function jbd2_journal_begin_ordered_truncate() does what it does and
how it should be used.

Thanks to Dan Carpenter <error27@gmail.com> for pointing to the
suspitious code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Joel Becker <joel.becker@oracle.com>
CC: linux-ext4@vger.kernel.org
CC: ocfs2-devel@oss.oracle.com
CC: mfasheh@suse.de
CC: Dan Carpenter <error27@gmail.com>
2009-02-10 11:15:34 -05:00
Jan Kara 9eddacf9e9 Revert "ext4: wait on all pending commits in ext4_sync_fs()"
This undoes commit 14ce0cb411.

Since jbd2_journal_start_commit() is now fixed to return 1 when we
started a transaction commit, there's some transaction waiting to be
committed or there's a transaction already committing, we don't
need to call ext4_force_commit() in ext4_sync_fs(). Furthermore
ext4_force_commit() can unnecessarily create sync transaction which is
expensive so it's worthwhile to remove it when we can.

http://bugzilla.kernel.org/show_bug.cgi?id=12224

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: linux-ext4@vger.kernel.org
2009-02-10 06:46:05 -05:00
Jan Kara c88ccea314 jbd2: Fix return value of jbd2_journal_start_commit()
The function jbd2_journal_start_commit() returns 1 if either a
transaction is committing or the function has queued a transaction
commit. But it returns 0 if we raced with somebody queueing the
transaction commit as well. This resulted in ext4_sync_fs() not
functioning correctly (description from Arthur Jones): 

   In the case of a data=ordered umount with pending long symlinks
   which are delayed due to a long list of other I/O on the backing
   block device, this causes the buffer associated with the long
   symlinks to not be moved to the inode dirty list in the second
   phase of fsync_super.  Then, before they can be dirtied again,
   kjournald exits, seeing the UMOUNT flag and the dirty pages are
   never written to the backing block device, causing long symlink
   corruption and exposing new or previously freed block data to
   userspace.

This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:

        #!/bin/bash

        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        rm -f /mnt/test2/*
        dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
        touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        /mnt/test2/link
        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        ls /mnt/test2/

This patch fixes jbd2_journal_start_commit() to always return 1 when
there's a transaction committing or queued for commit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
CC: Eric Sandeen <sandeen@redhat.com>
CC: linux-ext4@vger.kernel.org
2009-02-10 11:27:46 -05:00
Linus Torvalds 4c098bcd55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: don't use spin_is_contended
2009-02-09 14:00:16 -08:00
Chris Mason 284b066af4 Btrfs: don't use spin_is_contended
Btrfs was using spin_is_contended to see if it should drop locks before
doing extent allocations during btrfs_search_slot.  The idea was to avoid
expensive searches in the tree unless the lock was actually contended.

But, spin_is_contended is specific to the ticket spinlocks on x86, so this
is causing compile errors everywhere else.

In practice, the contention could easily appear some time after we started
doing the extent allocation, and it makes more sense to always drop the lock
instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-09 16:22:03 -05:00
Linus Torvalds 896abeb743 Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.29' of git://linux-nfs.org/~bfields/linux:
  lockd: fix regression in lockd's handling of blocked locks
2009-02-09 10:30:19 -08:00
J. Bruce Fields 9d9b87c121 lockd: fix regression in lockd's handling of blocked locks
If a client requests a blocking lock, is denied, then requests it again,
then here in nlmsvc_lock() we will call vfs_lock_file() without FL_SLEEP
set, because we've already queued a block and don't need the locks code
to do it again.

But that means vfs_lock_file() will return -EAGAIN instead of
FILE_LOCK_DENIED.  So we still need to translate that -EAGAIN return
into a nlm_lck_blocked error in this case, and put ourselves back on
lockd's block list.

The bug was introduced by bde74e4bc6 "locks: add special return
value for asynchronous locks".

Thanks to Frank van Maarseveen for the report; his original test
case was essentially

	for i in `seq 30`; do flock /nfsmount/foo sleep 10 & done

Tested-by: Frank van Maarseveen <frankvm@frankvm.com>
Reported-by: Frank van Maarseveen <frankvm@frankvm.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-02-09 13:19:46 -05:00
Cornelia Huck 766ccb9ed4 async: Rename _special -> _domain for clarity.
Rename the async_*_special() functions to async_*_domain(), which
describes the purpose of these functions much better.
[Broke up long lines to silence checkpatch]

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-08 09:56:11 -08:00
Linus Torvalds ccfef64621 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  CRED: Fix SUID exec regression
2009-02-06 18:52:55 -08:00
Linus Torvalds ae1a25da84 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (37 commits)
  Btrfs: Make sure dir is non-null before doing S_ISGID checks
  Btrfs: Fix memory leak in cache_drop_leaf_ref
  Btrfs: don't return congestion in write_cache_pages as often
  Btrfs: Only prep for btree deletion balances when nodes are mostly empty
  Btrfs: fix btrfs_unlock_up_safe to walk the entire path
  Btrfs: change btrfs_del_leaf to drop locks earlier
  Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode
  Btrfs: Don't try to compress pages past i_size
  Btrfs: join the transaction in __btrfs_setxattr
  Btrfs: Handle SGID bit when creating inodes
  Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks
  Btrfs: Change btree locking to use explicit blocking points
  Btrfs: hash_lock is no longer needed
  Btrfs: disable leak debugging checks in extent_io.c
  Btrfs: sort references by byte number during btrfs_inc_ref
  Btrfs: async threads should try harder to find work
  Btrfs: selinux support
  Btrfs: make btrfs acls selectable
  Btrfs: Catch missed bios in the async bio submission thread
  Btrfs: fix readdir on 32 bit machines
  ...
2009-02-06 18:37:22 -08:00
Tyler Hicks fd9fc842bb eCryptfs: Regression in unencrypted filename symlinks
The addition of filename encryption caused a regression in unencrypted
filename symlink support.  ecryptfs_copy_filename() is used when dealing
with unencrypted filenames and it reported that the new, copied filename
was a character longer than it should have been.

This caused the return value of readlink() to count the NULL byte of the
symlink target.  Most applications don't care about the extra NULL byte,
but a version control system (bzr) helped in discovering the bug.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-06 18:36:40 -08:00
Linus Torvalds 1d87b0d388 Merge branch 'to-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland
* 'to-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland:
  elf core dump: fix get_user use
2009-02-06 18:10:04 -08:00
Roland McGrath 92dc07b1f9 elf core dump: fix get_user use
The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force,
so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely
use get_user() for a normal user-space address.

Checking for VM_READ optimizes out the case where get_user() would fail
anyway.  The vm_file check here was already superfluous given the control
flow earlier in the function, so that is a cleanup/optimization unrelated
to other changes but an obvious and trivial one.

Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
2009-02-06 17:34:07 -08:00
David Howells 0bf2f3aec5 CRED: Fix SUID exec regression
The patch:

	commit a6f76f23d2
	CRED: Make execve() take advantage of copy-on-write credentials

moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called.  This means that LSM_UNSAFE_SHARE is now
calculated incorrectly.  This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made.  All of which are true for threads created by the
pthread library.

However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().

So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us.  These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().

We do have to be careful because CLONE_THREAD does not imply FS or FILES.

We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.

This can be tested with the attached pair of programs.  Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user.  If
successful, you should see something like:

	[dhowells@andromeda tmp]$ ./test1
	--TEST1--
	uid=4043, euid=4043 suid=4043
	exec ./test2
	--TEST2--
	uid=4043, euid=0 suid=0
	SUCCESS - Correct effective user ID

and if unsuccessful, something like:

	[dhowells@andromeda tmp]$ ./test1
	--TEST1--
	uid=4043, euid=4043 suid=4043
	exec ./test2
	--TEST2--
	uid=4043, euid=4043 suid=4043
	ERROR - Incorrect effective user ID!

The non-root user ID you see will depend on the user you run as.

[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

static void *thread_func(void *arg)
{
	while (1) {}
}

int main(int argc, char **argv)
{
	pthread_t tid;
	uid_t uid, euid, suid;

	printf("--TEST1--\n");
	getresuid(&uid, &euid, &suid);
	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

	if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
		perror("pthread_create");
		exit(1);
	}

	printf("exec ./test2\n");
	execlp("./test2", "test2", NULL);
	perror("./test2");
	_exit(1);
}

[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	uid_t uid, euid, suid;

	getresuid(&uid, &euid, &suid);
	printf("--TEST2--\n");
	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

	if (euid != 0) {
		fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
		exit(1);
	}
	printf("SUCCESS - Correct effective user ID\n");
	exit(0);
}

[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2

test1: test1.c
	gcc $(CFLAGS) -o test1 test1.c -lpthread

test2: test2.c
	gcc $(CFLAGS) -o test2 test2.c
	sudo chown root.root test2
	sudo chmod +s test2

Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-07 08:46:18 +11:00
Dave Kleikamp d4cf109f05 vfs: Don't call attach_nobh_buffers() with an empty list
This is a modification of a patch by Bill Pemberton <wfp5p@virginia.edu>

nobh_write_end() could call attach_nobh_buffers() with head == NULL.
This would result in a trap when attach_nobh_buffers() attempted to
access bh->b_this_page.

This can be illustrated by running the writev01 testcase from LTP on jfs.

This error was introduced by commit 5b41e74a "vfs: fix data leak in
nobh_write_end()".  That patch did not take into account that if
PageMappedToDisk() is true upon entry to nobh_write_begin(), then no
buffers will be allocated for the page.  In that case, we won't have to
worry about a failed write leaving unitialized data in the page.

Of course, head != NULL implies !page_has_buffers(page), so no need to
test both.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Dmitri Monakhov <dmonakhov@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-06 13:34:22 -08:00
Theodore Ts'o e187c6588d ext4: remove call to ext4_group_desc() in ext4_group_used_meta_blocks()
The static function ext4_group_used_meta_blocks() only has one caller,
who already has access to the block group's group descriptor.  So it's
better to have ext4_init_block_bitmap() pass the group descriptor to
ext4_group_used_meta_blocks(), so it doesn't need to call
ext4_group_desc().  Previously this function did not check if
ext4_group_desc() returned NULL due to an error, potentially causing a
kernel OOPS report.  This avoids the issue entirely.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-06 16:23:37 -05:00
Mike Snitzer 074ca44283 ext4: Remove stale block allocator references from ext4.h
Remove some leftovers from when the old block allocator was removed
(c2ea3fde).  ext4_sb_info is now a bit lighter.  Also remove a dangling
read_block_bitmap() prototype.

Signed-off-by: Mike Snitzer <snitzer@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-06 16:23:37 -05:00
Chris Mason 42f15d77df Btrfs: Make sure dir is non-null before doing S_ISGID checks
The S_ISGID check in btrfs_new_inode caused an oops during subvol creation
because sometimes the dir is null.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-06 11:35:57 -05:00
Pablo Neira Ayuso ff491a7334 netlink: change return-value logic of netlink_broadcast()
Currently, netlink_broadcast() reports errors to the caller if no
messages at all were delivered:

1) If, at least, one message has been delivered correctly, returns 0.
2) Otherwise, if no messages at all were delivered due to skb_clone()
   failure, return -ENOBUFS.
3) Otherwise, if there are no listeners, return -ESRCH.

With this patch, the caller knows if the delivery of any of the
messages to the listeners have failed:

1) If it fails to deliver any message (for whatever reason), return
   -ENOBUFS.
2) Otherwise, if all messages were delivered OK, returns 0.
3) Otherwise, if no listeners, return -ESRCH.

In the current ctnetlink code and in Netfilter in general, we can add
reliable logging and connection tracking event delivery by dropping the
packets whose events were not successfully delivered over Netlink. Of
course, this option would be settable via /proc as this approach reduces
performance (in terms of filtered connections per seconds by a stateful
firewall) but providing reliable logging and event delivery (for
conntrackd) in return.

This patch also changes some clients of netlink_broadcast() that
may report ENOBUFS errors via printk. This error handling is not
of any help. Instead, the userspace daemons that are listening to
those netlink messages should resync themselves with the kernel-side
if they hit ENOBUFS.

BTW, netlink_broadcast() clients include those that call
cn_netlink_send(), nlmsg_multicast() and genlmsg_multicast() since they
internally call netlink_broadcast() and return its error value.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-05 23:56:36 -08:00
Herbert Xu 33dccbb050 tun: Limit amount of queued packets per device
Unlike a normal socket path, the tuntap device send path does
not have any accounting.  This means that the user-space sender
may be able to pin down arbitrary amounts of kernel memory by
continuing to send data to an end-point that is congested.

Even when this isn't an issue because of limited queueing at
most end points, this can also be a problem because its only
response to congestion is packet loss.  That is, when those
local queues at the end-point fills up, the tuntap device will
start wasting system time because it will continue to send
data there which simply gets dropped straight away.

Of course one could argue that everybody should do congestion
control end-to-end, unfortunately there are people in this world
still hooked on UDP, and they don't appear to be going away
anywhere fast.  In fact, we've always helped them by performing
accounting in our UDP code, the sole purpose of which is to
provide congestion feedback other than through packet loss.

This patch attempts to apply the same bandaid to the tuntap device.
It creates a pseudo-socket object which is used to account our
packets just as a normal socket does for UDP.  Of course things
are a little complex because we're actually reinjecting traffic
back into the stack rather than out of the stack.

The stack complexities however should have been resolved by preceding
patches.  So this one can simply start using skb_set_owner_w.

For now the accounting is essentially disabled by default for
backwards compatibility.  In particular, we set the cap to INT_MAX.
This is so that existing applications don't get confused by the
sudden arrival EAGAIN errors.

In future we may wish (or be forced to) do this by default.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-05 21:25:32 -08:00
Al Viro 767b5828ad braino in sg_ioctl_trans()
... and yes, gcc is insane enough to eat that without complaint.
We probably want sparse to scream on those...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 16:35:52 -08:00
Linus Torvalds 082256333f Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  Revert "configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()"
2009-02-05 16:12:38 -08:00
James Morris cb5629b10d Merge branch 'master' into next
Conflicts:
	fs/namei.c

Manually merged per:

diff --cc fs/namei.c
index 734f2b5,bbc15c2..0000000
--- a/fs/namei.c
+++ b/fs/namei.c
@@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char
  		nd->flags |= LOOKUP_CONTINUE;
  		err = exec_permission_lite(inode);
  		if (err == -EAGAIN)
- 			err = vfs_permission(nd, MAY_EXEC);
+ 			err = inode_permission(nd->path.dentry->d_inode,
+ 					       MAY_EXEC);
 +		if (!err)
 +			err = ima_path_check(&nd->path, MAY_EXEC);
   		if (err)
  			break;

@@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path *path, int acc
  		flag &= ~O_TRUNC;
  	}

- 	error = vfs_permission(nd, acc_mode);
+ 	error = inode_permission(inode, acc_mode);
  	if (error)
  		return error;
 +
- 	error = ima_path_check(&nd->path,
++	error = ima_path_check(path,
 +			       acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC));
 +	if (error)
 +		return error;
  	/*
  	 * An append-only file must be opened in append mode for writing.
  	 */

Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06 11:01:45 +11:00
Alexey Dobriyan f01d1d546a seq_file: fix big-enough lseek() + read()
lseek() further than length of the file will leave stale ->index
(second-to-last during iteration). Next seq_read() will not notice
that ->f_pos is big enough to return 0, but will print last item
as if ->f_pos is pointing to it.

Introduced in commit cb510b8172
aka "seq_file: more atomicity in traverse()".

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 14:18:14 -08:00
Mimi Zohar 6146f0d5e4 integrity: IMA hooks
This patch replaces the generic integrity hooks, for which IMA registered
itself, with IMA integrity hooks in the appropriate places directly
in the fs directory.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-06 09:05:30 +11:00
Eric Biederman 33da8892a2 seq_file: move traverse so it can be used from seq_read
In 2.6.25 some /proc files were converted to use the seq_file
infrastructure.  But seq_files do not correctly support pread(), which
broke some usersapce applications.

To handle pread correctly we can't assume that f_pos is where we left it
in seq_read.  So move traverse() so that we can eventually use it in
seq_read and do thus some day support pread().

Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Cc: Paul Turner <pjt@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 12:56:49 -08:00
Chris Mason 806638bce9 Btrfs: Fix memory leak in cache_drop_leaf_ref
The code wasn't doing a kfree on the sorted array

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-05 09:08:14 -05:00
Mark Fasheh 436443f0f7 Revert "configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()"
This reverts commit 0e0333429a.

I committed this by accident - Joel and Louis are working with the lockdep
maintainer to provide a better solution than just turning lockdep off.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: <Joel Becker <joel.becker@oracle.com>
2009-02-04 09:46:25 -08:00
Chris Mason 9b0d3ace33 Btrfs: don't return congestion in write_cache_pages as often
On fast devices that go from congested to uncongested very quickly, pdflush
is waiting too often in congestion_wait, and the FS is backing off to
easily in write_cache_pages.

For now, fix this on the btrfs side by only checking congestion after
some bios have already gone down.  Longer term a real fix is needed
for pdflush, but that is a larger project.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:33:00 -05:00
Chris Mason 7b78c170dc Btrfs: Only prep for btree deletion balances when nodes are mostly empty
Whenever an item deletion is done, we need to balance all the nodes
in the tree to make sure we don't end up with an empty node if a pointer
is deleted.  This balance prep happens from the root of the tree down
so we can drop our locks as we go.

reada_for_balance was triggering read-ahead on neighboring nodes even
when no balancing was required.  This adds an extra check to avoid
calling balance_level() and avoid reada_for_balance() when a balance
won't be required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:12:46 -05:00
Chris Mason 12f4daccfc Btrfs: fix btrfs_unlock_up_safe to walk the entire path
btrfs_unlock_up_safe would break out at the first NULL node entry or
unlocked node it found in the path.

Some of the callers have missing nodes at the lower levels of the path, so this
commit fixes things to check all the nodes in the path before returning.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:31:42 -05:00
Chris Mason 4d081c41a4 Btrfs: change btrfs_del_leaf to drop locks earlier
btrfs_del_leaf does two things.  First it removes the pointer in the
parent, and then it frees the block that has the leaf.  It has the
parent node locked for both operations.

But, it only needs the parent locked while it is deleting the pointer.
After that it can safely free the block without the parent locked.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:31:28 -05:00
Chris Mason 06d9a8d7c2 Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode
btrfs_truncate_inode_items is setup to stop doing btree searches when
it has finished removing the items for the inode.  It used to detect the
end of the inode by looking for an objectid that didn't match the
one we were searching for.

But, this would result in an extra search through the btree, which
adds extra balancing and cow costs to the operation.

This commit adds a check to see if we found the inode item, which means
we can stop searching early.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:30:58 -05:00
Chris Mason f03d9301f1 Btrfs: Don't try to compress pages past i_size
The compression code had some checks to make sure we were only
compressing bytes inside of i_size, but it wasn't catching every
case.  To make things worse, some incorrect math about the number
of bytes remaining would make it try to compress more pages than the
file really had.

The fix used here is to fall back to the non-compression code in this
case, which does all the proper cleanup of delalloc and other accounting.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:31:06 -05:00
Josef Bacik 811449496b Btrfs: join the transaction in __btrfs_setxattr
With selinux on we end up calling __btrfs_setxattr when we create an inode,
which calls btrfs_start_transaction().  The problem is we've already called
that in btrfs_new_inode, and in btrfs_start_transaction we end up doing a
wait_current_trans().  If btrfs-transaction has started committing it will wait
for all handles to finish, while the other process is waiting for the
transaction to commit.  This is fixed by using btrfs_join_transaction, which
won't wait for the transaction to commit.  Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-04 09:18:33 -05:00
Chris Ball 8c087b5183 Btrfs: Handle SGID bit when creating inodes
Before this patch, new files/dirs would ignore the SGID bit on their
parent directory and always be owned by the creating user's uid/gid.

Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:29:54 -05:00
Chris Mason bd56b30205 Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks
Every transaction in btrfs creates a new snapshot, and then schedules the
snapshot from the last transaction for deletion.  Snapshot deletion
works by walking down the btree and dropping the reference counts
on each btree block during the walk.

If if a given leaf or node has a reference count greater than one,
the reference count is decremented and the subtree pointed to by that
node is ignored.

If the reference count is one, walking continues down into that node
or leaf, and the references of everything it points to are decremented.

The old code would try to work in small pieces, walking down the tree
until it found the lowest leaf or node to free and then returning.  This
was very friendly to the rest of the FS because it didn't have a huge
impact on other operations.

But it wouldn't always keep up with the rate that new commits added new
snapshots for deletion, and it wasn't very optimal for the extent
allocation tree because it wasn't finding leaves that were close together
on disk and processing them at the same time.

This changes things to walk down to a level 1 node and then process it
in bulk.  All the leaf pointers are sorted and the leaves are dropped
in order based on their extent number.

The extent allocation tree and commit code are now fast enough for
this kind of bulk processing to work without slowing the rest of the FS
down.  Overall it does less IO and is better able to keep up with
snapshot deletions under high load.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:27:02 -05:00
Chris Mason b4ce94de9b Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.

So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.

This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.

We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.

The basic idea is:

btrfs_tree_lock() returns with the spin lock held

btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock.  The buffer is
still considered locked by all of the btrfs code.

If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.

Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time.  So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.

btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.

btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.

ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:25:08 -05:00
Chris Mason c487685d7c Btrfs: hash_lock is no longer needed
Before metadata is written to disk, it is updated to reflect that writeout
has begun.  Once this update is done, the block must be cow'd before it
can be modified again.

This update was originally synchronized by using a per-fs spinlock.  Today
the buffers for the metadata blocks are locked before writeout begins,
and everyone that tests the flag has the buffer locked as well.

So, the per-fs spinlock (called hash_lock for no good reason) is no
longer required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:24:25 -05:00
Chris Mason 3935127c50 Btrfs: disable leak debugging checks in extent_io.c
extent_io.c has debugging code to report and free leaked extent_state
and extent_buffer objects at rmmod time.  This helps track down
leaks and it saves you from rebooting just to properly remove the
kmem_cache object.

But, the code runs under a fairly expensive spinlock and the checks to
see if it is currently enabled are not entirely consistent.  Some use
#ifdef and some #if.

This changes everything to #if and disables the leak checking.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:24:05 -05:00
Chris Mason b7a9f29fcf Btrfs: sort references by byte number during btrfs_inc_ref
When a block goes through cow, we update the reference counts of
everything that block points to.  The internal pointers of the block
can be in just about any order, and it is likely to have clusters of
things that are close together and clusters of things that are not.

To help reduce the seeks that come with updating all of these reference
counts, sort them by byte number before actual updates are done.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:23:45 -05:00
Chris Mason b51912c91f Btrfs: async threads should try harder to find work
Tracing shows the delay between when an async thread goes to sleep
and when more work is added is often very short.  This commit adds
a little bit of delay and extra checking to the code right before
we schedule out.

It allows more work to be added to the worker
without requiring notifications from other procs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:23:24 -05:00
Jim Owens 0279b4cd86 Btrfs: selinux support
Add call to LSM security initialization and save
resulting security xattr for new inodes.

Add xattr support to symlink inode ops.

Set inode->i_op for existing special files.

Signed-off-by: jim owens <jowens@hp.com>
2009-02-04 09:29:13 -05:00
Christian Hesse bef62ef339 Btrfs: make btrfs acls selectable
This patch adds a menu entry to kconfig to enable acls for btrfs.
This allows you to enable FS_POSIX_ACL at kernel compile time.

(updated by Jeff Mahoney to make the changes in fs/btrfs/Kconfig instead)

Signed-off-by: Christian Hesse <mail@earthworm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2009-02-04 09:28:28 -05:00
Chris Mason a683705153 Btrfs: Catch missed bios in the async bio submission thread
The async bio submission thread was missing some bios that were
added after it had decided there was no work left to do.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:19:41 -05:00
Linus Torvalds f96c08e8c5 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: remove fast unmounting
  UBIFS: return sensible error codes
  UBIFS: remount ro fixes
  UBIFS: spelling fix 'date' -> 'data'
  UBIFS: sync wbufs after syncing inodes and pages
  UBIFS: fix LPT out-of-space bug (again)
  UBIFS: fix no_chk_data_crc
  UBIFS: fix assertions
  UBIFS: ensure orphan area head is initialized
  UBIFS: always clean up GC LEB space
  UBIFS: add re-mount debugging checks
  UBIFS: fix LEB list freeing
  UBIFS: simplify locking
  UBIFS: document dark_wm and dead_wm better
  UBIFS: do not treat all data as short term
  UBIFS: constify operations
  UBIFS: do not commit twice
2009-02-03 16:52:44 -08:00
Linus Torvalds 3e1c400513 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  ocfs2: add quota call to ocfs2_remove_btree_range()
  ocfs2: Wakeup the downconvert thread after a successful cancel convert
  ocfs2: Access the xattr bucket only before modifying it.
  configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()
  ocfs2: Fix possible deadlock in ocfs2_write_dquot()
  ocfs2: Push out dropping of dentry lock to ocfs2_wq
2009-02-03 16:50:20 -08:00
Felix Blyakher 43f3f057c5 [XFS] Warn on transaction in flight on read-only remount
Till VFS can correctly support read-only remount without racing,
use WARN_ON instead of BUG_ON on detecting transaction in flight
after quiescing filesystem.

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-02-03 11:04:54 -06:00
Dave Chinner 6139a23609 xfs: Check buffer lengths in log recovery
Before trying to obtain, read or write a buffer,
check that the buffer length is actually valid. If
it is not valid, then something read in the recovery
process has been corrupted and we should abort
recovery.

Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Tested-by: Eric Sesterhenn <snakebyte@gmx.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-02-03 11:01:32 -06:00
Felix Blyakher 6d2160bfe7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-linus 2009-02-03 10:38:41 -06:00
Steve French e1f81c8a41 Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-02-03 15:19:23 +00:00
Mark Fasheh fd4ef23196 ocfs2: add quota call to ocfs2_remove_btree_range()
We weren't reclaiming the clusters which get free'd from this function,
so any user punching holes in a file would still have those bytes accounted
against him/her. Add the call to vfs_dq_free_space_nodirty() to fix this.
Interestingly enough, the journal credits calculation already took this into
account.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jan Kara <jack@suse.cz>
2009-02-02 14:20:20 -08:00
Sunil Mushran a4b91965d3 ocfs2: Wakeup the downconvert thread after a successful cancel convert
When two nodes holding PR locks on a resource concurrently attempt to
upconvert the locks to EX, the master sends a BAST to one of the nodes. This
message tells that node to first cancel convert the upconvert request,
followed by downconvert to a NL. Only when this lock is downconverted to NL,
can the master upconvert the first node's lock to EX.

While the fs was doing the cancel convert, it was forgetting to wake up the
dc thread after a successful cancel, leading to a deadlock.

Reported-and-Tested-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:19 -08:00
Tao Ma 554e7f9e04 ocfs2: Access the xattr bucket only before modifying it.
In ocfs2_xattr_value_truncate, we may call b-tree codes which will
extend the journal transaction. It has a potential problem that it
may let the already-accessed-but-not-dirtied buffers gone. So we'd
better access the bucket after we call ocfs2_xattr_value_truncate.
And as for the root buffer for the xattr value, b-tree code will
acess and dirty it, so we don't need to worry about it.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:18 -08:00
Joel Becker 0e0333429a configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()
When attaching default groups (subdirs) of a new group (in mkdir() or
in configfs_register()), configfs recursively takes inode's mutexes
along the path from the parent of the new group to the default
subdirs. This is needed to ensure that the VFS will not race with
operations on these sub-dirs. This is safe for the following reasons:

- the VFS allows one to lock first an inode and second one of its
  children (The lock subclasses for this pattern are respectively
  I_MUTEX_PARENT and I_MUTEX_CHILD);
- from this rule any inode path can be recursively locked in
  descending order as long as it stays under a single mountpoint and
  does not follow symlinks.

Unfortunately lockdep does not know (yet?) how to handle such
recursion.

I've tried to use Peter Zijlstra's lock_set_subclass() helper to
upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
that we might recursively lock some of their descendant, but this
usage does not seem to fit the purpose of lock_set_subclass() because
it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
the same task.

>From inside configfs it is not possible to serialize those recursive
locking with a top-level one, because mkdir() and rmdir() are already
called with inodes locked by the VFS. So using some
mutex_lock_nest_lock() is not an option.

I am proposing two solutions:
1) one that wraps recursive mutex_lock()s with
   lockdep_off()/lockdep_on().
2) (as suggested earlier by Peter Zijlstra) one that puts the
   i_mutexes recursively locked in different classes based on their
   depth from the top-level config_group created. This
   induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
   nesting of configfs default groups whenever lockdep is activated
   but this limit looks reasonably high. Unfortunately, this alos
   isolates VFS operations on configfs default groups from the others
   and thus lowers the chances to detect locking issues.

This patch implements solution 1).

Solution 2) looks better from lockdep's point of view, but fails with
configfs_depend_item(). This needs to rework the locking
scheme of configfs_depend_item() by removing the variable lock recursion
depth, and I think that it's doable thanks to the configfs_dirent_lock.
For now, let's stick to solution 1).

Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:18 -08:00
Jan Kara f8afead716 ocfs2: Fix possible deadlock in ocfs2_write_dquot()
It could happen that some limit has been set via quotactl() and in parallel
->mark_dirty() is called from another thread doing e.g. dquot_alloc_space(). In
such case ocfs2_write_dquot() must not try to sync the dquot because that needs
global quota lock but that ranks above transaction start.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:17 -08:00
Jan Kara ea455f8ab6 ocfs2: Push out dropping of dentry lock to ocfs2_wq
Dropping of last reference to dentry lock is a complicated operation involving
dropping of reference to inode. This can get complicated and quota code in
particular needs to obtain some quota locks which leads to potential deadlock.
Thus we defer dropping of inode reference to ocfs2_wq.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:16 -08:00
Randy Dunlap c68a65da35 jfs: needs crc32_le
JFS needs crc32_le(), so select its library config symbol:

fs/built-in.o: In function `jfs_statfs':
super.c:(.text+0x7c8c0): undefined reference to `crc32_le'
super.c:(.text+0x7c8d5): undefined reference to `crc32_le'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
2009-02-02 13:43:28 -06:00
Dave Kleikamp 8db0c5d5ef Merge branch 'master' of /home/shaggy/git/linus-clean/ 2009-02-02 13:40:55 -06:00
Steve French 0e2bedaa39 [CIFS] ipv6_addr_equal for address comparison
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-30 21:24:41 +00:00
Dave Kleikamp 1ad53a98c9 jfs: Fix error handling in metapage_writepage()
Improved error handling so that last_write_complete(), and thus
end_page_writeback(), gets called only once.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
2009-01-30 14:09:06 -06:00
Linus Torvalds c01a25e7cf Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Remove bogus BUG() check in ext4_bmap()
  ext4: Fix building with EXT4FS_DEBUG
  ext4: Initialize the new group descriptor when resizing the filesystem
  ext4: Fix ext4_free_blocks() w/o a journal when files have indirect blocks
  jbd2: On a __journal_expect() assertion failure printk "JBD2", not "EXT3-fs"
  ext3: Add sanity check to make_indexed_dir
  ext4: Add sanity check to make_indexed_dir
  ext4: only use i_size_high for regular files
  ext4: fix wrong use of do_div
2009-01-30 08:54:29 -08:00
Linus Torvalds ae704e9f92 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cfq-iosched: Allow RT requests to pre-empt ongoing BE timeslice
  block: add sysfs file for controlling io stats accounting
  Mark mandatory elevator functions in the biodoc.txt
  include/linux: Add bsg.h to the Kernel exported headers
  block: silently error an unsupported barrier bio
  block: Fix documentation for blkdev_issue_flush()
  block: add bio_rw_flagged() for testing bio->bi_rw
  block: seperate bio/request unplug and sync bits
  block: export SSD/non-rotational queue flag through sysfs
  Fix small typo in bio.h's documentation
  block: get rid of the manual directory counting in blktrace
  block: Allow empty integrity profile
  block: Remove obsolete BUG_ON
  block: Don't verify integrity metadata on read error
2009-01-30 08:46:42 -08:00
Linus Torvalds dbeb17016e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (29 commits)
  tulip: fix 21142 with 10Mbps without negotiation
  drivers/net/skfp: if !capable(CAP_NET_ADMIN): inverted logic
  gianfar: Fix Wake-on-LAN support
  smsc911x: timeout reaches -1
  smsc9420: fix interrupt signalling test failures
  ucc_geth: Change uec phy id to the same format as gianfar's
  wimax: fix build issue when debugfs is disabled
  netxen: fix memory leak in drivers/net/netxen_nic_init.c
  tun: Add some missing TUN compat ioctl translations.
  ipv4: fix infinite retry loop in IP-Config
  net: update documentation ip aliases
  net: Fix OOPS in skb_seq_read().
  net: Fix frag_list handling in skb_seq_read
  netxen: revert jumbo ringsize
  ath5k: fix locking in ath5k_config
  cfg80211: print correct intersected regulatory domain
  cfg80211: Fix sanity check on 5 GHz when processing country IE
  iwlwifi: fix kernel oops when ucode DMA memory allocation failure
  rtl8187: Fix error in setting OFDM power settings for RTL8187L
  mac80211: remove Michael Wu as maintainer
  ...
2009-01-30 08:41:36 -08:00
Martin K. Petersen 8ae372e3bb block: Remove obsolete BUG_ON
Now that bio_vecs are no longer cleared in bvec_alloc_bs() the following
BUG_ON must go.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-01-30 12:34:36 +01:00
Martin K. Petersen 7b24fc4d7e block: Don't verify integrity metadata on read error
If we get an I/O error on a read request there is no point in doing a
verify pass on the integrity buffer.  Adjust the completion path
accordingly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-01-30 12:34:36 +01:00
Theodore Ts'o b9ec63f78b ext4: Remove bogus BUG() check in ext4_bmap()
The code to support journal-less ext4 operation added a BUG to
ext4_bmap() which fired if there was no journal and the
EXT4_STATE_JDATA bit was set in the i_state field.  This caused
running the filefrag program (which uses the FIMBAP ioctl) to trigger
a BUG().

The EXT4_STATE_JDATA bit is only used for ext4_bmap(), and it's
harmless for the bit to be set.  We could add a check in
__ext4_journalled_writepage() and ext4_journalled_write_end() to only
set the EXT4_STATE_JDATA bit if the journal is present, but that adds
an extra test and jump instruction.  It's easier to simply remove the
BUG check.

http://bugzilla.kernel.org/show_bug.cgi?id=12568

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-30 00:00:24 -05:00
Linus Torvalds f2257b70b0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: make sure we allocate enough storage for socket address
  [CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases
  [CIFS] some cleanup to dir.c prior to addition of posix_open
  [CIFS] revalidate parent inode when rmdir done within that directory
  [CIFS] Rename md5 functions to avoid collision with new rt modules
  cifs: turn smb_send into a wrapper around smb_sendv
2009-01-29 18:21:14 -08:00
Davide Libenzi 9df04e1f25 epoll: drop max_user_instances and rely only on max_user_watches
Linus suggested to put limits where the money is, and max_user_watches
already does that w/out the need of max_user_instances.  That has the
advantage to mitigate the potential DoS while allowing pretty generous
default behavior.

Allowing top 4% of low memory (per user) to be allocated in epoll watches,
we have:

LOMEM    MAX_WATCHES (per user)
512MB    ~178000
1GB      ~356000
2GB      ~712000

A box with 512MB of lomem, will meet some challenge in hitting 180K
watches, socket buffers math teaches us.  No more max_user_instances
limits then.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Bron Gondwana <brong@fastmail.fm>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 18:04:45 -08:00
David S. Miller df1c46b2b6 tun: Add some missing TUN compat ioctl translations.
Based upon a report from Michael Tokarev <mjt@tls.msk.ru>:

	Just saw in dmesg:

	ioctl32(kvm:4408): Unknown cmd fd(9) cmd(800454cf){t:'T';sz:4} arg(ffc668e4) on /dev/net/tun

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-29 16:53:35 -08:00
Artem Bityutskiy 27ad279933 UBIFS: remove fast unmounting
This UBIFS feature has never worked properly, and it was a mistake
to add it because we simply have no use-cases. So, lets still accept
the fast_unmount mount option, but ignore it. This does not change
much, because UBIFS commit in sync_fs anyway, and sync_fs is called
while unmounting.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:34:30 +02:00
Artem Bityutskiy a2b9df3ff6 UBIFS: return sensible error codes
When mounting/re-mounting, UBIFS returns EINVAL even if the ENOSPC
or EROFS codes are are much better, just because we have not found
references to ENOSPC/EROFS in mount (2) man pages. This patch
changes this behaviour and makes UBIFS return real error code,
because:

1. It is just less confusing and more logical
2. mount is not described in SuSv3, so it seems to be not really
   well-standartized
3. we do not cover all cases, and any random undocumented in man
   pages error code may be returned anyway

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:22:54 +02:00
Adrian Hunter b466f17d78 UBIFS: remount ro fixes
- preserve the idx_gc list - it will be needed in the same
state, should UBIFS be remounted rw again
- prevent remounting ro if we have switched to read only
mode (due to a fatal error)

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:19:36 +02:00
Adrian Hunter 227c75c91d UBIFS: spelling fix 'date' -> 'data'
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:15:51 +02:00
Adrian Hunter 3eb14297c4 UBIFS: sync wbufs after syncing inodes and pages
All writes go through wbufs so they must be sync'd
after syncing inodes and pages.

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:15:39 +02:00
Jeff Layton a9ac49d303 cifs: make sure we allocate enough storage for socket address
The sockaddr declared on the stack in cifs_get_tcp_session is too small
for IPv6 addresses. Change it from "struct sockaddr" to "struct
sockaddr_storage" to prevent stack corruption when IPv6 is used.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:13 +00:00
Steve French da505c386c [CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases
We have used approximately 15 second timeouts on nonblocking sends in the past, and
also 15 second SMB timeout (waiting for server responses, for most request types).
Now that we can do blocking tcp sends,
make blocking send timeout approximately the same (15 seconds).

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:13 +00:00
Steve French f818dd55c4 [CIFS] some cleanup to dir.c prior to addition of posix_open
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:13 +00:00
Steve French 42c245447c [CIFS] revalidate parent inode when rmdir done within that directory
When a search is pending of a parent directory, and a child directory
within it is removed, we need to reset the parent directory's time
so that we don't reuse the (now stale) search results.

Thanks to Gunter Kukkukk for reporting this:

> got the following failure notification on irc #samba:
>
> A user was updating from subversion 1.4 to 1.5, where the
> repository is located on a samba share (independent of
> unix extensions = Yes or No).
> svn 1.4 did work, 1.5 does not.
>
> The user did a lot of stracing of subversion - and wrote a
> testapplet to simulate the failing behaviour.
> I've converted the C++ source to C and added some error cases.
>
> When using "./testdir" on a local file system, "result2"
> is always (nil) as expected - cifs vfs behaves different here!
>
>   ./testdir /mnt/cifs/mounted/share
>
> returns a (failing) valid pointer.

Acked-by: Dave Kleikamp <shaggy@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:12 +00:00
Steve French 6a7f8d36c0 [CIFS] Rename md5 functions to avoid collision with new rt modules
When rt modules were added they (each) included their own md5
with names which collided with the existing names of cifs's md5 functions.

Renaming cifs's md5 modules so we don't collide with them.

> Stephen Rothwell wrote:
> When CIFS is built-in (=y) and staging/rt28[67]0 =y, there are multiple
> definitions of:
>
> build-r8250.out:(.text+0x1d8ad0): multiple definition of `MD5Init'
> build-r8250.out:(.text+0x1dbb30): multiple definition of `MD5Update'
> build-r8250.out:(.text+0x1db9b0): multiple definition of `MD5Final'
>
> all of which need to have more unique identifiers for their global
> symbols (e.g., rt28_md5_init, cifs_md5_init, foo, blah, bar).
>

CC: Greg K-H <gregkh@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:12 +00:00
Jeff Layton 0496e02d87 cifs: turn smb_send into a wrapper around smb_sendv
cifs: turn smb_send into a wrapper around smb_sendv

Rename smb_send2 to smb_sendv to make it consistent with kernel naming
conventions for functions that take a vector.

There's no need to have 2 functions to handle sending SMB calls. Turn
smb_send into a wrapper around smb_sendv. This also allows us to
properly mark the socket as needing to be reconnected when there's a
partial send from smb_send.

Also, in practice we always use the address and noblocksnd flag
that's attached to the TCP_Server_Info. There's no need to pass
them in as separate args to smb_sendv.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:12 +00:00
Chris Mason 89f135d8b5 Btrfs: fix readdir on 32 bit machines
After btrfs_readdir has gone through all the directory items, it
sets the directory f_pos to the largest possible int.  This way
applications that mix readdir with creating new files don't
end up in an endless loop finding the new directory items as they go.

It was a workaround for a bug in git, but the assumption was that if git
could make this looping mistake than it would be a common problem.

The largest possible int chosen was INT_LIMIT(typeof(file->f_pos),
and it is possible for that to be a larger number than 32 bit glibc
expects to come out of readdir.

This patches switches that to INT_LIMIT(off_t), which should keep
applications happy on 32 and 64 bit machines.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-28 15:34:27 -05:00
Chris Mason e4f722fa42 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
Fix fs/btrfs/super.c conflict around #includes
2009-01-28 20:29:43 -05:00
Joe Perches 2cf12c0bf2 dlm: comment typo fixes
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-01-28 12:56:07 -06:00
Joe Perches 44ad532b32 dlm: use ipv6_addr_copy
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-01-28 12:56:02 -06:00
Steven Whitehouse 305a47b17c dlm: Change rwlock which is only used in write mode to a spinlock
The ls_dirtbl[].lock was an rwlock, but since it was only used in write
mode a spinlock will suffice.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-01-28 12:55:55 -06:00
Adrian Hunter 4a29d2005b UBIFS: fix LPT out-of-space bug (again)
The function to traverse and dirty the LPT was still not
dirtying all nodes, with the result that the LPT could
run out of space.

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-28 16:02:07 +02:00
Jeff Layton fa82a49127 nfsd: only set file_lock.fl_lmops in nfsd4_lockt if a stateowner is found
nfsd4_lockt does a search for a lockstateowner when building the lock
struct to test. If one is found, it'll set fl_owner to it. Regardless of
whether that happens, it'll also set fl_lmops. Given that this lock is
basically a "lightweight" lock that's just used for checking conflicts,
setting fl_lmops is probably not appropriate for it.

This behavior exposed a bug in DLM's GETLK implementation where it
wasn't clearing out the fields in the file_lock before filling in
conflicting lock info. While we were able to fix this in DLM, it
still seems pointless and dangerous to set the fl_lmops this way
when we may have a NULL lockstateowner.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@pig.fieldses.org>
2009-01-27 17:26:59 -05:00
J. Bruce Fields b914152a6f nfsd: fix cred leak on every rpc
Since override_creds() took its own reference on new, we need to release
our own reference.

(Note the put_cred on the return value puts the *old* value of
current->creds, not the new passed-in value).

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-01-27 17:26:59 -05:00
J. Bruce Fields bf935a7881 nfsd: fix null dereference on error path
We're forgetting to check the return value from groups_alloc().

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-01-27 17:26:58 -05:00
Eric Sandeen f0e0059b9c don't reallocate sxp variable passed into xfs_swapext
fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels

Regression caused by 743bb4650d

This was an embarrasing mistake, reallocating the sxp pointer passed
in from the main ioctl switch.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net
Reported-by: Paul Martin <pm@debian.org>
Tested-by: Paul Martin <pm@debian.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-01-27 14:51:39 -06:00
Coly Li b5c816a4f1 jfs: return f_fsid for statfs(2)
This patch makes jfs return f_fsid info for statfs(2). By Andreas'
suggestion, this patch populates a persistent f_fsid between boots/mounts
with help of on-disk uuid record.

Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
2009-01-27 10:56:14 -06:00
Artem Bityutskiy 6f7ab6d458 UBIFS: fix no_chk_data_crc
When data CRC checking is disabled, UBIFS returns incorrect return
code from the 'try_read_node()' function (0 instead of 1, which means
CRC error), which make the caller re-read the data node again, but using
a different code patch, so the second read is fine. Thus, we read the
same node twice. And the result of this is that UBIFS is slower
with no_chk_data_crc option than it is with chk_data_crc option.
This patches fixes the problem.

Reported-by: Reuben Dowle <Reuben.Dowle@navico.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-27 16:25:10 +02:00
Thadeu Lima de Souza Cascardo 9fd9784c91 ext4: Fix building with EXT4FS_DEBUG
When bg_free_blocks_count was renamed to bg_free_blocks_count_lo in
560671a0, its uses under EXT4FS_DEBUG were not changed to the helper
ext4_free_blks_count.

Another commit, 498e5f24, also did not change everything needed under
EXT4FS_DEBUG, thus making it spill some warnings related to printing
format.

This commit fixes both issues and makes ext4 build again when
EXT4FS_DEBUG is enabled.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-26 19:26:26 -05:00
Theodore Ts'o fdff73f094 ext4: Initialize the new group descriptor when resizing the filesystem
Make sure all of the fields of the group descriptor are properly
initialized.  Previously, we allowed bg_flags field to be contain
random garbage, which could trigger non-deterministic behavior,
including a kernel OOPS.

http://bugzilla.kernel.org/show_bug.cgi?id=12433

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-26 19:06:41 -05:00
Linus Torvalds a90e8a75fb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: initialize file_lock struct in GETLK before copying conflicting lock
  dlm: fix plock notify callback to lockd
2009-01-26 10:42:05 -08:00
Linus Torvalds cc597bc3d3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6:
  ocfs2: Remove ocfs2_dquot_initialize() and ocfs2_dquot_drop()
  quota: Improve locking
2009-01-26 10:41:00 -08:00
Linus Torvalds ed80386295 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
  klist.c: bit 0 in pointer can't be used as flag
  debugfs: introduce stub for debugfs_create_size_t() when DEBUG_FS=n
  sysfs: fix problems with binary files
  PNP: fix broken pnp lowercasing for acpi module aliases
  driver core: Convert '/' to '!' in dev_set_name()
2009-01-26 10:40:28 -08:00
Linus Torvalds a1c70a756f Merge branch 'Kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/misc
* 'Kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/misc: (36 commits)
  fs/Kconfig: move 9p out
  fs/Kconfig: move afs out
  fs/Kconfig: move coda out
  fs/Kconfig: move the rest of ncpfs out
  fs/Kconfig: move smbfs out
  fs/Kconfig: move sunrpc out
  fs/Kconfig: move nfsd out
  fs/Kconfig: move nfs out
  fs/Kconfig: move ufs out
  fs/Kconfig: move sysv out
  fs/Kconfig: move romfs out
  fs/Kconfig: move qnx4 out
  fs/Kconfig: move hpfs out
  fs/Kconfig: move omfs out
  fs/Kconfig: move minix out
  fs/Kconfig: move vxfs out
  fs/Kconfig: move squashfs out
  fs/Kconfig: move cramfs out
  fs/Kconfig: move efs out
  fs/Kconfig: move bfs out
  ...
2009-01-26 10:08:50 -08:00
Vegard Nossum 3632dee2f8 inotify: clean up inotify_read and fix locking problems
If userspace supplies an invalid pointer to a read() of an inotify
instance, the inotify device's event list mutex is unlocked twice.
This causes an unbalance which effectively leaves the data structure
unprotected, and we can trigger oopses by accessing the inotify
instance from different tasks concurrently.

The best fix (contributed largely by Linus) is a total rewrite
of the function in question:

On Thu, Jan 22, 2009 at 7:05 AM, Linus Torvalds wrote:
> The thing to notice is that:
>
>  - locking is done in just one place, and there is no question about it
>   not having an unlock.
>
>  - that whole double-while(1)-loop thing is gone.
>
>  - use multiple functions to make nesting and error handling sane
>
>  - do error testing after doing the things you always need to do, ie do
>   this:
>
>        mutex_lock(..)
>        ret = function_call();
>        mutex_unlock(..)
>
>        .. test ret here ..
>
>   instead of doing conditional exits with unlocking or freeing.
>
> So if the code is written in this way, it may still be buggy, but at least
> it's not buggy because of subtle "forgot to unlock" or "forgot to free"
> issues.
>
> This _always_ unlocks if it locked, and it always frees if it got a
> non-error kevent.

Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Robert Love <rlove@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-26 10:08:05 -08:00
Linus Torvalds 2d07d4d1bb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix poll notify
  fuse: destroy bdi on umount
  fuse: fuse_fill_super error handling cleanup
  fuse: fix missing fput on error
  fuse: fix NULL deref in fuse_file_alloc()
2009-01-26 09:49:22 -08:00
Artem Bityutskiy 6ba87c9b92 UBIFS: fix assertions
I introduce wrong assertions in one of the previous commits, this
patch fixes them.

Also, initialize debugfs after the debugging check. This is a little
nicer because we want the FS data to be accessible to external users
after everything has been initialized.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 18:22:47 +02:00
Miklos Szeredi f6d47a1761 fuse: fix poll notify
Move fuse_copy_finish() to before calling fuse_notify_poll_wakeup().
This is not a big issue because fuse_notify_poll_wakeup() should be
atomic, but it's cleaner this way, and later uses of notification will
need to be able to finish the copying before performing some actions.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-01-26 15:00:59 +01:00
Miklos Szeredi 26c3679101 fuse: destroy bdi on umount
If a fuse filesystem is unmounted but the device file descriptor
remains open and a new mount reuses the old device number, then the
mount fails with EEXIST and the following warning is printed in the
kernel log:

  WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
  sysfs: duplicate filename '0:15' can not be created

The cause is that the bdi belonging to the fuse filesystem was
destoryed only after the device file was released.  Fix this by
calling bdi_destroy() from fuse_put_super() instead.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-01-26 15:00:59 +01:00
Miklos Szeredi c2b8f00690 fuse: fuse_fill_super error handling cleanup
Clean up error handling for the whole of fuse_fill_super() function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-01-26 15:00:58 +01:00
Miklos Szeredi 3ddf1e7f57 fuse: fix missing fput on error
Fix the leaking file reference if allocation or initialization of
fuse_conn failed.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-01-26 15:00:58 +01:00
Dan Carpenter bb875b38dc fuse: fix NULL deref in fuse_file_alloc()
ff is set to NULL and then dereferenced on line 65.  Compile tested only.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-01-26 15:00:58 +01:00
Adrian Hunter 49d128aa60 UBIFS: ensure orphan area head is initialized
When mounting read-only the orphan area head is
not initialized.  It must be initialized when
remounting read/write, but it was not.  This patch
fixes that.

[Artem: sorry, added comment tweaking noise]
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 12:54:11 +02:00
Artem Bityutskiy b4978e9491 UBIFS: always clean up GC LEB space
When we mount UBIFS, GC LEB may contain out-of-date information,
and UBIFS should update lprops and set free space for thei LEB.
Currently UBIFS does this only if mounted R/W. But for R/O mount
we have to do the same, because otherwise we will have incorrect
FS free space reported to user-space.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 12:54:11 +02:00
Artem Bityutskiy 84abf972cc UBIFS: add re-mount debugging checks
We observe space corrupted accounting when re-mounting. So add some
debbugging checks to catch problems like this.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 12:54:11 +02:00
Artem Bityutskiy e4d9b6cbfc UBIFS: fix LEB list freeing
When freeing the c->idx_lebs list, we have to release the LEBs as well,
because we might be called from mount to read-only mode code. Otherwise
the LEBs stay taken forever, which may cause problems when we re-mount
back ro RW mode.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 12:54:11 +02:00
Artem Bityutskiy 82c1593cad UBIFS: simplify locking
This patch simplifies lock_[23]_inodes functions. We do not have
to care about locking order, because UBIFS does this for @i_mutex
and this is enough. Thanks to Al Viro for suggesting this.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 12:54:11 +02:00
Chris Mason a717531942 Btrfs: do less aggressive btree readahead
Just before reading a leaf, btrfs scans the node for blocks that are
close by and reads them too.  It tries to build up a large window
of IO looking for blocks that are within a max distance from the top
and bottom of the IO window.

This patch changes things to just look for blocks within 64k of the
target block.  It will trigger less IO and make for lower latencies on
the read size.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-22 09:23:10 -05:00
Alexey Dobriyan 0fcb440889 fs/Kconfig: move 9p out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:01 +03:00
Alexey Dobriyan b2480c7fbf fs/Kconfig: move afs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:01 +03:00
Alexey Dobriyan 33a1a6fedf fs/Kconfig: move coda out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:01 +03:00
Alexey Dobriyan 9d7d6447ef fs/Kconfig: move the rest of ncpfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:01 +03:00
Alexey Dobriyan 213a41d404 fs/Kconfig: move smbfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:01 +03:00
Alexey Dobriyan 9098c24f35 fs/Kconfig: move sunrpc out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:00 +03:00
Alexey Dobriyan e2b329e200 fs/Kconfig: move nfsd out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:00 +03:00
Alexey Dobriyan 97afe47ac3 fs/Kconfig: move nfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:00 +03:00
Alexey Dobriyan a276a52f9f fs/Kconfig: move ufs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:00 +03:00
Alexey Dobriyan 8af915ba1d fs/Kconfig: move sysv out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:59 +03:00
Alexey Dobriyan 41810246df fs/Kconfig: move romfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:59 +03:00
Alexey Dobriyan 4c7415830c fs/Kconfig: move qnx4 out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:59 +03:00
Alexey Dobriyan 928ea19295 fs/Kconfig: move hpfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:59 +03:00
Alexey Dobriyan da55e6f928 fs/Kconfig: move omfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:58 +03:00
Alexey Dobriyan 8b1cd7d3c5 fs/Kconfig: move minix out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:58 +03:00
Alexey Dobriyan 22135169dd fs/Kconfig: move vxfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:58 +03:00
Alexey Dobriyan 22635ec9e0 fs/Kconfig: move squashfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:58 +03:00
Alexey Dobriyan 2a22783be0 fs/Kconfig: move cramfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:58 +03:00
Alexey Dobriyan 571f0a0bde fs/Kconfig: move efs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:57 +03:00
Alexey Dobriyan 0ff423849d fs/Kconfig: move bfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:57 +03:00
Alexey Dobriyan 0b09eb3298 fs/Kconfig: move befs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:57 +03:00
Alexey Dobriyan b08bac1f18 fs/Kconfig: move hfs, hfsplus out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:57 +03:00
Alexey Dobriyan 295c896cb9 fs/Kconfig: move ecryptfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:56 +03:00
Alexey Dobriyan 10951bf05d fs/Kconfig: move affs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:56 +03:00
Alexey Dobriyan bc2de2ae67 fs/Kconfig: move adfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:56 +03:00
Alexey Dobriyan 4591dabe27 fs/Kconfig: move configfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:56 +03:00
Alexey Dobriyan 5f3a211a8b fs/Kconfig: move sysfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:56 +03:00
Alexey Dobriyan 9d73ac9e8f fs/Kconfig: move ntfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:55 +03:00
Alexey Dobriyan 1c6ace019b fs/Kconfig: move fat out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:55 +03:00
Alexey Dobriyan ddfaccd995 fs/Kconfig: move iso9660, udf out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:55 +03:00
Alexey Dobriyan 3ef7784e47 fs/Kconfig: move fuse out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:55 +03:00
Alexey Dobriyan 90ffd46793 fs/Kconfig: move autofs, autofs4 out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:54 +03:00
Alexey Dobriyan 335debee07 fs/Kconfig: move btrfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:54 +03:00
Alexey Dobriyan 2fe4371dff fs/Kconfig: move ocfs2 out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:54 +03:00
Alexey Dobriyan f5c77969b3 fs/Kconfig: move jfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:54 +03:00
Alexey Dobriyan b16ecfe2f9 fs/Kconfig: move reiserfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:53 +03:00
Dave Chinner 74e2d06521 Long btree pointers are still 64 bit on disk
[XFS] Long btree pointers are still 64 bit on disk

On 32 bit machines with CONFIG_LBD=n, XFS reduces the
in memory size of xfs_fsblock_t to 32 bits so that it
will fit within 32 bit addressing. However, the disk format
for long btree pointers are still 64 bits in size.

The recent btree rewrite failed to take this into account
when initialising new btree blocks, setting sibling pointers
to NULL and checking if they are NULL. Hence checking whether
a 64 bit NULL was the same as a 32 bit NULL was failingi
resulting in NULL sibling pointers failing to be detected
correctly. This showed up as WANT_CORRUPTED_GOTO shutdowns
in xfs_btree_delrec.

Fix this by making all the comparisons and setting of long
pointer btree NULL blocks to the disk format, not the
in memory format. i.e. use NULLDFSBNO.

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reported-by: Jacek Luczak <difrost.kernel@gmail.com>
Reported-by: Danny ter Haar <dth@dth.net>
Tested-by: Jacek Luczak <difrost.kernel@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-01-22 01:23:11 -06:00
Jeff Layton 20d5a39929 dlm: initialize file_lock struct in GETLK before copying conflicting lock
dlm_posix_get fills out the relevant fields in the file_lock before
returning when there is a lock conflict, but doesn't clean out any of
the other fields in the file_lock.

When nfsd does a NFSv4 lockt call, it sets the fl_lmops to
nfsd_posix_mng_ops before calling the lower fs. When the lock comes back
after testing a lock on GFS2, it still has that field set. This confuses
nfsd into thinking that the file_lock is a nfsd4 lock.

Fix this by making DLM reinitialize the file_lock before copying the
fields from the conflicting lock.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-01-21 15:28:45 -06:00
David Teigland 24179f4880 dlm: fix plock notify callback to lockd
We should use the original copy of the file_lock, fl, instead
of the copy, flc in the lockd notify callback.  The range in flc has
been modified by posix_lock_file(), so it will not match a copy of the
lock in lockd.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-01-21 15:28:45 -06:00
Yehuda Sadeh 1506fcc818 Btrfs: fiemap support
Now that bmap support is gone, this is the only way to get extent
mappings for userland.  These are still not valid for IO, but they
can tell us if a file has holes or how much fragmentation there is.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2009-01-21 14:39:14 -05:00
Chris Mason 35054394c4 Btrfs: stop providing a bmap operation to avoid swapfile corruptions
Swapfiles use bmap to build a list of extents belonging to the file,
and they assume these extents won't change over the life of the file.
They also use resulting list to do IO directly to the block device.

This causes problems for btrfs in a few ways:

btrfs returns logical block numbers through bmap, and these are not suitable
for IO.  They might translate to different devices, raid etc.

COW means that file block mappings are going to change frequently.

Using swapfiles on btrfs will lead to corruption, so we're avoiding the
problem for now by dropping bmap support entirely.  A later commit
will add fiemap support for people that really want to know how
a file is laid out.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 13:11:13 -05:00
Yan Zheng 7237f18336 Btrfs: fix tree logs parallel sync
To improve performance, btrfs_sync_log merges tree log sync
requests. But it wrongly merges sync requests for different
tree logs. If multiple tree logs are synced at the same time,
only one of them actually gets synced.

This patch has following changes to fix the bug:

Move most tree log related fields in btrfs_fs_info to
btrfs_root. This allows merging sync requests separately
for each tree log.

Don't insert root item into the log root tree immediately
after log tree is allocated. Root item for log tree is
inserted when log tree get synced for the first time. This
allows syncing the log root tree without first syncing all
log trees.

At tree-log sync, btrfs_sync_log first sync the log tree;
then updates corresponding root item in the log root tree;
sync the log root tree; then update the super block.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 12:54:03 -05:00
Qinghuang Feng 7e6628544a Btrfs: open_ctree() error handling can oops on fs_info
a bug in open_ctree:

struct btrfs_root *open_ctree(..)
{
....
	if (!extent_root || !tree_root || !fs_info ||
	    !chunk_root || !dev_root || !csum_root) {
		err = -ENOMEM;
		goto fail;
//When code flow goes to "fail", fs_info may be NULL or uninitialized.
	}
....

fail:
	btrfs_close_devices(fs_info->fs_devices);// !
	btrfs_mapping_tree_free(&fs_info->mapping_tree);// !

	kfree(extent_root);
	kfree(tree_root);
	bdi_destroy(&fs_info->bdi);// !
...
)

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng 86288a198d Btrfs: fix stop searching test in replace_one_extent
replace_one_extent searches tree leaves for references to a given extent. It
stops searching if it goes beyond the last possible position.

The last possible position is computed by adding the starting offset of a found
file extent to the full size of the extent. The code uses physical size of the
extent as the full size. This is incorrect when compression is used.

The fix is get the full size from ram_bytes field of file extent item.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Jan Engelhardt 95029d7d59 Btrfs: change/remove typedef
Change one typedef to a regular enum, and remove an unused one.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Huang Weiyi 653249ff9a Btrfs: remove duplicated #include
Removed duplicated #include "compat.h"in
fs/btrfs/extent-tree.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng 5a7be515b1 Btrfs: Fix infinite loop in btrfs_extent_post_op
btrfs_extent_post_op calls finish_current_insert and del_pending_extents. They
both may enter infinite loops.

finish_current_insert enters infinite loop if it only finds some backrefs to
update.  The fix is to check for pending backref updates before restarting the
loop.

The infinite loop in del_pending_extents is due to a the skipped variable
not being properly reset before looping around.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng 3dfdb9348a Btrfs: fix locking issue in btrfs_remove_block_group
We should hold the block_group_cache_lock while modifying the
block groups red-black tree. Thank you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Qinghuang Feng c6e308713a Btrfs: simplify iteration codes
Merge list_for_each* and list_entry to list_for_each_entry*

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:59:08 -05:00
Qinghuang Feng 57506d50ed Btrfs: check return value for kthread_run() correctly
kthread_run() returns the kthread or ERR_PTR(-ENOMEM), not NULL.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Roland Dreier 119e10cf1b Btrfs: Remove extra KERN_INFO in the middle of a line
The "devid <xxx> transid <xxx>" printk in btrfs_scan_one_device()
actually follows another printk that doesn't end in a newline (since the
intention is for the two printks to make one line of output), so the
KERN_INFO just ends up messing up the output:

    device label exp <6>devid 1 transid 9 /dev/sda5

Fix this by changing the extra KERN_INFO to KERN_CONT.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Huang Weiyi 7eaebe7d50 Btrfs: removed unused #include <version.h>'s
Removed unused #include <version.h>'s in btrfs

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00