Commit graph

10579 commits

Author SHA1 Message Date
Hidehiro Kawai 0e4fb5e283 ext3: add an option to control error handling on file data
If the journal doesn't abort when it gets an IO error in file data blocks,
the file data corruption will spread silently.  Because most of
applications and commands do buffered writes without fsync(), they don't
notice the IO error.  It's scary for mission critical systems.  On the
other hand, if the journal aborts whenever it gets an IO error in file
data blocks, the system will easily become inoperable.  So this patch
introduces a filesystem option to determine whether it aborts the journal
or just calls printk() when it gets an IO error in file data.

If you mount an ext3 fs with the data_err=abort option, it aborts on a file
data write error.  If you mount it with data_err=ignore, it doesn't abort
and just calls printk().  data_err=ignore is the default.
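
As a rough illustration of the two modes, the userspace sketch below scans a comma-separated option string for data_err=. It is not the ext3 option parser: the enum, token_is() and parse_data_err() are hypothetical names, and letting the last occurrence win is a choice made only for this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for the two behaviours described above. */
enum data_err_mode { DATA_ERR_IGNORE, DATA_ERR_ABORT };

/* True if the token p[0..len) is exactly the string lit. */
static int token_is(const char *p, size_t len, const char *lit)
{
	return len == strlen(lit) && strncmp(p, lit, len) == 0;
}

/*
 * Scan a comma-separated option string for data_err=<mode>.
 * The last occurrence wins (a choice of this sketch). When the option
 * is absent the result is DATA_ERR_IGNORE, the stated default.
 */
enum data_err_mode parse_data_err(const char *opts)
{
	enum data_err_mode mode = DATA_ERR_IGNORE;
	const char *p = opts;

	assert(opts != NULL);
	while (*p != '\0') {
		size_t len = strcspn(p, ",");

		if (token_is(p, len, "data_err=abort"))
			mode = DATA_ERR_ABORT;
		else if (token_is(p, len, "data_err=ignore"))
			mode = DATA_ERR_IGNORE;
		p += len;
		if (*p == ',')
			p++;
	}
	return mode;
}
```

Calling parse_data_err("rw,data_err=abort") selects the abort mode, while an option string without data_err= falls back to ignore.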

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:37 -07:00
Mingming Cao 46d01a225e ext3: fix ext3 block reservation early ENOSPC issue
We could run into an ENOSPC error on ext3 even when there are free blocks on
the filesystem.

The problem is triggered when the goal block group has 0 free blocks and
the remaining block groups are skipped due to the check of
"free_blocks < windowsz/2".  The current code could fall back to
non-reservation allocation to prevent early ENOSPC after examining all the
block groups with reservation on, but this code was bypassed if the
reservation window was already turned off, which is true in this case.

This patch fixes two issues:
1) We don't need to turn off block reservation if the goal block group has
0 free blocks left and we continue searching the rest of the block groups.

The intention of the current code is to turn off block reservation if the
goal allocation group has a few (some) free blocks left (not enough to
make the desired reservation window), so as to try to allocate in the goal
block group and get better locality.  But if the goal block group has 0
free blocks, it should leave block reservation on and continue searching
the next block groups, rather than turning off block reservation completely.

2) We don't need to check the window size if block reservation is off.

The problem was originally found and fixed in ext4.
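
The two fixes can be read as two small predicates. The sketch below is only a model of the rules as worded in this message, not the ext3 allocator code; the function names are hypothetical, and "not enough to make the desired reservation window" is assumed to mean free_blocks < windowsz.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Fix 1: drop the reservation only when the goal group still has some
 * free blocks, but fewer than a full window. With 0 free blocks the
 * reservation stays on and the search moves to the next groups.
 */
bool should_turn_off_reservation(unsigned long free_blocks,
				 unsigned long windowsz)
{
	assert(windowsz > 0);
	return free_blocks > 0 && free_blocks < windowsz;
}

/*
 * Fix 2: the "free_blocks < windowsz/2" check only applies while the
 * reservation is on.
 */
bool should_skip_group(bool reservation_on, unsigned long free_blocks,
		       unsigned long windowsz)
{
	assert(windowsz > 0);
	return reservation_on && free_blocks < windowsz / 2;
}
```

With a window of 8 blocks, a goal group holding 0 free blocks keeps the reservation on, and a group with 3 free blocks is skipped only while the reservation is active.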

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:37 -07:00
Josef Bacik 972fbf7798 ext3: don't try to resize if there are no reserved gdt blocks left
When you try to resize an ext3 fs and run out of reserved gdt blocks,
you get an error that doesn't actually tell you what went wrong; it just
says that the gdb it picked is not correct, which is the case since you
don't have any reserved gdt blocks left.  This patch adds a check to make
sure you have reserved gdt blocks to use, and if not, prints a more
relevant error.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Andreas Dilger <adilger@sun.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:37 -07:00
Hidehiro Kawai 885e353c74 jbd: don't dirty original metadata buffer on abort
Currently, original metadata buffers are dirtied when they are unfiled
whether the journal has aborted or not.  Eventually these buffers will be
written-back to the filesystem by pdflush.  This means some metadata
buffers are written to the filesystem without journaling if the journal
aborts.  So if both a journal abort and a system crash happen at the same
time, the filesystem would be left in an inconsistent state.  Additionally,
replaying journaled metadata can partly overwrite the latest metadata on
the filesystem, because if the journal aborts, journaled metadata are
preserved and replayed during the next mount so as not to lose
uncheckpointed metadata.  This would also break the consistency of the
filesystem.

This patch prevents original metadata buffers from being dirtied on abort
by clearing BH_JBDDirty flag from those buffers.  Thus, no metadata
buffers are written to the filesystem without journaling.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:37 -07:00
Hidehiro Kawai d1645e526a jbd: abort when failed to log metadata buffers
If we fail to write metadata buffers to the journal space but succeed in
writing the commit record, stale data can be written back to the
filesystem as metadata in the recovery phase.

To avoid this, when we fail to write out metadata buffers, abort the
journal before writing the commit record.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:36 -07:00
KOSAKI Motohiro e575f111dc coredump_filter: add hugepage dumping
Presently a hugepage's vma has the VM_RESERVED flag so that it is not
swapped.  But a VM_RESERVED vma isn't core dumped, because this flag is
often used for some kernel vmas (e.g.  vmalloc, sound related).

Thus hugepages are never dumped and can't be debugged easily.  Many
developers want hugepages to be included in core dumps.

However, we can't read a generic VM_RESERVED area, because such an area is
often an IO mapping area, and reading it may change device state.  That is
definitely an undesirable side effect.

So adding a hugepage-specific bit to the coredump filter is better.  It
enables hugepage core dumping and doesn't cause any side effects on
any i/o devices.

In addition, libhugetlb uses hugetlb private mapping pages as anonymous
pages.  So hugepage private mapping pages should be core dumped by
default.

So /proc/[pid]/coredump_filter has two new bits.

 - bit 5 controls whether hugetlb private mapping pages are dumped. (default: yes)
 - bit 6 controls whether hugetlb shared mapping pages are dumped.  (default: no)
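
To make the bit positions concrete, here is a small sketch that decodes the two new bits from a filter value. The macro and function names are hypothetical; only the bit numbers come from the list above. Note that the 0x43 written in the test commands that follow has bit 6 set and bit 5 clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the bit numbers are the ones listed above. */
#define TOY_DUMP_HUGETLB_PRIVATE (1u << 5)
#define TOY_DUMP_HUGETLB_SHARED  (1u << 6)

static_assert(TOY_DUMP_HUGETLB_PRIVATE == 0x20, "bit 5 is 0x20");
static_assert(TOY_DUMP_HUGETLB_SHARED == 0x40, "bit 6 is 0x40");

/* True if the filter asks for hugetlb private mappings to be dumped. */
bool dumps_hugetlb_private(unsigned int filter)
{
	return (filter & TOY_DUMP_HUGETLB_PRIVATE) != 0;
}

/* True if the filter asks for hugetlb shared mappings to be dumped. */
bool dumps_hugetlb_shared(unsigned int filter)
{
	return (filter & TOY_DUMP_HUGETLB_SHARED) != 0;
}
```

So the second test run dumps shared hugetlb mappings and leaves private ones out, the reverse of the defaults.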

I tested with the following method.

% ulimit -c unlimited
% ./crash_hugepage  50
% ./crash_hugepage  50  -p
% ls -lh
% gdb ./crash_hugepage core
%
% echo 0x43 > /proc/self/coredump_filter
% ./crash_hugepage  50
% ./crash_hugepage  50  -p
% ls -lh
% gdb ./crash_hugepage core

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

#include "hugetlbfs.h"

int main(int argc, char** argv){
	char* p;
	int ch;
	int mmap_flags = MAP_SHARED;
	int fd;
	int nr_pages;

	while((ch = getopt(argc, argv, "p")) != -1) {
		switch (ch) {
		case 'p':
			mmap_flags &= ~MAP_SHARED;
			mmap_flags |= MAP_PRIVATE;
			break;
		default:
			/* nothing*/
			break;
		}
	}
	argc -= optind;
	argv += optind;

	if (argc == 0){
		printf("need # of pages\n");
		exit(1);
	}

	nr_pages = atoi(argv[0]);
	if (nr_pages < 2) {
		printf("nr_pages must be >= 2\n");
		exit(1);
	}

	fd = hugetlbfs_unlinked_fd();
	p = mmap(NULL, nr_pages * gethugepagesize(),
		 PROT_READ|PROT_WRITE, mmap_flags, fd, 0);

	sleep(2);

	*(p + gethugepagesize()) = 1; /* COW */
	sleep(2);

	/* crash! */
	*(int*)0 = 1;

	return 0;
}

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: William Irwin <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:32 -07:00
Nick Piggin 51b07fc3c5 fs: buffer lock use lock bitops
trylock_buffer and unlock_buffer open and close a critical section.
Hence, we can use the lock bitops to get the desired memory ordering.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:32 -07:00
Nick Piggin 5344b7e648 vmstat: mlocked pages statistics
Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.

[kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
[lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:31 -07:00
Lee Schermerhorn ba9ddf4939 Ramfs and Ram Disk pages are unevictable
Christoph Lameter pointed out that ram disk pages also clutter the LRU
lists.  When vmscan finds them dirty and tries to clean them, the ram disk
writeback function just redirties the page so that it goes back onto the
active list.  Round and round she goes...

With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
longer the case, as ram disk pages are no longer maintained on the lru.
[This makes them unmigratable for defrag or memory hot remove, but that
can be addressed by a separate patch series.] However, the ramfs pages
behave like ram disk pages used to, so:

Define new address_space flag [shares address_space flags member with
mapping's gfp mask] to indicate that the address space contains all
unevictable pages.  This will provide for efficient testing of ramfs pages
in page_evictable().

Also provide wrapper functions to set/test the unevictable state to
minimize #ifdefs in ramfs driver and any other users of this facility.

Set the unevictable state on address_space structures for new ramfs
inodes.  Test the unevictable state in page_evictable() to cull
unevictable pages.

These changes depend on [CONFIG_]UNEVICTABLE_LRU.

[riel@redhat.com: undo the brd.c part]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:26 -07:00
Lee Schermerhorn 7b854121eb Unevictable LRU Page Statistics
Report unevictable pages per zone and system wide.

Kosaki Motohiro added support for memory controller unevictable
statistics.

[riel@redhat.com: fix printk in show_free_areas()]
[akpm@linux-foundation.org: fix units in /proc/vmstats]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:26 -07:00
Rik van Riel 4f98a2fee8 vmscan: split LRU lists into anon & file sets
Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon").  The latter includes tmpfs.

The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.

This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists.  The big
policy changes are in separate patches.

[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:25 -07:00
Thomas Gleixner c465a76af6 Merge branches 'timers/clocksource', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus 2008-10-20 13:14:06 +02:00
Steve French 3270958b71 [CIFS] undo changes in cifs_rename_pending_delete if it errors out
The cifs_rename_pending_delete process involves multiple steps. If it
fails and we're going to return error, we don't want to leave things in
a half-finished state. Add code to the function to undo changes if
a call fails.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-10-20 00:44:19 +00:00
Jeff Layton 9a8165fce7 cifs: track DeletePending flag in cifsInodeInfo
The QPathInfo call returns a flag that indicates whether DELETE_ON_CLOSE
is set. Track it in the cifsInodeInfo.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-10-20 00:33:52 +00:00
Geert Uytterhoeven 54779aabb0 UBIFS: fix ubifs_compress commentary
Update the comment for ubifs_compress(), which incorrectly states that it
returns a success/failure indicator.

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-10-19 13:01:37 +03:00
Artem Bityutskiy fae7fb299f UBIFS: amend printk
It is better to print "Reserved for root" than
"Reserved pool size", because it is more obvious for users
what this means.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2008-10-19 13:01:30 +03:00
Adrian Hunter 727d2dc045 UBIFS: do not read unnecessary bytes when unpacking bits
Fixes the following Oops:

BUG: unable to handle kernel paging request at f8d24000
IP: [<f8ff0657>] :ubifs:ubifs_unpack_bits+0xcd/0x231
*pde = 34333067 *pte = 00000000
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: deflate zlib_deflate lzo lzo_decompress lzo_compress
ubifs ubi nandsim nand nand_ids nand_ecc mtd nfsd lockd sunrpc exportfs
[last unloaded: nand_ecc]

Pid: 7450, comm: sync Not tainted (2.6.27-rc8-ubifs-2.6 #27)
EIP: 0060:[<f8ff0657>] EFLAGS: 00010206 CPU: 0
EIP is at ubifs_unpack_bits+0xcd/0x231 [ubifs]
EAX: 00000000 EBX: 00000000 ECX: d7e43dc0 EDX: 0000ff00
ESI: 00000004 EDI: f8d23ffe EBP: d7e43db4 ESP: d7e43d8c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process sync (pid: 7450, ti=d7e42000 task=eb6f9530 task.ti=d7e42000)
Stack: 00000400 c0103db4 dc5e8090 d7e43dc0 d7e43dc0 d7e43dc4 0000001c 00000004
      f496d1e0 f8d23ffc d7e43dd4 f8ffac7e f8d23ffe 00000000 f8d23ffe f2b7af68
      f496d1e0 f8d23ffc d7e43e2c f8ffadc5 00000000 0001f000 00000000 c03b10a7
Call Trace:
[<c0103db4>] ? restore_nocheck_notrace+0x0/0xe
[<f8ffac7e>] ? is_a_node+0x43/0x92 [ubifs]
[<f8ffadc5>] ? dbg_check_ltab+0xf8/0x5c9 [ubifs]
[<c03b10a7>] ? mutex_lock_nested+0x1b2/0x2a0
[<f8ffc86e>] ? ubifs_lpt_start_commit+0x49/0xecb [ubifs]
[<c03b0ef3>] ? mutex_unlock+0xd/0xf
[<f8fef017>] ? ubifs_tnc_start_commit+0x1cf/0xef8 [ubifs]
[<f8fe65d8>] ? do_commit+0x18f/0x52d [ubifs]
[<f8fe69f6>] ? ubifs_run_commit+0x80/0xca [ubifs]
[<f8fd8d35>] ? ubifs_sync_fs+0xdb/0xf6 [ubifs]
[<c0181a07>] ? sync_filesystems+0xc6/0x10c
[<c019f279>] ? do_sync+0x3b/0x6a
[<c019f2ba>] ? sys_sync+0x12/0x18
[<c0103ced>] ? sysenter_do_call+0x12/0x35
=======================
Code: 4d ec 89 01 8b 45 e8 89 10 89 d8 89 f1 d3 e8 85 c0 74 07 29 d6 83 fe
20 75 2a 89 d8 83 c4 1c 5b 5e 5f 5d c3 0f b6 57 01 c1 e2 08 <0f> b6 47 02
c1 e0 10 09 c2 0f b6 07 09 c2 0f b
EIP: [<f8ff0657>] ubifs_unpack_bits+0xcd/0x231 [ubifs] SS:ESP 0068:d7e43d8c
---[ end trace 1bbb4c407a6dd816 ]---

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
2008-10-19 13:01:21 +03:00
Alexander Belyakov 5bf1723723 [JFFS2] Write buffer offset adjustment for NOR-ECC (Sibley) flash
After choosing new c->nextblock, don't leave the wbuf offset field
occasionally pointing at the start of the next physical eraseblock.
This was causing a BUG() on NOR-ECC (Sibley) flash, where we start
writing after the cleanmarker.

Among other things, this fix should cover write buffer offset adjustment
after flushing the last page of an eraseblock.

Signed-off-by: Alexander Belyakov <abelyako@googlemail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2008-10-18 11:54:09 +01:00
Linus Torvalds 58617d5e59 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Remove automatic enabling of the HUGE_FILE feature flag
  ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback
  ext4: Update Documentation/filesystems/ext4.txt
  ext4: Remove unused mount options: nomballoc, mballoc, nocheck
  ext4: Remove compile warnings when building w/o CONFIG_PROC_FS
  ext4: Add missing newlines to printk messages
  ext4: Fix file fragmentation during large file write.
  vfs: Add no_nrwrite_index_update writeback control flag
  vfs: Remove the range_cont writeback mode.
  ext4: Use tag dirty lookup during mpage_da_submit_io
  ext4: let the block device know when unused blocks can be discarded
  ext4: Don't reuse released data blocks until transaction commits
  ext4: Use an rbtree for tracking blocks freed during transaction.
  ext4: Do mballoc init before doing filesystem recovery
  ext4: Free ext4_prealloc_space using kmem_cache_free
  ext4: Fix Kconfig typo for ext4dev
  ext4: Remove an old reference to ext4dev in Makefile comment
2008-10-17 15:08:11 -07:00
Magnus Deininger 57c7b4e68e 9p: fix device file handling
In v9fs_get_inode(), for block as well as char devices (in theory),
the function init_special_inode() is called to set up callback functions
for file ops.  This function uses the file mode's value to determine whether
to use block or char dev functions.  In v9fs_inode_from_fid(), the function
p9mode2unixmode() is used, but for all devices it initially returns S_IFBLK;
v9fs_get_inode() is then used to initialise a new inode, and finally
v9fs_stat2inode() determines whether the inode is a block or
character device.  However, at that point init_special_inode() has already
decided to use the block device functions, so even if the inode's mode is
changed to a character device, the block functions are still used to operate
on it.  The attached patch simply calls init_special_inode() again for devices
after parsing device node data in v9fs_stat2inode() so that the proper functions
are used.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 12:44:46 -05:00
Andy Adamson ec9a05c94c NFS: use correct fs type for v4 submounts and referrals
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-17 13:06:48 -04:00
Neil Brown 504e518953 Make nfs_file_cred more robust.
As not all files have an associated open_context (e.g. device special
files), it is safest to test for the existence of the open context
before de-referencing it.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-17 13:06:45 -04:00
Chuck Lever 18de973530 NFS: Enable NFSv4 callback server to listen on AF_INET6 sockets
Allow the NFS callback server to listen for requests via an AF_INET6 or
AF_INET socket when IPv6 support is present in the kernel.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-17 13:06:41 -04:00
Arjan van de Ven 651dab4264 Merge commit 'linus/master' into merge-linus
Conflicts:

	arch/x86/kvm/i8254.c
2008-10-17 09:20:26 -07:00
Eric Van Hensbergen 02da398b95 9p: eliminate deprecated conv functions
Remove deprecated conv functions which have been replaced with new
protocol routines.

This patch also reworks the one instance of the file-system code which
directly calls conversion routines (to accomplish unpacking dirreads).

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:06:57 -05:00
Eric Van Hensbergen 51a87c552d 9p: rework client code to use new protocol support functions
Now that the new protocol functions are in place, this patch switches
the client code to using the new support code.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:04:45 -05:00
Eric Van Hensbergen 06b55b464e 9p: move dirread to fs layer
Currently reading a directory is implemented in the client code.
This function is not actually a wire operation, but a meta operation 
which calls read operations and processes the results.

This patch moves this functionality to the fs layer and calls component
wire operations instead of constructing their packets.  This provides a 
cleaner separation and will help when we reorganize the client functions
and protocol processing methods.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:04:43 -05:00
Eric Van Hensbergen dfb0ec2e13 9p: adjust 9p vfs write operation
Currently, the 9p net wire operation ensures that all data is sent by sending
multiple packets if the data requested is larger than the msize.  This is
better handled in the vfs code so that we can simplify wire operations to 
being concerned with only putting data onto and taking data off of the wire.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:04:43 -05:00
Eric Van Hensbergen fbedadc16e 9p: move readn meta-function from client to fs layer
There are a couple of methods in the client code which aren't actually
wire operations.  To keep things better organized, these operations are
being moved to the fs layer.

This patch moves the readn meta-function (which executes multiple wire
reads until a buffer is full) to the fs layer.
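
A readn-style meta-function is essentially a loop around a single read primitive. The sketch below shows that loop in plain C under stated assumptions: toy_read_fn, toy_readn(), struct toy_source and toy_source_read() are all hypothetical, errors are not modelled, and max_chunk merely stands in for a per-request size limit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One "wire read": returns the bytes copied, 0 when no data is left. */
typedef size_t (*toy_read_fn)(void *ctx, void *buf, size_t count,
			      size_t offset);

/* Repeat single reads until buf is full or the source runs dry. */
size_t toy_readn(toy_read_fn read_once, void *ctx, void *buf,
		 size_t count, size_t offset)
{
	size_t total = 0;

	assert(read_once != NULL && buf != NULL);
	while (total < count) {
		size_t n = read_once(ctx, (char *)buf + total,
				     count - total, offset + total);

		if (n == 0)
			break;
		total += n;
	}
	return total;
}

/* In-memory source that hands out at most max_chunk bytes per read. */
struct toy_source {
	const char *data;
	size_t len;
	size_t max_chunk;
};

size_t toy_source_read(void *ctx, void *buf, size_t count, size_t offset)
{
	const struct toy_source *src = ctx;
	size_t n;

	if (offset >= src->len)
		return 0;
	n = src->len - offset;
	if (n > count)
		n = count;
	if (n > src->max_chunk)
		n = src->max_chunk;
	memcpy(buf, src->data + offset, n);
	return n;
}
```

Because the loop only depends on the read primitive's contract, it can live above the wire layer, which is the separation this patch is after.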

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:04:43 -05:00
Eric Van Hensbergen 0fc9655ec6 9p: consolidate read/write functions
Currently there are two separate versions of read and write.  One for
dealing with user buffers and the other for dealing with kernel buffers.
There is a tremendous amount of code duplication in the otherwise
identical versions of these functions.  This patch adds an additional
user buffer parameter to read and write and conditionalizes handling of
the buffer on whether the kernel buffer or the user buffer is populated.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:04:42 -05:00
Eric Van Hensbergen 8b81ef589a 9p: consolidate transport structure
Right now there is a transport module structure which provides per-transport
type functions and data and a transport structure which contains per-instance
public data as well as function pointers to instance specific functions.

This patch moves public transport visible instance data to the client
structure (which in some cases had duplicate data) and consolidates the
functions into the transport module structure.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-10-17 11:04:41 -05:00
Geert Uytterhoeven faa5c2a15e [JFFS2] Correct parameter names of jffs2_compress() in comments
Make the parameter names of jffs2_compress() in its comments match with the
actual implementation

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2008-10-17 15:56:44 +01:00
Jeff Layton dd1db2dedc cifs: don't use CREATE_DELETE_ON_CLOSE in cifs_rename_pending_delete
CREATE_DELETE_ON_CLOSE apparently has different semantics than when you
set the DELETE_ON_CLOSE bit after opening the file. Setting it in the
open says "delete this file as soon as this filehandle is closed". That's
not what we want for cifs_rename_pending_delete.

Don't set this bit in the CreateFlags. Experimentation shows that
setting this flag in the SET_FILE_INFO call has no effect.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-10-17 14:47:13 +00:00
Randy Dunlap 496aa8a98f block: fix current kernel-doc warnings
Fix block kernel-doc warnings:

Warning(linux-2.6.27-git4//fs/block_dev.c:1272): No description found for parameter 'path'
Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'cpu'
Warning(linux-2.6.27-git4//block/blk-core.c:1021): No description found for parameter 'part'
Warning(/var/linsrc/linux-2.6.27-git4//block/genhd.c:544): No description found for parameter 'partno'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-17 08:46:57 +02:00
Tejun Heo 0fc71e3d65 block: add partition attribute for partition number
With extended devt, finding out the partition number becomes a bit
more challenging as subtracting the minor number from that of the
parent device doesn't work anymore.  The only thing left is parsing
the partition name which is brittle and not exactly universal (some
have '-' between the device name and partition number while others
don't).  This patch introduces a partition attribute which contains the
partition number of the device.  This should make finding partitions
and their indices easier.

This problem and solution were suggested by H. Peter Anvin.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-17 08:46:56 +02:00
Theodore Ts'o f287a1a561 ext4: Remove automatic enabling of the HUGE_FILE feature flag
If the HUGE_FILE feature flag is not set, don't allow the creation of
large files, instead of automatically enabling the feature flag.
Recent versions of mke2fs will set the HUGE_FILE flag automatically
anyway for ext4 filesystems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-16 22:50:48 -04:00
Theodore Ts'o 3e624fc72f ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback
The multiblock allocator needs to be able to release blocks (and issue
a blkdev discard request) when the transaction which freed those
blocks is committed.  Previously this was done via a polling mechanism
when blocks are allocated or freed.  A much better way of doing things
is to create a jbd2 callback function and attach the list of blocks
to be freed directly to the transaction structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-16 20:00:24 -04:00
Theodore Ts'o 01436ef2e4 ext4: Remove unused mount options: nomballoc, mballoc, nocheck
These mount options don't actually do anything any more, so remove
them.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-17 07:22:35 -04:00
Manish Katiyar 0b09923eab ext4: Remove compile warnings when building w/o CONFIG_PROC_FS
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-17 14:58:45 -04:00
Eric Sesterhenn 5128273a32 ext4: Add missing newlines to printk messages
There are some newlines missing in ext4_check_descriptors, which
cause the printk level to be printed out when the next printk call
is made:

[  778.847265] EXT4-fs: ext4_check_descriptors: Block bitmap for group 0
not in group (block 1509949442)!<3>EXT4-fs: group descriptors corrupted!
[  802.646630] EXT4-fs: ext4_check_descriptors: Inode bitmap for group 0
not in group (block 9043971)!<3>EXT4-fs: group descriptors corrupted!

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-17 09:16:19 -04:00
Linus Torvalds 52ad096465 Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (53 commits)
  NFS: Fix a resolution problem with nfs_inode->cache_change_attribute
  NFS: Fix the resolution problem with nfs_inode_attrs_need_update()
  NFS: Changes to inode->i_nlinks must set the NFS_INO_INVALID_ATTR flag
  RPC/RDMA: ensure connection attempt is complete before signalling.
  RPC/RDMA: correct the reconnect timer backoff
  RPC/RDMA: optionally emit useful transport info upon connect/disconnect.
  RPC/RDMA: reformat a debug printk to keep lines together.
  RPC/RDMA: harden connection logic against missing/late rdma_cm upcalls.
  RPC/RDMA: fix connect/reconnect resource leak.
  RPC/RDMA: return a consistent error, when connect fails.
  RPC/RDMA: adhere to protocol for unpadded client trailing write chunks.
  RPC/RDMA: avoid an oops due to disconnect racing with async upcalls.
  RPC/RDMA: maintain the RPC task bytes-sent statistic.
  RPC/RDMA: suppress retransmit on RPC/RDMA clients.
  RPC/RDMA: fix connection IRD/ORD setting
  RPC/RDMA: support FRMR client memory registration.
  RPC/RDMA: check selected memory registration mode at runtime.
  RPC/RDMA: add data types and new FRMR memory registration enum.
  RPC/RDMA: refactor the inline memory registration code.
  NFS: fix nfs_parse_ip_address() corner case
  ...
2008-10-16 15:39:20 -07:00
Linus Torvalds c813b4e16e Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (46 commits)
  UIO: Fix mapping of logical and virtual memory
  UIO: add automata sercos3 pci card support
  UIO: Change driver name of uio_pdrv
  UIO: Add alignment warnings for uio-mem
  Driver core: add bus_sort_breadthfirst() function
  NET: convert the phy_device file to use bus_find_device_by_name
  kobject: Cleanup kobject_rename and !CONFIG_SYSFS
  kobject: Fix kobject_rename and !CONFIG_SYSFS
  sysfs: Make dir and name args to sysfs_notify() const
  platform: add new device registration helper
  sysfs: use ilookup5() instead of ilookup5_nowait()
  PNP: create device attributes via default device attributes
  Driver core: make bus_find_device_by_name() more robust
  usb: turn dev_warn+WARN_ON combos into dev_WARN
  debug: use dev_WARN() rather than WARN_ON() in device_pm_add()
  debug: Introduce a dev_WARN() function
  sysfs: fix deadlock
  device model: Do a quickcheck for driver binding before doing an expensive check
  Driver core: Fix cleanup in device_create_vargs().
  Driver core: Clarify device cleanup.
  ...
2008-10-16 12:40:26 -07:00
Linus Torvalds c8d8a2321f Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  module: remove CONFIG_KMOD in comment after #endif
  remove CONFIG_KMOD from fs
  remove CONFIG_KMOD from drivers

Manually fix conflict due to include cleanups in drivers/md/md.c
2008-10-16 12:38:34 -07:00
Linus Torvalds e4856a70cf Merge branch 'personality' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'personality' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
  [PATCH] remove unused ibcs2/PER_SVR4 in SET_PERSONALITY
2008-10-16 12:32:52 -07:00
Jeff Layton 469ee614aa [CIFS] eliminate usage of kthread_stop for cifsd
When cifs_demultiplex_thread was converted to a kthread based kernel
thread, great pains were taken to make it so that kthread_stop would be
used to bring it down. This just added unnecessary complexity since we
needed to use a signal anyway to break out of kernel_recvmsg.

Also, cifs_demultiplex_thread does a bit of cleanup as it's exiting, and
we need to be certain that this gets done. It's possible for a kthread
to exit before its main function is ever run if kthread_stop is called
soon after its creation. While I'm not sure that this is a real problem
with cifsd now, it could be at some point in the future if cifs_mount is
ever changed to bring down the thread quickly.

The upshot here is that using kthread_stop to bring down the thread just
adds extra complexity with no real benefit. This patch changes the code
to use the original method to bring down the thread, but still leaves it
so that the thread is actually started with kthread_run.

This seems to fix the deadlock caused by the reproducer in this bug
report:

https://bugzilla.samba.org/show_bug.cgi?id=5720

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-10-16 18:46:39 +00:00
Steve French 2c1b861539 [CIFS] Add nodfs mount option
Older Samba servers (e.g. 3.0.24 from Debian etch) don't work correctly
if DFS paths are used. Such servers claim that they support DFS, but fail
to process some requests with DFS paths. Starting with Linux 2.6.26,
the cifs client sends DFS paths in such situations, rendering
it unusable with older Samba servers.

The nodfs mount option forces a share to be used with non-DFS paths,
even if the server claims that it supports DFS.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Acked-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-10-16 18:35:21 +00:00
Thomas Petazzoni ebf3f09c63 Configure out AIO support
This patch adds the CONFIG_AIO option, which allows removing support
for asynchronous I/O operations that are not necessarily used by
applications, particularly on embedded devices. As this is a
size-reduction option, it depends on CONFIG_EMBEDDED. It saves
~7 kilobytes of kernel code/data:

   text	   data	    bss	    dec	    hex	filename
1115067	 119180	 217088	1451335	 162547	vmlinux
1108025	 119048	 217088	1444161	 160941	vmlinux.new
  -7042    -132       0   -7174   -1C06 +/-

This patch was originally written by Matt Mackall
<mpm@selenic.com>, and is part of the Linux Tiny project.

[randy.dunlap@oracle.com: build fix]
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:51 -07:00
Nick Piggin 15b4650e55 afs: convert to new aops
Cannot assume writes will fully complete, so this conversion goes the easy
way and always brings the page uptodate before the write.

[dhowells@redhat.com: style tweaks]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:48 -07:00
Oleg Nesterov 07edbde508 pid_ns: de_thread: kill the now unneeded ->child_reaper change
de_thread() checks if the old leader was the ->child_reaper; this is no
longer possible.  With the previous patch ->group_leader itself will
change ->child_reaper on exit.

Henceforth find_new_reaper() is the only function (apart from
initialization) which plays with ->child_reaper.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:47 -07:00
Alexey Dobriyan f40cbaa5b0 proc: move sysrq-trigger out of fs/proc/
Move it into sysrq.c, along with the rest of the sysrq implementation.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:47 -07:00
Kay Sievers ac0d86f580 block: sanitize invalid partition table entries
We currently blindly follow whatever the partition table claims about the
disk, and let the kernel create block devices which cannot be accessed.
Trying to identify the device leads to kernel logs full of:
  sdb: rw=0, want=73392, limit=28800
  attempt to access beyond end of device

Here is an example of a broken partition table, where sdb2 starts
beyond the end of the disk, and sdb3 is larger than the entire disk:
  Disk /dev/sdb: 14 MB, 14745600 bytes
  1 heads, 29 sectors/track, 993 cylinders, total 28800 sectors
     Device Boot      Start         End      Blocks   Id  System
  /dev/sdb1              29        7800        3886   83  Linux
  /dev/sdb2           37801       45601        3900+  83  Linux
  /dev/sdb3           15602       73402       28900+  83  Linux
  /dev/sdb4           23403       28796        2697   83  Linux

The kernel creates these completely invalid devices, which can not be
accessed, or may lead to other unpredictable failures:
  grep . /sys/class/block/sdb*/{start,size}
  /sys/class/block/sdb/size:28800
  /sys/class/block/sdb1/start:29
  /sys/class/block/sdb1/size:7772
  /sys/class/block/sdb2/start:37801
  /sys/class/block/sdb2/size:7801
  /sys/class/block/sdb3/start:15602
  /sys/class/block/sdb3/size:57801
  /sys/class/block/sdb4/start:23403
  /sys/class/block/sdb4/size:5394

With this patch, we ignore partitions which start beyond the end of the disk,
and limit partitions to the end of the disk if they claim to be larger:
  grep . /sys/class/block/sdb*/{start,size}
  /sys/class/block/sdb/size:28800
  /sys/class/block/sdb1/start:29
  /sys/class/block/sdb1/size:7772
  /sys/class/block/sdb3/start:15602
  /sys/class/block/sdb3/size:13198
  /sys/class/block/sdb4/start:23403
  /sys/class/block/sdb4/size:5394

These warnings are printed to the kernel log:
  sdb: p2 ignored, start 37801 is behind the end of the disk
  sdb: p3 size 57801 limited to end of disk
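
The two rules described above can be sketched in user-space C. This is only an illustration of the check, not the code from the patch: the `struct part` type and the `sanitize_part()` helper are hypothetical names, and the exact boundary condition is an assumption.

```c
#include <stdio.h>

/* Hypothetical partition record; all values are in sectors. */
struct part {
    unsigned long long start;
    unsigned long long size;
};

/*
 * Returns 0 if the partition should be ignored, 1 if it is kept.
 * A kept partition may have its size reduced so it ends at the disk end.
 */
static int sanitize_part(const char *disk, int partno,
                         struct part *p, unsigned long long disk_sectors)
{
    if (p->start >= disk_sectors) {
        printf("%s: p%d ignored, start %llu is behind the end of the disk\n",
               disk, partno, p->start);
        return 0;
    }
    if (p->size > disk_sectors - p->start) {
        printf("%s: p%d size %llu limited to end of disk\n",
               disk, partno, p->size);
        p->size = disk_sectors - p->start;
    }
    return 1;
}
```

Applied to the table above with a 28800-sector disk, sdb2 (start 37801) is dropped and sdb3 (start 15602) is cut to 28800 - 15602 = 13198 sectors, which matches the sysfs listing.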

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:47 -07:00
Adrian Bunk 6722e45c2d fs/partitions/acorn.c: remove dead code
I missed this when I did the arm26 removal.

Reported-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:47 -07:00
Alexey Dobriyan 4cea5ceb4c COMPAT_BINFMT_ELF definition tweak
Don't repeat BINFMT_ELF definition, simply multiply COMPAT and BINFMT_ELF.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:47 -07:00
Paul Mundt 5edc2a5123 binfmt_elf_fdpic: wire up AT_EXECFD, AT_EXECFN, AT_SECURE
These auxvec entries are the only ones left unhandled out of the current
base implementation. This syncs up binfmt_elf_fdpic with linux/auxvec.h
and current binfmt_elf.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Paul Mundt c7637941d1 binfmt_elf_fdpic: convert initial stack alignment to arch_align_stack()
binfmt_elf_fdpic seems to have grabbed a hard-coded hack from an ancient
version of binfmt_elf in order to try and fix up initial stack alignment
on multi-threaded x86. Besides being unused, the hack was also
pushed down beyond the first set of operations on the stack pointer,
negating its entire purpose.

These days, we have an architecture independent arch_align_stack(), so we
switch to using that instead. Move the initial alignment up before the
initial stores while we're at it.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Paul Mundt ec23847d6c binfmt_elf_fdpic: support auxvec base platform string
Commit 483fad1c3f ("ELF loader support for
auxvec base platform string") introduced AT_BASE_PLATFORM, but only
implemented it for binfmt_elf.

Given that AT_VECTOR_SIZE_BASE is unconditionally enlarged for us, and
it's only optionally added in for the platforms that set
ELF_BASE_PLATFORM, wire it up for binfmt_elf_fdpic, too.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Adrian Bunk b73c29f6b0 quota: remove CVS keywords
Remove CVS keywords that weren't updated for a long time from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Julien Brunel 67b172c097 fs/reiserfs: use an IS_ERR test rather than a NULL test
In case of error, the function open_xa_dir returns an ERR pointer, but
never returns a NULL pointer.  So a NULL test that comes after an IS_ERR
test should be deleted.

The semantic match that finds this problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@match_bad_null_test@
expression x, E;
statement S1,S2;
@@
x = open_xa_dir(...)
... when != x = E
(
*  if (x == NULL && ...) S1 else S2
|
*  if (x == NULL || ...) S1 else S2
)
// </smpl>

Signed-off-by: Julien Brunel <brunel@diku.dk>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Adrian Bunk 6b23ea7679 reiserfs/procfs.c: remove CVS keywords
Remove CVS keywords that weren't updated for a long time from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Eric Sesterhenn d38b7aa7fc hfs: fix namelength memory corruption
Fix a stack corruption caused by a corrupted hfs filesystem.  If the
catalog name length is corrupted the memcpy overwrites the catalog btree
structure.  Since the field is limited to HFS_NAMELEN bytes in the
structure and the file format, we throw an error if it is too long.
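
A minimal sketch of the kind of length check described above, written as user-space C. The `NAMELEN` value of 31, the `struct name` layout, and the `copy_name()` helper are assumptions for illustration and are not taken from the patch.

```c
#include <errno.h>
#include <string.h>

#define NAMELEN 31  /* assumed stand-in for HFS_NAMELEN */

/* Hypothetical fixed-size name field, as found in an on-disk key. */
struct name {
    unsigned char len;
    unsigned char name[NAMELEN];
};

/*
 * Copy a name read from disk into a fixed-size field.
 * The length is validated first, so a corrupted length cannot make
 * memcpy() write past the end of dst->name.
 */
static int copy_name(struct name *dst, const unsigned char *src,
                     unsigned int len)
{
    if (len > NAMELEN)
        return -EIO;    /* corrupted length: refuse instead of overflowing */
    dst->len = (unsigned char)len;
    memcpy(dst->name, src, len);
    return 0;
}
```

The point of the design is that the bound is checked before the copy, so the caller sees an error rather than a silently overwritten neighbouring structure.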

Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Eric Sesterhenn 649f1ee6c7 hfsplus: check read_mapping_page() return value
While testing more corrupted images with hfsplus, I came across
one which triggered the following bug:

[15840.675016] BUG: unable to handle kernel paging request at fffffffb
[15840.675016] IP: [<c0116a4f>] kmap+0x15/0x56
[15840.675016] *pde = 00008067 *pte = 00000000
[15840.675016] Oops: 0000 [#1] PREEMPT DEBUG_PAGEALLOC
[15840.675016] Modules linked in:
[15840.675016]
[15840.675016] Pid: 11575, comm: ln Not tainted (2.6.27-rc4-00123-gd3ee1b4-dirty #29)
[15840.675016] EIP: 0060:[<c0116a4f>] EFLAGS: 00010202 CPU: 0
[15840.675016] EIP is at kmap+0x15/0x56
[15840.675016] EAX: 00000246 EBX: fffffffb ECX: 00000000 EDX: cab919c0
[15840.675016] ESI: 000007dd EDI: cab0bcf4 EBP: cab0bc98 ESP: cab0bc94
[15840.675016]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[15840.675016] Process ln (pid: 11575, ti=cab0b000 task=cab919c0 task.ti=cab0b000)
[15840.675016] Stack: 00000000 cab0bcdc c0231cfb 00000000 cab0bce0 00000800 ca9290c0 fffffffb
[15840.675016]        cab145d0 cab919c0 cab15998 22222222 22222222 22222222 00000001 cab15960
[15840.675016]        000007dd cab0bcf4 cab0bd04 c022cb3a cab0bcf4 cab15a6c ca9290c0 00000000
[15840.675016] Call Trace:
[15840.675016]  [<c0231cfb>] ? hfsplus_block_allocate+0x6f/0x2d3
[15840.675016]  [<c022cb3a>] ? hfsplus_file_extend+0xc4/0x1db
[15840.675016]  [<c022ce41>] ? hfsplus_get_block+0x8c/0x19d
[15840.675016]  [<c06adde4>] ? sub_preempt_count+0x9d/0xab
[15840.675016]  [<c019ece6>] ? __block_prepare_write+0x147/0x311
[15840.675016]  [<c0161934>] ? __grab_cache_page+0x52/0x73
[15840.675016]  [<c019ef4f>] ? block_write_begin+0x79/0xd5
[15840.675016]  [<c022cdb5>] ? hfsplus_get_block+0x0/0x19d
[15840.675016]  [<c019f22a>] ? cont_write_begin+0x27f/0x2af
[15840.675016]  [<c022cdb5>] ? hfsplus_get_block+0x0/0x19d
[15840.675016]  [<c0139ebe>] ? tick_program_event+0x28/0x4c
[15840.675016]  [<c013bd35>] ? trace_hardirqs_off+0xb/0xd
[15840.675016]  [<c022b723>] ? hfsplus_write_begin+0x2d/0x32
[15840.675016]  [<c022cdb5>] ? hfsplus_get_block+0x0/0x19d
[15840.675016]  [<c0161988>] ? pagecache_write_begin+0x33/0x107
[15840.675016]  [<c01879e5>] ? __page_symlink+0x3c/0xae
[15840.675016]  [<c019ad34>] ? __mark_inode_dirty+0x12f/0x137
[15840.675016]  [<c0187a70>] ? page_symlink+0x19/0x1e
[15840.675016]  [<c022e6eb>] ? hfsplus_symlink+0x41/0xa6
[15840.675016]  [<c01886a9>] ? vfs_symlink+0x99/0x101
[15840.675016]  [<c018a2f6>] ? sys_symlinkat+0x6b/0xad
[15840.675016]  [<c018a348>] ? sys_symlink+0x10/0x12
[15840.675016]  [<c01038bd>] ? sysenter_do_call+0x12/0x31
[15840.675016]  =======================
[15840.675016] Code: 00 00 75 10 83 3d 88 2f ec c0 02 75 07 89 d0 e8 12 56 05 00 5d c3 55 ba 06 00 00 00 89 e5 53 89 c3 b8 3d eb 7e c0 e8 16 74 00 00 <8b> 03 c1 e8 1e 69 c0 d8 02 00 00 05 b8 69 8e c0 2b 80 c4 02 00
[15840.675016] EIP: [<c0116a4f>] kmap+0x15/0x56 SS:ESP 0068:cab0bc94
[15840.675016] ---[ end trace 4fea40dad6b70e5f ]---

This happens because the return value of read_mapping_page() is passed on
to kmap unchecked.  The bug is triggered after the first
read_mapping_page() in hfsplus_block_allocate(); this patch fixes all
three usages in this function but leaves the ones further down in the
file unchanged.
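
The faulting address fffffffb in the oops is -5 as a 32-bit value, which is consistent with an error pointer carrying -EIO. The pattern of checking such a pointer before use can be sketched in user space as follows. The helpers below imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() idiom; `read_page()` and `first_byte()` are hypothetical stand-ins and not functions from the patch.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space imitation of the kernel's error-pointer helpers. */
static void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static int IS_ERR(const void *ptr)
{
    /* Error pointers occupy the top MAX_ERRNO values of the address space. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for read_mapping_page(): fails when asked to. */
static void *read_page(void *backing, int fail)
{
    return fail ? ERR_PTR(-EIO) : backing;
}

/* Returns 0 and stores the first byte of the page, or a negative errno. */
static int first_byte(void *backing, int fail, unsigned char *out)
{
    void *page = read_page(backing, fail);

    if (IS_ERR(page))               /* check before dereferencing */
        return (int)PTR_ERR(page);
    *out = *(unsigned char *)page;
    return 0;
}
```

Without the IS_ERR() test, the error value would be dereferenced as if it were a valid page, which is the failure mode shown in the trace.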

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Eric Sesterhenn efc7ffcb42 hfsplus: fix Buffer overflow with a corrupted image
When an hfsplus image gets corrupted, the catalog namelength field may
be damaged.  If we mount such an image, the memcpy() in
hfsplus_cat_build_key_uni() writes more than the 255 that fit in the name
field.  Depending on the size of the overwritten data, we either only get
memory corruption or also trigger an oops like this:

[  221.628020] BUG: unable to handle kernel paging request at c82b0000
[  221.629066] IP: [<c022d4b1>] hfsplus_find_cat+0x10d/0x151
[  221.629066] *pde = 0ea29163 *pte = 082b0160
[  221.629066] Oops: 0002 [#1] PREEMPT DEBUG_PAGEALLOC
[  221.629066] Modules linked in:
[  221.629066]
[  221.629066] Pid: 4845, comm: mount Not tainted (2.6.27-rc4-00123-gd3ee1b4-dirty #28)
[  221.629066] EIP: 0060:[<c022d4b1>] EFLAGS: 00010206 CPU: 0
[  221.629066] EIP is at hfsplus_find_cat+0x10d/0x151
[  221.629066] EAX: 00000029 EBX: 00016210 ECX: 000042c2 EDX: 00000002
[  221.629066] ESI: c82d70ca EDI: c82b0000 EBP: c82d1bcc ESP: c82d199c
[  221.629066]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[  221.629066] Process mount (pid: 4845, ti=c82d1000 task=c8224060 task.ti=c82d1000)
[  221.629066] Stack: c080b3c4 c82aa8f8 c82d19c2 00016210 c080b3be c82d1bd4 c82aa8f0 00000300
[  221.629066]        01000000 750008b1 74006e00 74006900 65006c00 c82d6400 c013bd35 c8224060
[  221.629066]        00000036 00000046 c82d19f0 00000082 c8224548 c8224060 00000036 c0d653cc
[  221.629066] Call Trace:
[  221.629066]  [<c013bd35>] ? trace_hardirqs_off+0xb/0xd
[  221.629066]  [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b
[  221.629066]  [<c013bd35>] ? trace_hardirqs_off+0xb/0xd
[  221.629066]  [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b
[  221.629066]  [<c013bd35>] ? trace_hardirqs_off+0xb/0xd
[  221.629066]  [<c0107aa3>] ? native_sched_clock+0x82/0x96
[  221.629066]  [<c01302d2>] ? __kernel_text_address+0x1b/0x27
[  221.629066]  [<c010487a>] ? dump_trace+0xca/0xd6
[  221.629066]  [<c0109e32>] ? save_stack_address+0x0/0x2c
[  221.629066]  [<c0109eaf>] ? save_stack_trace+0x1c/0x3a
[  221.629066]  [<c013b571>] ? save_trace+0x37/0x8d
[  221.629066]  [<c013b62e>] ? add_lock_to_list+0x67/0x8d
[  221.629066]  [<c013ea1c>] ? validate_chain+0x8a4/0x9f4
[  221.629066]  [<c013553d>] ? down+0xc/0x2f
[  221.629066]  [<c013f1f6>] ? __lock_acquire+0x68a/0x6e0
[  221.629066]  [<c013bd35>] ? trace_hardirqs_off+0xb/0xd
[  221.629066]  [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b
[  221.629066]  [<c013bd35>] ? trace_hardirqs_off+0xb/0xd
[  221.629066]  [<c0107aa3>] ? native_sched_clock+0x82/0x96
[  221.629066]  [<c013da5d>] ? mark_held_locks+0x43/0x5a
[  221.629066]  [<c013dc3a>] ? trace_hardirqs_on+0xb/0xd
[  221.629066]  [<c013dbf4>] ? trace_hardirqs_on_caller+0xf4/0x12f
[  221.629066]  [<c06abec8>] ? _spin_unlock_irqrestore+0x42/0x58
[  221.629066]  [<c013555c>] ? down+0x2b/0x2f
[  221.629066]  [<c022aa68>] ? hfsplus_iget+0xa0/0x154
[  221.629066]  [<c022b0b9>] ? hfsplus_fill_super+0x280/0x447
[  221.629066]  [<c0107aa3>] ? native_sched_clock+0x82/0x96
[  221.629066]  [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b
[  221.629066]  [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b
[  221.629066]  [<c013f1f6>] ? __lock_acquire+0x68a/0x6e0
[  221.629066]  [<c041c9e4>] ? string+0x2b/0x74
[  221.629066]  [<c041cd16>] ? vsnprintf+0x2e9/0x512
[  221.629066]  [<c010487a>] ? dump_trace+0xca/0xd6
[  221.629066]  [<c0109eaf>] ? save_stack_trace+0x1c/0x3a
[  221.629066]  [<c0109eaf>] ? save_stack_trace+0x1c/0x3a
[  221.629066]  [<c013b571>] ? save_trace+0x37/0x8d
[  221.629066]  [<c013b62e>] ? add_lock_to_list+0x67/0x8d
[  221.629066]  [<c013ea1c>] ? validate_chain+0x8a4/0x9f4
[  221.629066]  [<c01354d3>] ? up+0xc/0x2f
[  221.629066]  [<c013f1f6>] ? __lock_acquire+0x68a/0x6e0
[  221.629066]  [<c013bd35>] ? trace_hardirqs_off+0xb/0xd
[  221.629066]  [<c013bca3>] ? trace_hardirqs_off_caller+0x14/0x9b
[  221.629066]  [<c013bd35>] ? trace_hardirqs_off+0xb/0xd
[  221.629066]  [<c0107aa3>] ? native_sched_clock+0x82/0x96
[  221.629066]  [<c041cfb7>] ? snprintf+0x1b/0x1d
[  221.629066]  [<c01ba466>] ? disk_name+0x25/0x67
[  221.629066]  [<c0183960>] ? get_sb_bdev+0xcd/0x10b
[  221.629066]  [<c016ad92>] ? kstrdup+0x2a/0x4c
[  221.629066]  [<c022a7b3>] ? hfsplus_get_sb+0x13/0x15
[  221.629066]  [<c022ae39>] ? hfsplus_fill_super+0x0/0x447
[  221.629066]  [<c0183583>] ? vfs_kern_mount+0x3b/0x76
[  221.629066]  [<c0183602>] ? do_kern_mount+0x32/0xba
[  221.629066]  [<c01960d4>] ? do_new_mount+0x46/0x74
[  221.629066]  [<c0196277>] ? do_mount+0x175/0x193
[  221.629066]  [<c013dbf4>] ? trace_hardirqs_on_caller+0xf4/0x12f
[  221.629066]  [<c01663b2>] ? __get_free_pages+0x1e/0x24
[  221.629066]  [<c06ac07b>] ? lock_kernel+0x19/0x8c
[  221.629066]  [<c01962e6>] ? sys_mount+0x51/0x9b
[  221.629066]  [<c01962f9>] ? sys_mount+0x64/0x9b
[  221.629066]  [<c01038bd>] ? sysenter_do_call+0x12/0x31
[  221.629066]  =======================
[  221.629066] Code: 89 c2 c1 e2 08 c1 e8 08 09 c2 8b 85 e8 fd ff ff 66 89 50 06 89 c7 53 83 c7 08 56 57 68 c4 b3 80 c0 e8 8c 5c ef ff 89 d9 c1 e9 02 <f3> a5 89 d9 83 e1 03 74 02 f3 a4 83 c3 06 8b 95 e8 fd ff ff 0f
[  221.629066] EIP: [<c022d4b1>] hfsplus_find_cat+0x10d/0x151 SS:ESP 0068:c82d199c
[  221.629066] ---[ end trace e417a1d67f0d0066 ]---

Since hfsplus_cat_build_key_uni() returns void and only has one callsite,
the check is performed at the callsite.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Mike Crowe 81a73719d1 hfsplus: quieten down mounting hfsplus journaled fs read only
Check whether the file system was to be mounted read only anyway before
warning about changing the mount to read only.

Signed-off-by: Mike Crowe <mac@mcrowe.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Harvey Harrison 152b95a1ed befs: annotate fs32 on tests for superblock endianness
Does compile-time byteswapping rather than runtime.

Noticed by sparse:
fs/befs/super.c:29:6: warning: cast to restricted __le32
fs/befs/super.c:29:6: warning: cast from restricted fs32
fs/befs/super.c:31:11: warning: cast to restricted __be32
fs/befs/super.c:31:11: warning: cast from restricted fs32
fs/befs/super.c:31:11: warning: cast to restricted __be32
fs/befs/super.c:31:11: warning: cast from restricted fs32
fs/befs/super.c:31:11: warning: cast to restricted __be32
fs/befs/super.c:31:11: warning: cast from restricted fs32
fs/befs/super.c:31:11: warning: cast to restricted __be32
fs/befs/super.c:31:11: warning: cast from restricted fs32
fs/befs/super.c:31:11: warning: cast to restricted __be32
fs/befs/super.c:31:11: warning: cast from restricted fs32
fs/befs/super.c:31:11: warning: cast to restricted __be32
fs/befs/super.c:31:11: warning: cast from restricted fs32
fs/befs/linuxvfs.c:811:7: warning: cast to restricted __le32
fs/befs/linuxvfs.c:811:7: warning: cast from restricted fs32
fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32
fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32
fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32
fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32
fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32
fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32
fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32
fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32
fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32
fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32
fs/befs/linuxvfs.c:812:7: warning: cast to restricted __be32
fs/befs/linuxvfs.c:812:7: warning: cast from restricted fs32

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: "Sergey S. Kostyliov" <rathamahata@php4.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Eric Sandeen bd39597cbd ext2: avoid printk floods in the face of directory corruption
A very large directory with many read failures (either due to storage
problems, or due to invalid size & blocks from corruption) will generate a
printk storm as the filesystem continues to try to read all the blocks.
This flood of messages can tie up the box until it is complete - which may
be a very long time, especially for very large corrupted values.

This is fixed by only reporting the corruption once each time we try to
read the directory.
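
The "report once per directory read" idea can be sketched as below. This is a user-space illustration only: `scan_dir()` and its `block_ok` array are hypothetical, and the message text is not the one used by ext2.

```c
#include <stdio.h>

/*
 * Walk the blocks of one directory. block_ok[n] is nonzero when block n
 * could be read. Returns the number of error messages printed.
 */
static int scan_dir(const int *block_ok, int nblocks)
{
    int reported = 0;   /* reset for every directory read */
    int messages = 0;

    for (int n = 0; n < nblocks; n++) {
        if (block_ok[n])
            continue;
        if (!reported) {
            fprintf(stderr,
                    "error reading directory block %d "
                    "(further errors suppressed)\n", n);
            reported = 1;
            messages++;
        }
        /* Skip the bad block and keep scanning the remaining ones. */
    }
    return messages;
}
```

However many blocks fail, a single call prints at most one message, which bounds the log output per readdir attempt.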

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:46 -07:00
Mingming Cao d707d31c97 ext2: fix ext2 block reservation early ENOSPC issue
We could run into an ENOSPC error on ext2 even when there are free blocks on
the filesystem.

The problem is triggered when the goal block group has 0 free blocks and
the remaining block groups are skipped due to the check of
"free_blocks < windowsz/2".  The current code could fall back to
non-reservation allocation to prevent early ENOSPC after examining all the
block groups with reservation on, but this code was bypassed if the
reservation window was already turned off, which is true in this case.

This patch fixes two issues:
1) We don't need to turn off block reservation if the goal block group has
0 free blocks left; instead we continue searching the rest of the block groups.

The intention of the current code is to turn off block reservation if the
goal allocation group has a few (some) free blocks left (not enough to
make the desired reservation window), so as to try to allocate in the goal
block group and get better locality.  But if the goal block group has 0 free
blocks, it should leave block reservation on and continue searching
the next block groups, rather than turning off block reservation completely.

2) We don't need to check the window size if block reservation is off.

The problem was originally found and fixed in ext4.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:45 -07:00
Ian Kent 8d7b48e0bc autofs4: add miscellaneous device for ioctls
Add a miscellaneous device to the autofs4 module for routing ioctls.  This
provides the ability to obtain an ioctl file handle for an autofs mount
point that is possibly covered by another mount.

The actual problem with autofs is that it can't reconnect to existing
mounts.  One immediately thinks that just adding the ability to remount
autofs file systems would solve it, but alas, that can't work.  This is
because autofs direct mounts and the implementation of "on demand mount
and expire" of nested mount trees have the file system mounted on top of
the mount trigger dentry.

To resolve this, a miscellaneous device node for routing ioctl commands to
these mount points has been implemented in the autofs4 kernel module and a
library added to autofs.  This provides the ability to open a file
descriptor for these over-mounted autofs mount points.

Please refer to Documentation/filesystems/autofs4-mount-control.txt for a
discussion of the problem, implementation alternatives considered and a
description of the interface.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:39 -07:00
Ian Kent c0f54d3e54 autofs4: track uid and gid of last mount requester
Track the uid and gid of the last process to request a mount on an
autofs dentry.

[akpm@linux-foundation.org: fix tpyo in comment]
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:39 -07:00
Ian Kent bb979d7fc3 autofs4: cleanup autofs mount type usage
Usage of the AUTOFS_TYPE_* defines is a little confusing and appears
inconsistent.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:39 -07:00
Tyler Hicks 624ae52845 eCryptfs: remove netlink transport
The netlink transport code has not worked for a while and the miscdev
transport is a simpler solution.  This patch removes the netlink code and
makes the miscdev transport the only eCryptfs kernel to userspace
transport.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dustin Kirkland <kirkland@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:39 -07:00
Badari Pulavarty 807b7ebe41 ecryptfs: convert to use new aops
Convert ecryptfs to use write_begin/write_end

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:39 -07:00
Michael Halcrow 7d6c704558 eCryptfs: remove retry loop in ecryptfs_readdir()
The retry block in ecryptfs_readdir() has been in the eCryptfs code base
for a while, apparently for no good reason.  This loop could potentially
run without terminating.  This patch removes the loop, instead erroring
out if vfs_readdir() on the lower file fails.

Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Reported-by: Al Viro <viro@ZinIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:38 -07:00
Kirill A. Shutemov bf2a9a3963 Allow recursion in binfmt_script and binfmt_misc
binfmt_script and binfmt_misc disallow recursion to avoid stack overflow,
using sh_bang and misc_bang.  This causes problems in some cases:

$ echo '#!/bin/ls' > /tmp/t0
$ echo '#!/tmp/t0' > /tmp/t1
$ echo '#!/tmp/t1' > /tmp/t2
$ chmod +x /tmp/t*
$ /tmp/t2
zsh: exec format error: /tmp/t2

Similar problem with binfmt_misc.

This patch introduces the field 'recursion_depth' into struct linux_binprm to
track the recursion level in binfmt_misc and binfmt_script.  If the recursion
level is more than BINPRM_MAX_RECURSION, it generates -ENOEXEC.
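
The depth-counting idea can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the patch itself: `struct binprm`, `load_script()`, and the limit value of 4 are hypothetical stand-ins.

```c
#include <errno.h>

#define MAX_RECURSION 4     /* assumed limit, for illustration only */

/* Hypothetical stand-in for the relevant part of struct linux_binprm. */
struct binprm {
    unsigned int recursion_depth;
};

/*
 * chain_len is the number of script layers left to unwrap before a real
 * binary is reached (/tmp/t2 -> /tmp/t1 -> /tmp/t0 -> /bin/ls is 3).
 */
static int load_script(struct binprm *bprm, unsigned int chain_len)
{
    int ret;

    if (chain_len == 0)
        return 0;                   /* reached a non-script binary */
    if (bprm->recursion_depth > MAX_RECURSION)
        return -ENOEXEC;            /* too many nested interpreters */

    bprm->recursion_depth++;
    ret = load_script(bprm, chain_len - 1);
    bprm->recursion_depth--;
    return ret;
}
```

With a counter instead of a boolean flag, the three-level chain from the example above is accepted, while an unbounded or cyclic chain still ends in -ENOEXEC instead of exhausting the stack.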

[akpm@linux-foundation.org: make linux_binprm.recursion_depth a uint]
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:38 -07:00
Kirill A. Shutemov 53112488be alpha: introduce field 'taso' into struct linux_binprm
This change is Alpha-specific.  It adds field 'taso' into struct
linux_binprm to remember if the application is TASO.  Previously, field
sh_bang was used for this purpose.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:38 -07:00
Adrian Bunk cde162c2a9 binfmt_som.c: add MODULE_LICENSE
Add the missing MODULE_LICENSE("GPL").

Reported-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:38 -07:00
Christoph Hellwig f7a5000f7a compat: move cp_compat_stat to common code
struct stat / compat_stat is the same on all architectures, so
cp_compat_stat should be, too.

Turns out it is, except that various architectures differ slightly: some
use high2lowuid/high2lowgid or direct assignment instead of the
SET_UID/SET_GID that expands to the correct one anyway.

This patch replaces the arch-specific cp_compat_stat implementations with
a common one based on the x86-64 one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net> [ sparc bits ]
Acked-by: Kyle McMartin <kyle@mcmartin.ca> [ parisc bits ]
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:33 -07:00
Francois Cami e1f8e87449 Remove Andrew Morton's old email accounts
People can use the real name as an index into MAINTAINERS to find the
current email address.

Signed-off-by: Francois Cami <francois.cami@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:32 -07:00
Davide Libenzi f337b9c583 epoll: drop unnecessary test
Thomas found that there is an unnecessary (always true) test in
ep_send_events().  The callback never inserts into ->rdllink while the
send loop is performed, and also does the ~EP_PRIVATE_BITS test.  Given
we're holding the mutex during this time, the conditions tested inside the
loop are always true.  This patch drops the test done inside the
re-insertion loop.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:32 -07:00
Jason Baron 362e6663ef exec.c, compat.c: fix count(), compat_count() bounds checking
With MAX_ARG_STRINGS set to 0x7FFFFFFF, and being passed to 'count()' and
compat_count(), it would appear that the current max bounds check of
fs/exec.c:394:

	if(++i > max)
		return -E2BIG;

would never trigger: since 'i' is of type int, its value would wrap and the
function would continue looping.

The simple fix seems to be changing ++i to i++ and checking for '>='.
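
For illustration only, here is a user-space sketch of the corrected check.
It is not the kernel's count(); the name count_strings() and its signature
are assumptions made for this example.  The comparison happens before the
increment, so 'i' never goes past 'max' and cannot wrap, even when max is
0x7FFFFFFF.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical user-space analogue of count(): returns the number of
 * entries in a NULL-terminated array, or -E2BIG if there are more than max.
 */
int count_strings(const char *const *argv, int max)
{
	int i = 0;

	assert(max >= 0);
	if (argv == NULL)
		return 0;

	while (argv[i] != NULL) {
		/* Check first, then increment: i stays <= max at all times. */
		if (i >= max)
			return -E2BIG;
		i++;
	}
	return i;
}
```

Writing the test and the increment as two statements has the same effect as
"i++ >= max", but never increments 'i' once it has reached the limit.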

Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Ollie Wild" <aaw@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:32 -07:00
Volodymyr G. Lukiianyk f4cfb18d79 uclinux: fix gzip header parsing in binfmt_flat.c
There are off-by-one errors in decompress_exec() when calculating the length of
optional "original file name" and "comment" fields: the "ret" index is not
incremented when terminating '\0' character is reached. The check of the buffer
overflow (after an "extra-field" length was taken into account) is also fixed.

I encountered this off-by-one error when I tried to reuse the
gzip-header-parsing part of the decompress_exec() function.  There was an
"original file name" field in the payload (with miscalculated length) and
zlib_inflate() returned Z_DATA_ERROR.  After a fix similar to this one,
everything worked fine.
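
As a rough illustration of the header walk described above, the sketch
below skips the optional gzip header fields following the layout in
RFC 1952.  It is not the code from binfmt_flat.c; gzip_payload_offset()
and skip_zstring() are hypothetical names.  Note that the terminating '\0'
of the name and comment fields is counted, and every length is checked
against the buffer size.

```c
#include <assert.h>
#include <stddef.h>

/* Flag bits of the FLG byte, as defined by RFC 1952. */
#define GZ_FHCRC    0x02
#define GZ_FEXTRA   0x04
#define GZ_FNAME    0x08
#define GZ_FCOMMENT 0x10

/*
 * Skip a NUL-terminated field starting at pos.  Returns the offset just
 * past the terminator, or 0 if the buffer ends before one is found.
 */
static size_t skip_zstring(const unsigned char *buf, size_t len, size_t pos)
{
	while (pos < len && buf[pos] != 0)
		pos++;
	if (pos >= len)
		return 0;
	return pos + 1;	/* step past the terminating '\0' as well */
}

/*
 * Hypothetical helper: returns the offset of the compressed payload, or -1
 * if the header is malformed or truncated.
 */
long gzip_payload_offset(const unsigned char *buf, size_t len)
{
	size_t pos = 10;	/* fixed part of the header */
	unsigned char flags;

	assert(buf != NULL);
	if (len < 10 || buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8)
		return -1;
	flags = buf[3];

	if (flags & GZ_FEXTRA) {
		size_t xlen;

		if (len - pos < 2)
			return -1;
		xlen = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
		pos += 2;
		if (len - pos < xlen)	/* overflow check after the extra field */
			return -1;
		pos += xlen;
	}
	if (flags & GZ_FNAME) {
		pos = skip_zstring(buf, len, pos);
		if (pos == 0)
			return -1;
	}
	if (flags & GZ_FCOMMENT) {
		pos = skip_zstring(buf, len, pos);
		if (pos == 0)
			return -1;
	}
	if (flags & GZ_FHCRC) {
		if (len - pos < 2)
			return -1;
		pos += 2;
	}
	return (long)pos;
}
```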

Signed-off-by: Volodymyr G Lukiianyk <volodymyrgl@gmail.com>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:29 -07:00
Eric W. Biederman 0b4a4fea25 kobject: Cleanup kobject_rename and !CONFIG_SYSFS
It finally dawned on me what the clean fix to sysfs_rename_dir
calling kobject_set_name is.  Move the work into kobject_rename
where it belongs.  The callers serialize us anyway so this is
safe.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-16 09:24:52 -07:00
Trent Piepho 8c0e3998f5 sysfs: Make dir and name args to sysfs_notify() const
Because they can be, and because code like this produces a warning if
they're not:

struct device_attribute dev_attr;

sysfs_notify(&kobj, NULL, dev_attr.attr.name);

Signed-off-by: Trent Piepho <tpiepho@freescale.com>
CC: Neil Brown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-16 09:24:51 -07:00
Tejun Heo 45c076c5d7 sysfs: use ilookup5() instead of ilookup5_nowait()
As inode creation is protected by sysfs_mutex, ilookup5_nowait()
always either fails to find at all or finds one which is fully
initialized, so using ilookup5_nowait() or ilookup5() doesn't make any
difference.  Switch to ilookup5(), as ilookup5_nowait() is planned to be
removed.  This change also makes lookup return value handling a bit simpler.

This change was suggested by Al Viro.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@hera.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-16 09:24:51 -07:00
Nick Piggin b31ca3f5df sysfs: fix deadlock
On Thu, Sep 11, 2008 at 10:27:10AM +0200, Ingo Molnar wrote:

> and it's working fine on most boxes. One testbox found this new locking
> scenario:
>
> PM: Adding info for No Bus:vcsa7
> EDAC DEBUG: MC0: i82860_check()
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.27-rc6-tip #1
> -------------------------------------------------------
> X/4873 is trying to acquire lock:
>  (&bb->mutex){--..}, at: [<c020ba20>] mmap+0x40/0xa0
>
> but task is already holding lock:
>  (&mm->mmap_sem){----}, at: [<c0125a1e>] sys_mmap2+0x8e/0xc0
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&mm->mmap_sem){----}:
>        [<c017dc96>] validate_chain+0xa96/0xf50
>        [<c017ef2b>] __lock_acquire+0x2cb/0x5b0
>        [<c017f299>] lock_acquire+0x89/0xc0
>        [<c01aa8fb>] might_fault+0x6b/0x90
>        [<c040b618>] copy_to_user+0x38/0x60
>        [<c020bcfb>] read+0xfb/0x170
>        [<c01c09a5>] vfs_read+0x95/0x110
>        [<c01c1443>] sys_pread64+0x63/0x80
>        [<c012146f>] sysenter_do_call+0x12/0x43
>        [<ffffffff>] 0xffffffff
>
> -> #0 (&bb->mutex){--..}:
>        [<c017d8b7>] validate_chain+0x6b7/0xf50
>        [<c017ef2b>] __lock_acquire+0x2cb/0x5b0
>        [<c017f299>] lock_acquire+0x89/0xc0
>        [<c0d6f2ab>] __mutex_lock_common+0xab/0x3c0
>        [<c0d6f698>] mutex_lock_nested+0x38/0x50
>        [<c020ba20>] mmap+0x40/0xa0
>        [<c01b111e>] mmap_region+0x14e/0x450
>        [<c01b170f>] do_mmap_pgoff+0x2ef/0x310
>        [<c0125a3d>] sys_mmap2+0xad/0xc0
>        [<c012146f>] sysenter_do_call+0x12/0x43
>        [<ffffffff>] 0xffffffff
>
> other info that might help us debug this:
>
> 1 lock held by X/4873:
>  #0:  (&mm->mmap_sem){----}, at: [<c0125a1e>] sys_mmap2+0x8e/0xc0
>
> stack backtrace:
> Pid: 4873, comm: X Not tainted 2.6.27-rc6-tip #1
>  [<c017cd09>] print_circular_bug_tail+0x79/0xc0
>  [<c017d8b7>] validate_chain+0x6b7/0xf50
>  [<c017a5b5>] ? trace_hardirqs_off_caller+0x15/0xb0
>  [<c017ef2b>] __lock_acquire+0x2cb/0x5b0
>  [<c017f299>] lock_acquire+0x89/0xc0
>  [<c020ba20>] ? mmap+0x40/0xa0
>  [<c0d6f2ab>] __mutex_lock_common+0xab/0x3c0
>  [<c020ba20>] ? mmap+0x40/0xa0
>  [<c0d6f698>] mutex_lock_nested+0x38/0x50
>  [<c020ba20>] ? mmap+0x40/0xa0
>  [<c020ba20>] mmap+0x40/0xa0
>  [<c01b111e>] mmap_region+0x14e/0x450
>  [<c01afb88>] ? arch_get_unmapped_area_topdown+0xf8/0x160
>  [<c01b170f>] do_mmap_pgoff+0x2ef/0x310
>  [<c0125a3d>] sys_mmap2+0xad/0xc0
>  [<c012146f>] sysenter_do_call+0x12/0x43
>  [<c0120000>] ? __switch_to+0x130/0x220
>  =======================
> evbug.c: Event. Dev: input3, Type: 20, Code: 0, Value: 500
> warning: `sudo' uses deprecated v2 capabilities in a way that may be insecure.
>
> i've attached the config.
>
> at first sight it looks like a genuine bug in fs/sysfs/bin.c?

Yes, it is a real bug by the looks. bin.c takes bb->mutex under mmap_sem
when it is mmapped, and then does its copy_*_user under bb->mutex too.

Here is a basic fix for the sysfs lock order reversal (LOR).


From: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-16 09:24:50 -07:00
Neil Brown f1282c844e sysfs: Support sysfs_notify from atomic context with new sysfs_notify_dirent
sysfs_notify currently takes sysfs_mutex.
This means that it cannot be called in atomic context.
sysfs_mutex  is sometimes held over a malloc (sysfs_rename_dir)
so it can block on low memory.

In md I want to be able to notify on a sysfs attribute from
atomic context, and I don't want to block on low memory because I
could be in the writeout path for freeing memory.

So:
 - export the "sysfs_dirent" structure along with sysfs_get, sysfs_put
   and sysfs_get_dirent so I can get the sysfs_dirent that I want to
   notify on and hold it in an md structure.
 - split sysfs_notify_dirent out of sysfs_notify so the sysfs_dirent
   can be notified on with no blocking (just a spinlock).

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-16 09:24:47 -07:00
Greg Kroah-Hartman a9b12619f7 device create: misc: convert device_create_drvdata to device_create
Now that device_create() has been audited, rename things back to the
original call to be sane.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-16 09:24:43 -07:00
Andrew Morton ae87221d3c sysfs: crash debugging
Print the name of the last-accessed sysfs file when we oops, to help track
down oopses which occur in sysfs store/read handlers, because these oopses
tend not to leave any trace of the offending code in the stack traces.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-16 09:24:41 -07:00
Johannes Berg 5f4123be3c remove CONFIG_KMOD from fs
Just always compile the code when the kernel is modular.
Convert load_nls to use try_then_request_module to tidy
up the code.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-10-17 02:38:36 +11:00
Thomas Gleixner 2be3b52a57 proc: fixup irq iterator
There is no need for irq_desc here. Even for sparse_irq we can
handle this cleverly in for_each_irq_nr().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-16 16:53:30 +02:00
Thomas Gleixner 2cc21ef843 genirq: remove sparse irq code
This code is not ready, but we need to rip it out instead of rebasing
as we would lose the APIC/IO_APIC unification otherwise.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-10-16 16:53:15 +02:00
Yinghai Lu 6d50bc2683 x86: use 28 bits irq NR for pci msi/msix and ht
Also print the irq number in /proc/interrupts and /proc/stat in hex, so that
bus/dev/func can be told from it.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-16 16:52:52 +02:00
Yinghai Lu 52b17329d6 x86_64: make /proc/interrupts work with dyn irq_desc
loop with irq_desc list

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-16 16:52:51 +02:00
Yinghai Lu c7fb03a475 irq, fs/proc: replace loop with nr_irqs for proc/stat
Replace another nr_irqs loop to avoid the allocation of all sparse
irq entries - use for_each_irq_desc instead.

v2: make sure arch without GENERIC_HARDIRQS works too

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-16 16:52:33 +02:00
Yinghai Lu 7f95ec9e4c x86: move kstat_irqs from kstat to irq_desc
based on Eric's patch ...

mold it together with dyn_array for irq_desc; will allocate kstat_irqs for
nr_irq_desc all together if needed. -- at that point nr_cpus is known already.

v2: make sure systems without generic_hardirqs work; they don't have irq_desc
v3: fix merging
v4: [mingo@elte.hu] fix typo

[ mingo@elte.hu ] irq: build fix

fix:

 arch/x86/xen/spinlock.c: In function 'xen_spin_lock_slow':
 arch/x86/xen/spinlock.c:90: error: 'struct kernel_stat' has no member named 'irqs'

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-16 16:52:32 +02:00
Yinghai Lu da27c118eb fs/proc: use nr_irqs
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-16 16:52:07 +02:00
Aneesh Kumar K.V 22208dedbd ext4: Fix file fragmentation during large file write.
The range_cyclic writeback mode uses the address_space writeback_index
as the start index for writeback.  With delayed allocation we were
updating writeback_index wrongly resulting in highly fragmented file.
This patch reduces the number of extents from 4000 to 27 for a
3GB file.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-16 10:10:36 -04:00
Tejun Heo a7c1b990f7 fuse: implement nonseekable open
Let the client request nonseekable open using FOPEN_NONSEEKABLE and
call nonseekable_open() on the file if requested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2008-10-16 16:08:57 +02:00
Tejun Heo 29d434b39c fuse: add include protectors
Add include protectors to include/linux/fuse.h and fs/fuse/fuse_i.h.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2008-10-16 16:08:57 +02:00
Robert P. J. Day 37194d0723 fuse: config description improvement
Make the short description of the FUSE_FS config option clearer.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2008-10-16 16:08:57 +02:00
Julia Lawall 17e18ab6ff fuse: add missing fuse_request_free
The error handling code for the second call to fuse_request_alloc should
include freeing the result of the first one.

This bug was found by the Coccinelle project:

  http://www.emn.fr/x-info/coccinelle/

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2008-10-16 16:08:56 +02:00
Miklos Szeredi 769415c611 fuse: fix SEEK_END incorrectness
Update file size before using it in lseek(..., SEEK_END).

Reported-by: Amnon Shiloh <u3557@miso.sublimeip.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2008-10-16 16:08:56 +02:00
Martin Schwidefsky 0b59268285 [PATCH] remove unused ibcs2/PER_SVR4 in SET_PERSONALITY
The SET_PERSONALITY macro is always called with a second argument of 0.
Remove the ibcs argument and the various tests to set the PER_SVR4
personality.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2008-10-16 15:40:05 +02:00
Trond Myklebust 6925bac120 Merge branch 'next' 2008-10-15 15:54:56 -04:00
Christoph Hellwig 6c5e51dae2 xfs: fix remount rw with unrecognized options
When we skip unrecognized options in xfs_fs_remount we should just break
out of the switch and not return because otherwise we may skip clearing
the xfs-internal read-only flag.  This will only show up on some
operations like touch because most read-only checks are done by the VFS
which thinks this filesystem is r/w.  Eventually we should replace the
XFS read-only flag with a helper that always checks the VFS flag to make
sure they can never get out of sync.

Bug reported and fix verified by Marcel Beister on #xfs.
Bug fix verified by updated xfstests/189.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Timothy Shimmin <tes@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-15 10:00:00 -07:00
Mark Fasheh 1efd47f873 ocfs2: fix build error
I merged the latest ocfs2_read_blocks() changes in xattr.c wrong. This makes
Ocfs2 compile again.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-14 18:31:46 -07:00
Linus Torvalds acd15a8360 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (56 commits)
  ocfs2: Make cached block reads the common case.
  ocfs2: Kill the last naked wait_on_buffer() for cached reads.
  ocfs2: Move ocfs2_bread() into dir.c
  ocfs2: Simplify ocfs2_read_block()
  ocfs2: Require an inode for ocfs2_read_block(s)().
  ocfs2: Separate out sync reads from ocfs2_read_blocks()
  ocfs2: Refactor xattr list and remove ocfs2_xattr_handler().
  ocfs2: Calculate EA hash only by its suffix.
  ocfs2: Move trusted and user attribute support into xattr.c
  ocfs2: Uninline ocfs2_xattr_name_hash()
  ocfs2: Don't check for NULL before brelse()
  ocfs2: use smaller counters in ocfs2_remove_xattr_clusters_from_cache
  ocfs2: Documentation update for user_xattr / nouser_xattr mount options
  ocfs2: make la_debug_mutex static
  ocfs2: Remove pointless !!
  ocfs2: Add empty bucket support in xattr.
  ocfs2/xattr.c: Fix a bug when inserting xattr.
  ocfs2: Add xattr mount option in ocfs2_show_options()
  ocfs2: Switch over to JBD2.
  ocfs2: Add the 'inode64' mount option.
  ...
2008-10-14 16:34:11 -07:00
Trond Myklebust 011935a0a7 NFS: Fix a resolution problem with nfs_inode->cache_change_attribute
The cache_change_attribute is used to decide whether or not a directory has
changed, in which case we may need to look it up again. Again, the use of
'jiffies' leads to an issue of resolution.

Once again, the fix is to change nfs_inode->cache_change_attribute, and
just make it a simple counter.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-14 19:24:50 -04:00
Trond Myklebust 4704f0e274 NFS: Fix the resolution problem with nfs_inode_attrs_need_update()
It appears that 'jiffies' timestamps do not have high enough resolution for
nfs_inode_attrs_need_update(). One problem is that a GETATTR can be
launched within < 1 jiffy of the last operation that updated the attribute.
Another problem is that RPC calls can take < 1 jiffy to execute.

We can fix this by switching the variables to use a simple global counter
that gets incremented every time we start another GETATTR call.
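
To make the idea concrete: two events that fall within the same jiffy get
identical timestamps and cannot be ordered, whereas a counter hands out a
distinct value per event.  The following is only a user-space sketch using
C11 atomics; the names attr_generation, next_attr_generation() and
generation_after() are assumptions for this example, not the NFS client's
actual symbols.

```c
#include <stdatomic.h>

/* Hypothetical global counter, zero-initialized. */
static atomic_ulong attr_generation;

/* Hand out a fresh, strictly increasing generation number. */
unsigned long next_attr_generation(void)
{
	return atomic_fetch_add(&attr_generation, 1UL) + 1UL;
}

/*
 * Nonzero if generation a was issued after generation b.  The signed
 * difference tolerates counter wraparound, in the same spirit as the
 * kernel's time_after(); it assumes the usual two's complement
 * conversion from unsigned long to long.
 */
int generation_after(unsigned long a, unsigned long b)
{
	return (long)(a - b) > 0;
}
```

A caller would record next_attr_generation() when it starts a request and
later accept the reply only if its generation is after the one stored with
the cached attributes.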

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-14 19:23:17 -04:00
Trond Myklebust 921615f111 NFS: Changes to inode->i_nlinks must set the NFS_INO_INVALID_ATTR flag
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-10-14 19:23:07 -04:00
Linus Torvalds 8acd3a60bc Merge branch 'for-2.6.28' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.28' of git://linux-nfs.org/~bfields/linux: (59 commits)
  svcrdma: Fix IRD/ORD polarity
  svcrdma: Update svc_rdma_send_error to use DMA LKEY
  svcrdma: Modify the RPC reply path to use FRMR when available
  svcrdma: Modify the RPC recv path to use FRMR when available
  svcrdma: Add support to svc_rdma_send to handle chained WR
  svcrdma: Modify post recv path to use local dma key
  svcrdma: Add a service to register a Fast Reg MR with the device
  svcrdma: Query device for Fast Reg support during connection setup
  svcrdma: Add FRMR get/put services
  NLM: Remove unused argument from svc_addsock() function
  NLM: Remove "proto" argument from lockd_up()
  NLM: Always start both UDP and TCP listeners
  lockd: Remove unused fields in the nlm_reboot structure
  lockd: Add helper to sanity check incoming NOTIFY requests
  lockd: change nlmclnt_grant() to take a "struct sockaddr *"
  lockd: Adjust nlmsvc_lookup_host() to accomodate AF_INET6 addresses
  lockd: Adjust nlmclnt_lookup_host() signature to accomodate non-AF_INET
  lockd: Support non-AF_INET addresses in nlm_lookup_host()
  NLM: Convert nlm_lookup_host() to use a single argument
  svcrdma: Add Fast Reg MR Data Types
  ...
2008-10-14 12:31:14 -07:00
Joel Becker d4a8c93c82 ocfs2: Make cached block reads the common case.
ocfs2_read_blocks() currently requires the CACHED flag for cached I/O.
However, that's the common case.  Let's flip it around and provide an
IGNORE_CACHE flag for the special users.  This has the added benefit of
cleaning up the code some (ignore_cache takes on its special meaning
earlier in the loop).

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-14 11:58:22 -07:00
Joel Becker 5e0b3dec01 ocfs2: Kill the last naked wait_on_buffer() for cached reads.
ocfs2's cached buffer I/O goes through ocfs2_read_block(s)().  dir.c had
a naked wait_on_buffer() to wait for some readahead, but it should
use ocfs2_read_block() instead.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-14 11:58:11 -07:00
Joel Becker 07446dc72c ocfs2: Move ocfs2_bread() into dir.c
dir.c is the only place using ocfs2_bread(), so let's make it static to
that file.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-14 11:58:03 -07:00
Joel Becker 0fcaa56a2a ocfs2: Simplify ocfs2_read_block()
More than 30 callers of ocfs2_read_block() pass exactly OCFS2_BH_CACHED.
Only six pass a different flag set.  Rather than have every caller care,
let's make ocfs2_read_block() take no flags and always do a cached read.
The remaining six places can call ocfs2_read_blocks() directly.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-14 11:51:57 -07:00
Joel Becker 31d33073ca ocfs2: Require an inode for ocfs2_read_block(s)().
Now that synchronous readers are using ocfs2_read_blocks_sync(), all
callers of ocfs2_read_blocks() are passing an inode.  Use it
unconditionally.  Since it's there, we don't need to pass the
ocfs2_super either.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-14 11:43:29 -07:00
Joel Becker da1e90985a ocfs2: Separate out sync reads from ocfs2_read_blocks()
The ocfs2_read_blocks() function currently handles sync reads, cached
reads, and sometimes-cached reads.  We're going to add some
functionality to it, so first we should simplify it.  The uncached,
synchronous reads are much easier to handle as a separate function, so we
introduce ocfs2_read_blocks_sync().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-14 11:29:10 -07:00
Aneesh Kumar K.V af6f029d38 ext4: Use tag dirty lookup during mpage_da_submit_io
This enables us to drop the range_cont writeback mode
use from ext4_da_writepages.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-10-14 09:20:19 -04:00
Theodore Ts'o 8a0aba733d ext4: let the block device know when unused blocks can be discarded
Let the block device know when unused blocks can be discarded, using
the new sb_issue_discard() interface.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-16 10:06:27 -04:00
Tao Ma 936b883436 ocfs2: Refactor xattr list and remove ocfs2_xattr_handler().
According to Christoph Hellwig's advice, we really don't need
a ->list to handle one xattr's list. Just a map from index to
xattr prefix is enough. I also refactored the old list method,
using fs/xfs/linux-2.6/xfs_xattr.c and the xattr list method in
btrfs as references.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:45 -07:00
Tao Ma 2057e5c678 ocfs2: Calculate EA hash only by its suffix.
According to Christoph Hellwig's advice, the hash value of EA
is only calculated by its suffix.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:44 -07:00
Mark Fasheh 99219aea68 ocfs2: Move trusted and user attribute support into xattr.c
Per Christoph Hellwig's suggestion - don't split these up. It's not like we
gained much by having the two tiny files around.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:44 -07:00
Mark Fasheh 40daa16a34 ocfs2: Uninline ocfs2_xattr_name_hash()
This is too big to be inlined.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:44 -07:00
Mark Fasheh a81cb88b64 ocfs2: Don't check for NULL before brelse()
This is pointless as brelse() already does the check.

Signed-off-by: Mark Fasheh
2008-10-13 17:02:44 -07:00
Mark Fasheh fd8351f83d ocfs2: use smaller counters in ocfs2_remove_xattr_clusters_from_cache
i and b_len don't really need to be u64's. Xattr extent lengths should be
limited by the VFS, and then the size of our on-disk length field.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:44 -07:00
Mark Fasheh 4cc8124584 ocfs2: make la_debug_mutex static
It can also be moved into ocfs2_la_debug_read().

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:44 -07:00
Mark Fasheh 009d37502a ocfs2: Remove pointless !!
ocfs2_stack_supports_plocks() doesn't need this to properly return a zero or
one value.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:44 -07:00
Tao Ma 5a09561199 ocfs2: Add empty bucket support in xattr.
As Mark mentioned, removing an empty xattr bucket may be
time-consuming, so this patch lets empty buckets exist
during xattr operations. The modification includes:
1. Remove the bucket and extent record deletion during
   xattr delete.
2. In xattr set:
   1) Don't clean the last entry so that if the bucket is empty,
      the hash value of the bucket is the hash value of the entry
      which is deleted last.
   2) During insert, if we meet with an empty bucket, just use the
      1st entry.
3. In binary search of xattr bucket, use the bucket hash value (which
   is stored in the 1st xattr entry) to find the right place.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:43 -07:00
Tao Ma 06b240d8af ocfs2/xattr.c: Fix a bug when inserting xattr.
During xattr insertion, we use binary search
to find the right place and "low" is set to it. But when
there is an xattr which has the same name hash as the inserted
one, low has the wrong value. So set it to the right position.
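
The sketch below is not the code from xattr.c.  It shows one common way to
get a well-defined insert position when equal hashes are present: a
lower-bound binary search over a sorted array, which returns the index of
the first entry whose hash is not smaller than the new one.  The function
name xattr_insert_pos() and the flat array of hashes are assumptions made
for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index at which name_hash should be inserted into the sorted
 * array hashes[0..count) so that it stays sorted.  With duplicates, the
 * result is the index of the first equal entry.
 */
size_t xattr_insert_pos(const uint32_t *hashes, size_t count,
			uint32_t name_hash)
{
	size_t low = 0, high = count;

	assert(hashes != NULL || count == 0);
	while (low < high) {
		size_t mid = low + (high - low) / 2;

		if (hashes[mid] < name_hash)
			low = mid + 1;	/* insert point is right of mid */
		else
			high = mid;	/* mid itself is still a candidate */
	}
	return low;
}
```

Because the loop never exits early on an equal hash, "low" ends up at a
valid position regardless of how many entries share the same hash.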

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:43 -07:00
Sunil Mushran b0f73cfc36 ocfs2: Add xattr mount option in ocfs2_show_options()
Patch adds check for [no]user_xattr in ocfs2_show_options() that completes
the list of all mount options.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:43 -07:00
Joel Becker 2b4e30fbde ocfs2: Switch over to JBD2.
ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
limiting our maximum filesystem size.

It's a pretty trivial change.  Most functions are just renamed.  The
only functional change is moving to Jan's inode-based ordered data mode.
It's better, too.

Because JBD2 reads and writes JBD journals, this is compatible with any
existing filesystem.  It can even interact with JBD-based ocfs2 as long
as the journal is formatted for JBD.

We provide a compatibility option so that paranoid people can still use
JBD for the time being.  This will go away shortly.

[ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
  ocfs2_truncate_for_delete(). --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:43 -07:00
Joel Becker 12462f1d9f ocfs2: Add the 'inode64' mount option.
Now that ocfs2 limits inode numbers to 32bits, add a mount option to
disable the limit.  This parallels XFS.  64bit systems can handle the
larger inode numbers.

[ Added description of inode64 mount option in ocfs2.txt. --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:08 -07:00
Joel Becker 1187c96885 ocfs2: Limit inode allocation to 32bits.
ocfs2 inode numbers are block numbers.  For any filesystem with less
than 2^32 blocks, this is not a problem.  However, when ocfs2 starts
using JBD2, it will be able to support filesystems with more than 2^32
blocks.  This would result in inode numbers higher than 2^32.

The problem is that stat(2) can't handle those numbers on 32bit
machines.  The simple solution is to have ocfs2 allocate all inodes
below that boundary.

The suballoc code is changed to honor an optional block limit.  Only the
inode suballocator sets that limit - all other allocations stay unlimited.

The biggest trick is to grow the inode suballocator beneath that limit.
There's no point in allocating block groups that are above the limit,
then rejecting their elements later on.  We want to prevent the inode
allocator from ever having block groups above the limit.  This involves
a little gyration with the local alloc code.  If the local alloc window
is above the limit, it signals the caller to try the global bitmap but
does not disable the local alloc file (which can be used for other
allocations).
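
The optional block limit can be pictured with a small check like the one
below.  This is only a sketch under stated assumptions: the name
blocks_within_limit(), the convention that a limit of 0 means "unlimited",
and the use of UINT32_MAX as the 32bit boundary are all chosen for
illustration and are not taken from the suballoc code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if every block in [start, start + count) is <= max_block,
 * 0 otherwise.  max_block == 0 is assumed to mean "no limit".
 */
int blocks_within_limit(uint64_t start, uint64_t count, uint64_t max_block)
{
	if (max_block == 0)
		return 1;	/* unlimited allocation */
	if (count == 0)
		return 1;	/* an empty range trivially fits */
	if (start > max_block)
		return 0;

	/* start <= max_block here, so the subtraction cannot underflow. */
	return count - 1 <= max_block - start;
}
```

An inode allocation would pass UINT32_MAX as max_block, while other
allocations would pass 0 and stay unlimited.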

[ Minor cleanup - removed an ML_NOTICE comment. --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:07 -07:00
Tao Ma 08413899db ocfs2: Resolve deadlock in ocfs2_xattr_free_block.
In ocfs2_xattr_free_block, we take a cluster lock on xb_alloc_inode while we
have a transaction open. This will deadlock the downconvert thread, so fix
it.

We can clean up how xattr blocks are removed while here - this patch also
moves the mechanism of releasing xattr block (including both value, xattr
tree and xattr block) into this function.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:06 -07:00
Tao Ma 28b8ca0b7f ocfs2: bug-fix for journal extend in xattr.
In ocfs2_extend_trans, when we can't extend the current
transaction, it commits the current transaction and restarts
a new one. So if the credits we previously allocated haven't been
used (the block isn't dirtied before our extend), we will not
have enough credits for any future operation (this causes jbd to
complain and bug out). So check this and re-extend it.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:06 -07:00
Joel Becker 8d6220d6a7 ocfs2: Change ocfs2_get_*_extent_tree() to ocfs2_init_*_extent_tree()
The original get/put_extent_tree() functions held a reference on
et_root_bh.  However, every single caller already has a safe reference,
making the get/put cycle irrelevant.

We change ocfs2_get_*_extent_tree() to ocfs2_init_*_extent_tree().  It
no longer gets a reference on et_root_bh.  ocfs2_put_extent_tree() is
removed.  Callers now have a simpler init+use pattern.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:05 -07:00
Joel Becker 1625f8ac15 ocfs2: Comment struct ocfs2_extent_tree_operations.
struct ocfs2_extent_tree_operations provides methods for the different
on-disk btrees in ocfs2.  Describing what those methods do is probably a
good idea.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:05 -07:00
Joel Becker f99b9b7ccf ocfs2: Make ocfs2_extent_tree the first-class representation of a tree.
We now have three different kinds of extent trees in ocfs2: inode data
(dinode), extended attributes (xattr_tree), and extended attribute
values (xattr_value).  There is a nice abstraction for them,
ocfs2_extent_tree, but it is hidden in alloc.c.  All the calling
functions have to pick amongst a varied API and pass in type bits and
often extraneous pointers.

A better way is to make ocfs2_extent_tree a first-class object.
Everyone converts their object to an ocfs2_extent_tree() via the
ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all
tree calls to alloc.c.

This simplifies a lot of callers, making for readability.  It also
provides an easy way to add additional extent tree types, as they only
need to be defined in alloc.c with a ocfs2_get_<new>_extent_tree()
function.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:05 -07:00
Joel Becker 1e61ee79e2 ocfs2: Add an insertion check to ocfs2_extent_tree_operations.
A couple places check an extent_tree for a valid inode.  We move that
out to add an eo_insert_check() operation.  It can be called from
ocfs2_insert_extent() and elsewhere.

We also have the wrapper calls ocfs2_et_insert_check() and
ocfs2_et_sanity_check() ignore NULL ops.  That way we don't have to
provide useless operations for xattr types.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:05 -07:00
Joel Becker 1a09f556e5 ocfs2: Create specific get_extent_tree functions.
A caller knows what kind of extent tree they have.  There's no reason
they have to call ocfs2_get_extent_tree() with a NULL when they could
just as easily call a specific function to their type of extent tree.

Introduce ocfs2_dinode_get_extent_tree(),
ocfs2_xattr_tree_get_extent_tree(), and
ocfs2_xattr_value_get_extent_tree().  They only take the necessary
arguments, calling into the underlying __ocfs2_get_extent_tree() to do
the real work.

__ocfs2_get_extent_tree() is the old ocfs2_get_extent_tree(), but
without needing any switch-by-type logic.

ocfs2_get_extent_tree() is now a wrapper around the specific calls.  It
exists because a couple alloc.c functions can take et_type.  This wrapper
will go away later.

Another benefit is that ocfs2_xattr_value_get_extent_tree() can take a
struct ocfs2_xattr_value_root* instead of void*.  This gives us
typechecking where we didn't have it before.
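
A minimal sketch of the typed-wrapper idea, assuming simplified stand-in types: `struct dinode`, `struct xattr_value_root`, `tree_init()` and the two front ends are hypothetical names chosen for illustration, not the functions in alloc.c.

```c
#include <assert.h>
#include <stddef.h>

enum tree_type { TREE_DINODE, TREE_XATTR_VALUE };

/* Hypothetical root objects for two kinds of tree. */
struct dinode { int clusters; };
struct xattr_value_root { int clusters; };

struct tree {
	enum tree_type type;
	void *object;
};

/* Untyped worker: accepts any pointer, so the compiler cannot check it. */
static void tree_init(struct tree *t, enum tree_type type, void *object)
{
	assert(t != NULL && object != NULL);
	t->type = type;
	t->object = object;
}

/* Typed front ends: passing the wrong struct pointer draws a compiler
 * diagnostic instead of going unnoticed until run time. */
static void dinode_tree_init(struct tree *t, struct dinode *di)
{
	tree_init(t, TREE_DINODE, di);
}

static void xattr_value_tree_init(struct tree *t, struct xattr_value_root *xv)
{
	tree_init(t, TREE_XATTR_VALUE, xv);
}
```

Each front end also fixes the type tag, so callers no longer pass a type argument that could disagree with the object they hand in.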

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:05 -07:00
Joel Becker 943cced39e ocfs2: Determine an extent tree's max_leaf_clusters in an et_op.
Provide an optional extent_tree_operation to specify the
max_leaf_clusters of an ocfs2_extent_tree.  If not provided, the value
is 0 (unlimited).

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:04 -07:00
Joel Becker 1c25d93a4a ocfs2: Use struct ocfs2_extent_tree in ocfs2_num_free_extents().
ocfs2_num_free_extents() re-implements the logic of
ocfs2_get_extent_tree().  Now that ocfs2_get_extent_tree() does not
allocate, let's use it in ocfs2_num_free_extents() to simplify the code.

The inode validation code in ocfs2_num_free_extents() is not needed.
All callers are passing in pre-validated inodes.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:04 -07:00
Joel Becker 0ce1010f1a ocfs2: Provide the get_root_el() method to ocfs2_extent_tree_operations.
The root_el of an ocfs2_extent_tree needs to be calculated from
et->et_object.  Make it an operation on et->et_ops.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:04 -07:00
Joel Becker ea5efa1512 ocfs2: Make 'private' into 'object' on ocfs2_extent_tree.
The 'private' pointer was a way to store off xattr values, which don't
live at a set place in the bh.  But the concept of "the object
containing the extent tree" is much more generic.  For an inode it's the
struct ocfs2_dinode, for an xattr value it's the value.  Let's save off
the 'object' at all times.  If NULL is passed to
ocfs2_get_extent_tree(), 'object' is set to bh->b_data.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:04 -07:00
Joel Becker dc0ce61af4 ocfs2: Make ocfs2_extent_tree get/put instead of alloc.
Rather than allocating a struct ocfs2_extent_tree, just put it on the
stack.  Fill it with ocfs2_get_extent_tree() and drop it with
ocfs2_put_extent_tree().  Now the callers don't have to handle ENOMEM, yet
they still safely hold a reference on the root_bh.
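
The get/put pattern on a stack object can be sketched as follows. This is a userspace illustration only: `struct buf`, `struct tree_handle`, `tree_get()` and `tree_put()` are hypothetical stand-ins for the buffer head and the extent tree helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted buffer head. */
struct buf {
	int refcount;
};

/* Hypothetical stand-in for the extent tree handle. */
struct tree_handle {
	struct buf *root;
};

/* Fill a caller-provided handle and take a reference on the root buffer.
 * Nothing is allocated, so there is no out-of-memory path to handle. */
static void tree_get(struct tree_handle *t, struct buf *root)
{
	assert(t != NULL && root != NULL);
	root->refcount++;
	t->root = root;
}

/* Drop the reference taken by tree_get(). */
static void tree_put(struct tree_handle *t)
{
	assert(t != NULL && t->root != NULL);
	t->root->refcount--;
	t->root = NULL;
}

/* Example caller: the handle lives on the stack for the call's duration.
 * Returns the reference count observed while the handle was held. */
static int use_tree(struct buf *root)
{
	struct tree_handle t;
	int held;

	tree_get(&t, root);
	held = root->refcount;
	tree_put(&t);
	return held;
}
```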

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:04 -07:00
Joel Becker ce1d9ea621 ocfs2: Prefix the ocfs2_extent_tree structure.
The members of the ocfs2_extent_tree structure gain a prefix of 'et_'.
All users are updated.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:04 -07:00
Joel Becker 35dc0aa3c5 ocfs2: Prefix the extent tree operations structure.
The ocfs2_extent_tree_operations structure gains a field prefix on its
members.  The ->eo_sanity_check() operation gains a wrapper function for
completeness.  All of the extent tree operation wrappers gain a
consistent name (ocfs2_et_*()).

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:04 -07:00
Mark Fasheh ff1ec20ef6 ocfs2: fix printk format warnings
This patch fixes the following build warnings:

fs/ocfs2/xattr.c: In function 'ocfs2_half_xattr_bucket':
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int'
fs/ocfs2/xattr.c: In function 'ocfs2_xattr_set_entry_in_bucket':
fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t'
fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t'
fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t'

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:03 -07:00
Tiger Yang 8154da3d21 ocfs2: Add incompatible flag for extended attribute
This patch adds the s_incompat flag for extended attribute support. This
helps us ensure that older versions of Ocfs2 or ocfs2-tools will not be able
to mount a volume with xattr support.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:03 -07:00
Tao Ma a394425643 ocfs2: Delete all xattr buckets during inode removal
In inode removal, we need to iterate all the buckets, remove any
externally-stored EA values and delete the xattr buckets.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:03 -07:00
Tao Ma 012255961c ocfs2: Enable xattr set in index btree
Where the previous patches added the ability to list/get xattrs in buckets
for ocfs2, this patch enables ocfs2 to store large numbers of EAs.

The original design doc was written by Mark Fasheh, and it can be found at
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/IndexedEATrees. I only had to
make small modifications to it.

First, because the bucket size is 4K, a new field named xh_free_start is added
in ocfs2_xattr_header to indicate the next valid name/value offset in a bucket.
It is used when we store new EA name/value. With this field, we can find the
place more quickly and what's more, we don't need to sort the name/value every
time to let the last entry indicate the next unused space. This makes the
insert operation more efficient for blocksizes smaller than 4k.

Because of the new xh_free_start, another field named xh_name_value_len is
also added in ocfs2_xattr_header. It records the total length of all the
name/values in the bucket. We need this so that we can check it and defragment
the bucket if there is not enough contiguous free space.

An xattr insertion looks like this:
1. xattr_index_block_find: find the right bucket by the name_hash, say bucketA.
2. check whether there is enough space in bucketA. If yes, insert it directly
   and modify xh_free_start and xh_name_value_len accordingly. If not, check
   xh_name_value_len to see whether we can store this by defragmenting the bucket.
   If yes, defragment it and go on with the insertion.
3. If defragmentation doesn't work, check whether there is a new empty bucket in
   the clusters within this extent record. If yes, init the new bucket and move
   all the buckets after bucketA one by one to the next bucket. Move half of the
   entries in bucketA to the next bucket and go on insertion.
4. If there is no new bucket, grow the extent tree.
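
The four insertion steps above can be sketched as a decision function. This is a simplified model under stated assumptions: `struct bucket_state` and its fields are hypothetical summaries of what xh_free_start and xh_name_value_len let the code compute, not the on-disk header.

```c
#include <assert.h>
#include <stddef.h>

enum insert_action {
	INSERT_DIRECT,		/* step 2: enough contiguous space already */
	INSERT_AFTER_DEFRAG,	/* step 2: enough space once defragmented */
	INSERT_AFTER_SPLIT,	/* step 3: use a spare bucket in the extent */
	INSERT_AFTER_GROW	/* step 4: grow the extent tree */
};

/* Hypothetical summary of one bucket's space accounting. */
struct bucket_state {
	size_t capacity;	/* bytes usable for name/value data */
	size_t contiguous_free;	/* free bytes at the spot xh_free_start marks */
	size_t name_value_len;	/* live name/value bytes, like xh_name_value_len */
	int has_spare_bucket;	/* nonzero if the extent record has an empty bucket */
};

static enum insert_action choose_insert_action(const struct bucket_state *b,
					       size_t need)
{
	assert(b != NULL && b->name_value_len <= b->capacity);
	if (need <= b->contiguous_free)
		return INSERT_DIRECT;
	if (need <= b->capacity - b->name_value_len)
		return INSERT_AFTER_DEFRAG;
	if (b->has_spare_bucket)
		return INSERT_AFTER_SPLIT;
	return INSERT_AFTER_GROW;
}
```

The second test is why the total length is tracked separately: it tells the code whether defragmenting can help without scanning every entry.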

As for xattr deletion, we will delete an xattr bucket when all its xattrs
are removed and move all the buckets after it to the previous one. When all
the xattr buckets in an extent record are freed, free this extent record
from ocfs2_xattr_tree.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:03 -07:00
Tao Ma ca12b7c489 ocfs2: Optionally limit extent size in ocfs2_insert_extent()
In xattr bucket, we want to limit the maximum size of a btree leaf,
otherwise we'll lose the benefits of hashing because we'll have to search
large leaves.

So add a new field in ocfs2_extent_tree which indicates the maximum leaf cluster
size we want so that we can prevent ocfs2_insert_extent() from merging the leaf
record even if it is contiguous with an adjacent record.

Other btree types are not affected by this change.
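
A minimal sketch of a merge check with a size cap, assuming a simplified record type. `struct extent_rec` and `can_merge_right()` are hypothetical names, and treating a cap of 0 as "no limit" is an assumption of this sketch rather than something this commit states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent record: a run of clusters at a logical offset. */
struct extent_rec {
	unsigned int cpos;	/* first logical cluster */
	unsigned int clusters;	/* length of the run */
};

/* Decide whether 'add' may be merged onto the right edge of 'rec'.
 * max_leaf_clusters == 0 is treated as "no limit" in this sketch. */
static int can_merge_right(const struct extent_rec *rec,
			   const struct extent_rec *add,
			   unsigned int max_leaf_clusters)
{
	assert(rec != NULL && add != NULL);
	if (rec->cpos + rec->clusters != add->cpos)
		return 0;	/* not contiguous */
	if (max_leaf_clusters != 0 &&
	    rec->clusters + add->clusters > max_leaf_clusters)
		return 0;	/* merging would exceed the cap */
	return 1;
}
```

Trees that leave the cap at 0 behave exactly as before, which matches the note that other btree types are unaffected.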

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:03 -07:00
Tao Ma 589dc2602f ocfs2: Add xattr lookup code xattr btrees
Add code to lookup a given extended attribute in the xattr btree. Lookup
follows this general scheme:

1. Use ocfs2_xattr_get_rec to find the xattr extent record

2. Find the xattr bucket within the extent which may contain this xattr

3. Iterate the bucket to find the xattr. In ocfs2_xattr_block_get(), we need
   to recalculate the block offset and name offset for the right position of
   name/value.
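
Step 2 of the scheme above can be sketched as a search over per-bucket hash values. This assumes, for illustration only, that buckets within an extent are ordered by the lowest hash they hold. `find_bucket()` and its array argument are hypothetical and do not reproduce the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the last bucket whose lowest hash is <= name_hash.
 * first_hash[] holds each bucket's lowest hash in ascending order. If every
 * bucket starts above name_hash, bucket 0 is returned. */
static size_t find_bucket(const unsigned int *first_hash, size_t n,
			  unsigned int name_hash)
{
	size_t lo = 0, hi = n;

	assert(first_hash != NULL && n > 0);
	/* Invariant: the answer lies in the half-open range [lo, hi). */
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (first_hash[mid] <= name_hash)
			lo = mid;
		else
			hi = mid;
	}
	return lo;
}
```

Once the bucket is chosen, step 3 only has to walk the entries of that one bucket.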

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:03 -07:00
Tao Ma 0c044f0b24 ocfs2: Add xattr bucket iteration for large numbers of EAs
Ocfs2 breaks up xattr index tree leaves into 4k regions, called buckets.
Attributes are stored within a given bucket, depending on hash value.

After a discussion with Mark, we decided that the per-bucket index
(xe_entry[]) would only exist in the 1st block of a bucket. Likewise,
name/value pairs will not straddle more than one block. This allows the
majority of operations to work directly on the buffer heads in a leaf block.

This patch adds code to iterate the buckets in an EA. A new abstraction,
ocfs2_xattr_bucket, is added. It records the bhs in this bucket and
ocfs2_xattr_header. This keeps the code neat, improving readability.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:03 -07:00
Tao Ma ba492615f0 ocfs2: Add xattr index tree operations
When necessary, an ocfs2_xattr_block will embed an ocfs2_extent_list to
store large numbers of EAs. This patch adds a new type in
ocfs2_extent_tree_type and adds the implementation so that we can re-use the
b-tree code to handle the storage of many EAs.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:02 -07:00
Tiger Yang cf1d6c763f ocfs2: Add extended attribute support
This patch implements storing extended attributes either in the inode or in a
single external block. We only store EAs in-inode when blocksize > 512 or that
inode block has free space for it. When an EA's value is larger than 80
bytes, we will store the value via b-tree outside inode or block.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:02 -07:00
Tiger Yang fdd77704a8 ocfs2: reserve inline space for extended attribute
Add the structures and helper functions we want for handling inline extended
attributes. We also update the inline-data handlers so that they properly
function in the event that we have both inline data and inline attributes
sharing an inode block.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:02 -07:00
Tao Ma f56654c435 ocfs2: Add extent tree operation for xattr value btrees
Add some thin wrappers around ocfs2_insert_extent() for each of the 3
different btree types, ocfs2_dinode_insert_extent(),
ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The
last is for the xattr index btree, which will be used in a followup patch.

All the old callers in file.c etc will call ocfs2_dinode_insert_extent(),
while the other two handle the xattr issue. Initialization of the extent
tree is also handled by these functions.

When storing xattr value which is too large, we will allocate some clusters
for it and here ocfs2_extent_list and ocfs2_extent_rec will also be used. In
order to re-use the b-tree operation code, a new parameter named "private"
is added into ocfs2_extent_tree and it is used to indicate the root of
ocfs2_extent_list. The reason is that we can't deduce the root from the
buffer_head now. It may be in an inode, an ocfs2_xattr_block or even worse,
in any place in an ocfs2_xattr_bucket.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:01 -07:00
Tao Ma ac11c82719 ocfs2: Add helper function in uptodate.c for removing xattr clusters
The old uptodate only handles the issue of removing one buffer_head from
ocfs2 inode's buffer cache. With xattr clusters, we may need to remove
multiple buffer_head's at a time.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:59 -07:00
Tao Ma 5a7bc8eb29 ocfs2: Add the basic xattr disk layout in ocfs2_fs.h
Ocfs2 uses a very flexible structure for storing extended attributes on
disk. A small number of attributes are stored directly in the inode block - up
to 256 bytes worth. If that fills up, attributes are also stored in an
external block, linked to from the inode block. That block can in turn
expand to a btree, capable of storing large numbers of attributes.

Individual attribute values are stored inline if they're small enough
(currently about 80 bytes, this can be changed though), and otherwise are
expanded to a btree. The theoretical limit to the size of an individual
attribute is about the same as an inode, though the kernel's upper bound on
the size of an attribute's data is far smaller.
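
The inline-versus-btree choice for a single value can be sketched as below. The threshold constant is an assumption taken from the "about 80 bytes" figure in the text. `INLINE_VALUE_MAX` and `place_value()` are hypothetical names, not the ocfs2_fs.h definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed threshold: the text says "about 80 bytes" and notes it can change. */
#define INLINE_VALUE_MAX 80u

enum value_location { VALUE_INLINE, VALUE_IN_BTREE };

/* Small values stay next to the attribute entry; larger ones get a btree. */
static enum value_location place_value(size_t value_len)
{
	return value_len <= INLINE_VALUE_MAX ? VALUE_INLINE : VALUE_IN_BTREE;
}
```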

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:59 -07:00
Tao Ma 0eb8d47e69 ocfs2: Make high level btree extend code generic
Factor out the non-inode specifics of ocfs2_do_extend_allocation() into a more generic
function, ocfs2_do_cluster_allocation(). ocfs2_do_extend_allocation calls
ocfs2_do_cluster_allocation() now, but the latter can be used for other
btree types as well.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:59 -07:00
Tao Ma e7d4cb6bc1 ocfs2: Abstract ocfs2_extent_tree in b-tree operations.
In the old extent tree operation, we assume that we
are using the ocfs2_extent_list in ocfs2_dinode as the tree root.
As xattrs will also use ocfs2_extent_list to store a large value
for an xattr entry, we refactor the tree operation so that xattr
can use it directly.

The refactoring includes 4 steps:
1. Abstract set/get of last_eb_blk and update_clusters since they may
   be stored in different location for dinode and xattr.
2. Add a new structure named ocfs2_extent_tree to indicate the
   extent tree the operation will work on.
3. Remove all the use of fe_bh and di, use root_bh and root_el in
   extent tree instead. So now all the fe_bh is replaced with
   et->root_bh, el with root_el accordingly.
4. Make ocfs2_lock_allocators generic. Now it is limited to be only used
   in file extend allocation. But the whole function is useful when we want
   to store large EAs.

Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used
for anything other than truncate inode data btrees.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:58 -07:00
Tao Ma 811f933df1 ocfs2: Use ocfs2_extent_list instead of ocfs2_dinode.
ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and
ocfs2_reserve_new_metadata() are all useful for extent tree operations. But
they are all limited to an inode btree because they use a struct
ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list
(the part of an ocfs2_dinode they actually use) so that the xattr btree code
can use these functions.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:58 -07:00
Tao Ma 231b87d109 ocfs2: Modify ocfs2_num_free_extents for future xattr usage.
ocfs2_num_free_extents() is used to find the number of free extent records
in an inode btree. Hence, it takes an "ocfs2_dinode" parameter. We want to
use this for extended attribute trees in the future, so genericize the
interface to take a buffer head. A future patch will allow that buffer_head
to contain any structure rooting an ocfs2 btree.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:58 -07:00
Mark Fasheh 9a8ff578fb ocfs2: track local alloc state via debugfs
A per-mount debugfs file, "local_alloc" is created which when read will
expose live state of the node's local alloc file. Performance impact is
minimal, only a bit of memory overhead per mount point. Still, the code is
hidden behind CONFIG_OCFS2_FS_STATS. This feature will help us debug
local alloc performance problems on a live system.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:58 -07:00
Mark Fasheh 9c7af40b21 ocfs2: throttle back local alloc when low on disk space
Ocfs2's local allocator disables itself for the duration of a mount point
when it has trouble allocating a large enough area from the primary bitmap.
That can cause performance problems, especially for disks which were only
temporarily full or fragmented. This patch allows for the allocator to
shrink its window first, before being disabled. Later, it can also be
re-enabled so that any performance drop is minimized.

To do this, we allow the value of osb->local_alloc_bits to be shrunk when
needed. The default value is recorded in a mostly read-only variable so that
we can re-initialize when required.

Locking had to be updated so that we could protect changes to
local_alloc_bits. Mostly this involves protecting various local alloc values
with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which
is used when the local allocator has shrunk, but is not disabled. If the
available space dips below 1 megabyte, the local alloc file is disabled. In
either case, local alloc is re-enabled 30 seconds after the event, or when
an appropriate amount of bits is seen in the primary bitmap.
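
The throttling behaviour can be sketched as a small state machine. This is a userspace illustration under stated assumptions: halving the window is a guess at what "shrink" means, the enum and struct names are hypothetical, and the osb spinlock that the real code relies on is omitted.

```c
#include <assert.h>
#include <stddef.h>

enum la_state { LA_ENABLED, LA_THROTTLED, LA_DISABLED };

/* Hypothetical view of the local allocator's window bookkeeping. */
struct local_alloc {
	enum la_state state;
	unsigned int default_bits;	/* window size recorded at mount */
	unsigned int bits;		/* current window size */
	unsigned int bits_per_mb;	/* bitmap bits that cover 1 megabyte */
};

#define LA_REENABLE_SECS 30u

/* Called when a window could not be allocated from the primary bitmap. */
static void la_alloc_failed(struct local_alloc *la, unsigned int available_bits)
{
	assert(la != NULL);
	if (available_bits < la->bits_per_mb) {
		la->state = LA_DISABLED;	/* under 1 MB left: stop using it */
		return;
	}
	la->bits /= 2;				/* assumption: shrink by halving */
	la->state = LA_THROTTLED;
}

/* Called periodically; restores the default window once enough time passed. */
static void la_timer(struct local_alloc *la, unsigned int secs_since_event)
{
	assert(la != NULL);
	if (la->state == LA_ENABLED || secs_since_event < LA_REENABLE_SECS)
		return;
	la->bits = la->default_bits;
	la->state = LA_ENABLED;
}
```

Keeping the default in its own field is what lets the window be restored after a temporary shortage.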

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:57 -07:00
Mark Fasheh ebcee4b5c9 ocfs2: Track local alloc bits internally
Do this instead of tracking absolute local alloc size. This avoids
needless re-calculation of bits from bytes in localalloc.c. Additionally,
the value is now in a more natural unit for internal file system bitmap
work.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:57 -07:00
Mark Fasheh 53da4939f3 ocfs2: POSIX file locks support
This is actually pretty easy since fs/dlm already handles the bulk of the
work. The Ocfs2 userspace cluster stack module already uses fs/dlm as the
underlying lock manager, so I only had to add the right calls.

Cluster-aware POSIX locks ("plocks") can be turned off by the same means as
UNIX locks - mount with 'noflocks', or create a local-only Ocfs2 volume.
Internally, the file system uses two sets of file_operations, depending on
whether cluster-aware plocks are required. This turns out to be easier than
implementing local-only versions of ->lock.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:57 -07:00
Steven Whitehouse a447c09324 vfs: Use const for kernel parser table
This is a much better version of a previous patch to make the parser
tables constant. Rather than changing the typedef, we put the "const" in
all the various places where it's required, allowing the __initconst
exception for nfsroot which was the cause of the previous trouble.

This was posted for review some time ago and I believe it's been in -mm
since then.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-13 10:10:37 -07:00
Linus Torvalds 20272c8994 Merge branch 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc
* 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc:
  proc: remove kernel.maps_protect
  proc: remove now unneeded ADDBUF macro
  [PATCH] proc: show personality via /proc/pid/personality
  [PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock()
  proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig
  proc: make grab_header() static
  proc: remove unused get_dma_list()
  proc: remove dummy vmcore_open()
  proc: proc_sys_root tweak
  proc: fix return value of proc_reg_open() in "too late" case

Fixed up trivial conflict in removed file arch/sparc/include/asm/dma_32.h
2008-10-13 10:04:04 -07:00
Linus Torvalds 8d71ff0bef Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (24 commits)
  integrity: special fs magic
  As pointed out by Jonathan Corbet, the timer must be deleted before
  ERROR: code indent should use tabs where possible
  The tpm_dev_release function is only called for platform devices, not pnp
  Protect tpm_chip_list when transversing it.
  Renames num_open to is_open, as only one process can open the file at a time.
  Remove the BKL calls from the TPM driver, which were added in the overall
  netlabel: Add configuration support for local labeling
  cipso: Add support for native local labeling and fixup mapping names
  netlabel: Changes to the NetLabel security attributes to allow LSMs to pass full contexts
  selinux: Cache NetLabel secattrs in the socket's security struct
  selinux: Set socket NetLabel based on connection endpoint
  netlabel: Add functionality to set the security attributes of a packet
  netlabel: Add network address selectors to the NetLabel/LSM domain mapping
  netlabel: Add a generic way to create ordered linked lists of network addrs
  netlabel: Replace protocol/NetLabel linking with refrerence counts
  smack: Fix missing calls to netlbl_skbuff_err()
  selinux: Fix missing calls to netlbl_skbuff_err()
  selinux: Fix a problem in security_netlbl_sid_to_secattr()
  selinux: Better local/forward check in selinux_ip_postroute()
  ...
2008-10-13 10:00:44 -07:00
Linus Torvalds 244dc4e54b Merge git://git.infradead.org/users/dwmw2/random-2.6
* git://git.infradead.org/users/dwmw2/random-2.6:
  Fix autoloading of MacBook Pro backlight driver.
  Automatic MODULE_ALIAS() for DMI match tables.
  Remove asm/a.out.h files for all architectures without a.out support.
  Introduce HAVE_AOUT symbol to remove hard-coded arch list for BINFMT_AOUT
  Remove redundant CONFIG_ARCH_SUPPORTS_AOUT
  S390: Update comments about why we don't use <asm-generic/statfs.h>
  SPARC: Use <asm-generic/statfs.h>
  PowerPC: Use <asm-generic/statfs.h>
  PARISC: Use <asm-generic/statfs.h>
  x86_64: Use <asm-generic/statfs.h>
  IA64: Use <asm-generic/statfs.h>
  ARM: Use <asm-generic/statfs.h>
  Make <asm-generic/statfs.h> suitable for 64-bit platforms.
  Define and use PCI_DEVICE_ID_MARVELL_88ALP01_CCIC for CAFÉ camera driver
  [MTD] [NAND] Define and use PCI_DEVICE_ID_MARVELL_88ALP01_NAND for CAFÉ
  Use PCI_DEVICE_ID_88ALP01 for CAFÉ chip, rather than PCI_DEVICE_ID_CAFE.
  EFS: Don't set f_fsid in statfs().
2008-10-13 09:59:14 -07:00
Sukadev Bhattiprolu a6f37daa8b Simplify devpts_pty_kill
When creating a new pty, save the pty's inode in the tty->driver_data.
Use this inode in pty_kill() to identify the devpts instance. Since
we now have the inode for the pty, we can skip get_node() lookup and
remove the unused get_node().

TODO:
	- check if the mutex_lock is needed in pty_kill().

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-13 09:51:43 -07:00
Sukadev Bhattiprolu 89a52e109e Simplify devpts_pty_new()
devpts_pty_new() is called when setting up a new pty, which will
not have an existing dentry or inode yet. So don't bother
looking for an existing dentry - just create a new one.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-13 09:51:43 -07:00
Sukadev Bhattiprolu 527b3e4773 Simplify devpts_get_tty()
As pointed out by H. Peter Anvin, since the inode for the pty is known,
we don't need to look it up.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-13 09:51:43 -07:00
Sukadev Bhattiprolu 15f1a6338d Add an instance parameter devpts interfaces
Pass-in 'inode' or 'tty' parameter to devpts interfaces.  With multiple
devpts instances, these parameters will be used in subsequent patches
to identify the instance of devpts mounted. The parameters also help
simplify devpts implementation.

Changelog[v3]:
	- minor changes due to merge with ttydev updates
	- rename parameters to emphasize they are ptmx or pts inodes
	- pass-in tty_struct * to devpts_pty_kill() (this will help
	  cleanup the get_node() call in a subsequent patch)

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-13 09:51:43 -07:00
Alan Cox 934e6ebf96 tty: Redo current tty locking
Currently it is sometimes locked by the tty mutex and sometimes by the
sighand lock. The latter is in fact correct, and now that we can hand back
referenced objects we can fix this up without problems around sleeping functions.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-13 09:51:41 -07:00
Alan Cox 2cb5998b5f tty: the vhangup syscall is racy
We now have the infrastructure to sort this out, but rather than teaching
the syscall tty lock rules we move the hard work into a tty helper.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-13 09:51:41 -07:00
Alan Cox 452a00d2ee tty: Make get_current_tty use a kref
We now return a kref covered tty reference. That ensures the tty structure
doesn't go away when you have a return from get_current_tty. This is not
enough to protect you from most of the resources being freed behind your
back - yet.

[Updated to include fixes for SELinux problems found by Andrew Morton and
 an s390 leak found while debugging the former]

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-13 09:51:41 -07:00
David Woodhouse e758936e02 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	include/asm-x86/statfs.h
2008-10-13 17:13:56 +01:00
Linus Torvalds 3280fb3139 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix kconfig typo and extra whitespace
  ext4: fix build failure without procfs
  ext4: add an option to control error handling on file data
  jbd2: don't dirty original metadata buffer on abort
  ext4: add checks for errors from jbd2
  jbd2: fix error handling for checkpoint io
  jbd2: abort when failed to log metadata buffers
2008-10-12 16:10:29 -07:00
Mimi Zohar 9256292782 integrity: special fs magic
Discussion on the mailing list questioned the use of these
magic values in userspace, concluding these values are already
exported to userspace via statfs and their correct/incorrect
usage is left up to the userspace application.

  - Move special fs magic number definitions to magic.h
  - Add magic.h include

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>
2008-10-13 09:47:43 +11:00
Jan Engelhardt f319fb8bf6 ext4: fix kconfig typo and extra whitespace
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-12 15:53:01 -04:00
Alexander Beregalov 3244fcb1ae ext4: fix build failure without procfs
fs/ext4/super.c: In function 'ext4_fill_super':
fs/ext4/super.c:2226: error: 'ext4_ui_proc_fops' undeclared (first use
in this function)
fs/ext4/super.c:2226: error: (Each undeclared identifier is reported
only once
fs/ext4/super.c:2226: error: for each function it appears in.)

Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-12 17:27:49 -04:00
Linus Torvalds f1b2a5ace9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] cifs: remove pointless lock and unlock of GlobalMid_Lock in header_assemble
2008-10-12 12:42:36 -07:00
Adrian Bunk 06270d5d6a provide generic_block_fiemap() only with BLOCK=y
This fixes the following compile error with CONFIG_BLOCK=n caused by
commit 68c9d702bb ("generic block based
fiemap implementation"):

    CC      fs/ioctl.o
  fs/ioctl.c: In function 'generic_block_fiemap':
  fs/ioctl.c:249: error: storage size of 'tmp' isn't known
  fs/ioctl.c:272: error: invalid application of 'sizeof' to incomplete type 'struct buffer_head'
  fs/ioctl.c:280: error: implicit declaration of function 'buffer_mapped'
  fs/ioctl.c:249: warning: unused variable 'tmp'
  make[2]: *** [fs/ioctl.o] Error 1

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-12 11:44:37 -07:00
Jeff Layton 14835a3325 [CIFS] cifs: remove pointless lock and unlock of GlobalMid_Lock in header_assemble
We lock GlobalMid_Lock in header_assemble and then immediately unlock it
again without doing anything. Not sure what this was intended to do, but
remove it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-10-12 13:34:11 +00:00
Linus Torvalds fd04808830 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (43 commits)
  ext4: Rename ext4dev to ext4
  ext4: Avoid double dirtying of super block in ext4_put_super()
  Update ext4 MAINTAINERS file
  Hook ext4 to the vfs fiemap interface.
  generic block based fiemap implementation
  ocfs2: fiemap support
  vfs: vfs-level fiemap interface
  ext4: fix xattr deadlock
  jbd2: Fix buffer head leak when writing the commit block
  ext4: Add debugging markers that can be used by systemtap
  jbd2: abort instead of waiting for nonexistent transaction
  ext4: fix initialization of UNINIT bitmap blocks
  ext4: Remove old legacy block allocator
  ext4: Use readahead when reading an inode from the inode table
  ext4: Improve the documentation for ext4's /proc tunables
  ext4: Combine proc file handling into a single set of functions
  ext4: move /proc setup and teardown out of mballoc.c
  ext4: Don't use 'struct dentry' for internal lookups
  ext4/jbd2: Avoid WARN() messages when failing to write to the superblock
  ext4: use percpu data structures for lg_prealloc_list
  ...
2008-10-11 13:23:48 -07:00
Linus Torvalds 86ed5a93b8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] Check that last search entry resume key is valid
  [CIFS] make sure we have the right resume info before calling CIFSFindNext
  [CIFS]  clean up error handling in cifs_unlink
  [CIFS] fix some settings of cifsAttrs after calling SetFileInfo and SetPathInfo
  cifs: explicitly revoke SPNEGO key after session setup
  cifs: Convert cifs to new aops.
  [CIFS] update DOS attributes in cifsInode if we successfully changed them
  cifs: remove NULL termination from rename target in CIFSSMBRenameOpenFIle
  cifs: work around samba returning -ENOENT on SetFileDisposition call
  cifs: fix inverted NULL check after kmalloc
  [CIFS] clean up upcall handling for dns_resolver keys
  [CIFS]  fix busy-file renames and refactor cifs_rename logic
  cifs: add function to set file disposition
  [CIFS] add constants for string lengths of keynames in SPNEGO upcall string
  cifs: move rename and delete-on-close logic into helper function
  cifs: have find_writeable_file prefer filehandles opened by same task
  cifs: don't use GFP_KERNEL with GFP_NOFS
  [CIFS] use common code for turning off ATTR_READONLY in cifs_unlink
  cifs: clean up variables in cifs_unlink
2008-10-11 09:31:53 -07:00
Hidehiro Kawai 5bf5683a33 ext4: add an option to control error handling on file data
If the journal doesn't abort when it gets an IO error in file data
blocks, the file data corruption will spread silently.  Because
most applications and commands do buffered writes without fsync(),
they don't notice the IO error.  It's scary for mission critical
systems.  On the other hand, if the journal aborts whenever it gets
an IO error in file data blocks, the system will easily become
inoperable.  So this patch introduces a filesystem option to
determine whether it aborts the journal or just calls printk() when
it gets an IO error in file data.

If you mount an ext4 fs with the data_err=abort option, it aborts on
file data write errors.  If you mount it with data_err=ignore, it
doesn't abort; it just calls printk().  data_err=ignore is the default.

Here is the corresponding patch for ext3:
http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374
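
The option semantics described above can be sketched in userspace C.
This is a minimal model, not the kernel's option parser:
parse_data_err() and the enum are hypothetical names, and the only
behaviour taken from the message is that data_err=ignore is the default.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical userspace model; not the kernel's mount option parser. */
enum data_err_mode { DATA_ERR_IGNORE, DATA_ERR_ABORT };

/*
 * Scan a comma-separated option string.  If data_err= appears more than
 * once, the last occurrence wins (an assumption of this sketch).
 */
enum data_err_mode parse_data_err(const char *opts)
{
    enum data_err_mode mode = DATA_ERR_IGNORE;  /* documented default */
    const char *p = opts;

    assert(opts != NULL);
    while (*p) {
        size_t len = strcspn(p, ",");           /* length of this option */

        if (len == 14 && strncmp(p, "data_err=abort", 14) == 0)
            mode = DATA_ERR_ABORT;
        else if (len == 15 && strncmp(p, "data_err=ignore", 15) == 0)
            mode = DATA_ERR_IGNORE;
        p += len;
        if (*p == ',')
            p++;                                /* skip the separator */
    }
    return mode;
}
```

A caller would abort the journal on a file data write error only when
the function returns DATA_ERR_ABORT, and otherwise just log the error.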

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 22:12:43 -04:00
Hidehiro Kawai 7ad7445f60 jbd2: don't dirty original metadata buffer on abort
Currently, original metadata buffers are dirtied when they are
unfiled whether the journal has aborted or not.  Eventually these
buffers will be written back to the filesystem by pdflush.  This
means some metadata buffers are written to the filesystem without
journaling if the journal aborts.  So if a journal abort and a
system crash happen at the same time, the filesystem is left in an
inconsistent state.  Additionally, replaying journaled metadata
can partly overwrite the latest metadata on the filesystem,
because when the journal is aborted, journaled metadata is
preserved and replayed during the next mount so that
uncheckpointed metadata is not lost.  This would also break the
consistency of the filesystem.

This patch prevents original metadata buffers from being dirtied
on abort by clearing the BH_JBDDirty flag from those buffers.  Thus,
no metadata buffers are written to the filesystem without journaling.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 20:29:31 -04:00
Hidehiro Kawai 7ffe1ea894 ext4: add checks for errors from jbd2
If the journal has aborted due to a checkpointing failure, we
have to keep the contents of the journal space.  Otherwise, the
filesystem will lose uncheckpointed metadata completely and
become inconsistent.  To avoid this, we need to keep the
needs_recovery flag if checkpointing has failed.

With this patch, ext4_put_super() detects a checkpointing failure
from the return value of journal_destroy(), then it invokes
ext4_abort() to make the filesystem read-only and keep the
needs_recovery flag.  Errors from jbd2_journal_flush() are also
handled by this patch in some places.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 20:29:21 -04:00
Hidehiro Kawai 44519faf22 jbd2: fix error handling for checkpoint io
When a checkpointing IO fails, the current JBD2 code doesn't check
the error and continues journaling.  This means the latest metadata
can be lost from both the journal and the filesystem.

This patch leaves the failed metadata blocks in the journal space
and aborts journaling in the case of jbd2_log_do_checkpoint().
To achieve this, we need to do the following:

1. don't remove the failed buffer from the checkpoint list in the
   case of __try_to_free_cp_buf(), because it may be released or
   overwritten by a later transaction
2. jbd2_log_do_checkpoint() is the last chance: remove the failed
   buffer from the checkpoint list and abort the journal
3. when checkpointing fails, don't update the journal super block, to
   prevent the journaled contents from being cleaned.  For safety,
   don't update j_tail and j_tail_sequence either
4. when checkpointing fails, report the error to the ext4 layer so
   that ext4 doesn't clear the needs_recovery flag; otherwise the
   journaled contents are ignored and cleaned in the recovery phase
5. if the recovery fails, keep the needs_recovery flag
6. prevent jbd2_cleanup_journal_tail() from being called between
   __jbd2_journal_drop_transaction() and jbd2_journal_abort()
   (a possible race issue between jbd2_log_do_checkpoint()s called by
   jbd2_journal_flush() and __jbd2_log_wait_for_space())

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 20:29:13 -04:00
Hidehiro Kawai 77e841de8a jbd2: abort when failed to log metadata buffers
If we fail to write metadata buffers to the journal space but
succeed in writing the commit record, stale data can be written
back to the filesystem as metadata in the recovery phase.

To avoid this, when we fail to write out metadata buffers,
abort the journal before writing the commit record.

We can also avoid this kind of corruption by using the journal
checksum feature, because it can detect invalid metadata blocks in the
journal and prevent them from being replayed.  So we don't need to care
about asynchronous commit record writeout with a checksum.
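
The ordering rule above (abort before the commit record is written) can
be sketched as follows.  All names are hypothetical, and the model
assumes recovery replays only transactions that have a commit record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one transaction's on-disk outcome. */
struct model_txn {
    bool aborted;
    bool commit_written;
};

/*
 * write_errs[i] != 0 models a failed write of metadata block i to the
 * journal space.  Returns 0 on success or the first error seen.
 */
int model_commit(struct model_txn *t, const int *write_errs, size_t n)
{
    assert(t != NULL && (write_errs != NULL || n == 0));

    for (size_t i = 0; i < n; i++) {
        if (write_errs[i] != 0) {
            t->aborted = true;      /* abort first ... */
            return write_errs[i];   /* ... and skip the commit record */
        }
    }
    t->commit_written = true;       /* all metadata reached the journal */
    return 0;
}

/* In this model, recovery replays only committed transactions. */
bool model_would_replay(const struct model_txn *t)
{
    return t->commit_written;
}
```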

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-12 16:39:16 -04:00
Aneesh Kumar K.V a1aebc1e2d ext4: Don't reuse released data blocks until transaction commits
We need to make sure we don't reuse the data blocks released
during the transaction until the transaction commits. We force
this only for ordered and journalled modes. Writeback mode
already doesn't provide data consistency.
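
The rule above can be illustrated with a small userspace model.  It is
a sketch only: the bitmap-style arrays and the model_* functions are
hypothetical and do not reflect the mballoc data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 16

/* Hypothetical allocator state for a tiny filesystem. */
struct model_alloc {
    bool in_use[NBLOCKS];
    bool pending_free[NBLOCKS];  /* released in the running transaction */
};

/* Release a block; it stays unavailable until the commit. */
void model_release(struct model_alloc *a, size_t blk)
{
    assert(a != NULL && blk < NBLOCKS && a->in_use[blk]);
    a->pending_free[blk] = true;
}

/* Return the lowest free block, or -1 if none is available. */
int model_allocate(struct model_alloc *a)
{
    assert(a != NULL);
    for (size_t i = 0; i < NBLOCKS; i++) {
        if (!a->in_use[i]) {
            a->in_use[i] = true;
            return (int)i;
        }
    }
    return -1;
}

/* At commit time the released blocks finally become reusable. */
void model_commit_frees(struct model_alloc *a)
{
    assert(a != NULL);
    for (size_t i = 0; i < NBLOCKS; i++) {
        if (a->pending_free[i]) {
            a->pending_free[i] = false;
            a->in_use[i] = false;
        }
    }
}
```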

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 20:13:31 -04:00
Aneesh Kumar K.V c894058d66 ext4: Use an rbtree for tracking blocks freed during transaction.
With this patch we track the blocks freed during a transaction using
a red-black tree.  We also make sure contiguous freed blocks are
collected in one node in the tree.
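
The merging of contiguous freed blocks can be sketched as below.  To
keep the example short, a sorted array stands in for the red-black
tree, so this shows only the coalescing rule and not the patch's data
structure.  All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EXTENTS 32

struct freed_extent { unsigned long start, count; };

struct freed_set {
    struct freed_extent ext[MAX_EXTENTS];  /* sorted, non-overlapping */
    size_t n;
};

/* Record [start, start + count) as freed.  Returns 0, or -1 if full. */
int freed_set_add(struct freed_set *s, unsigned long start,
                  unsigned long count)
{
    size_t i = 0;

    assert(s != NULL && count > 0);
    while (i < s->n && s->ext[i].start < start)
        i++;                        /* first extent at or after start */

    /* Left neighbour ends exactly where the new range begins. */
    if (i > 0 && s->ext[i - 1].start + s->ext[i - 1].count == start) {
        s->ext[i - 1].count += count;
        /* The grown extent may now touch its right neighbour too. */
        if (i < s->n && s->ext[i - 1].start + s->ext[i - 1].count
                        == s->ext[i].start) {
            s->ext[i - 1].count += s->ext[i].count;
            memmove(&s->ext[i], &s->ext[i + 1],
                    (s->n - i - 1) * sizeof(s->ext[0]));
            s->n--;
        }
        return 0;
    }
    /* New range ends exactly where the right neighbour begins. */
    if (i < s->n && start + count == s->ext[i].start) {
        s->ext[i].start = start;
        s->ext[i].count += count;
        return 0;
    }
    if (s->n == MAX_EXTENTS)
        return -1;
    /* No neighbour to merge with: insert a new extent at position i. */
    memmove(&s->ext[i + 1], &s->ext[i], (s->n - i) * sizeof(s->ext[0]));
    s->ext[i].start = start;
    s->ext[i].count = count;
    s->n++;
    return 0;
}
```

Freeing 10-11 and 20-24 gives two extents in this model; freeing 12-19
afterwards bridges them into a single extent covering 10-24.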

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-16 10:14:27 -04:00
Aneesh Kumar K.V c2774d84fd ext4: Do mballoc init before doing filesystem recovery
During filesystem recovery we may be doing a truncate
which expects some of the mballoc data structures to
be initialized. So do ext4_mb_init before recovery.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 20:07:20 -04:00
Aneesh Kumar K.V 688f05a019 ext4: Free ext4_prealloc_space using kmem_cache_free
We should use kmem_cache_free to free memory allocated
via kmem_cache_alloc.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-13 12:14:14 -04:00
Manish Katiyar 473dc8eddb ext4: Fix Kconfig typo for ext4dev
Looks like there is one more instance where ext4dev should be changed
to ext4 because the module name will be "ext4" unless EXT4DEV_COMPAT
is selected.

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-13 09:01:02 -04:00
Martin Michlmayr cab893d909 ext4: Remove an old reference to ext4dev in Makefile comment
Remove an old reference to ext4dev.

Signed-off-by: Martin Michlmayr <tbm@cyrius.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-17 15:03:38 -04:00
Theodore Ts'o 03010a3350 ext4: Rename ext4dev to ext4
The ext4 filesystem is getting stable enough that it's time to drop
the "dev" suffix.  Also remove the requirement for the TEST_FILESYS
flag.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 20:02:48 -04:00