WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Sage Weil	0a990e7093	ceph: clean up service ticket decoding Previously we would decode state directly into our current ticket_handler. This is problematic if for some reason we fail to decode, because we end up with half new state and half old state. We are probably already in bad shape if we get an update we can't decode, but we may as well be tidy anyway. Decode into new_* temporaries and update the ticket_handler only on success. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-23 07:46:47 -07:00
Dan Carpenter	99b437a925	AFS: Potential null dereference It seems clear from the surrounding code that xpermits is allowed to be NULL here. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-22 09:57:19 -07:00
Jeff Layton	556ae3bb32	NFS: don't try to decode GETATTR if DELEGRETURN returned error The reply parsing code attempts to decode the GETATTR response even if the DELEGRETURN portion of the compound returned an error. The GETATTR response won't actually exist if that's the case and we're asking the parser to read past the end of the response. This bug is fairly benign. The parser catches this without reading past the end of the response and decode_getfattr returns -EIO. Earlier kernels however had decode_op_hdr using the READ_BUF macro, and this bug would make this printk pop any time the client got an error from a delegreturn: kernel: decode_op_hdr: reply buffer overflowed in line XXXX More recent kernels seem to have replaced this printk with a dprintk. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-03-22 05:34:13 -04:00
Ryusuke Konishi	2d8428acae	nilfs2: fix duplicate call to nilfs_segctor_cancel_freev Andreas Beckmann gave me a report that nilfs logged the following warnings when it got a disk full: nilfs_sufile_do_cancel_free: segment 0 must be clean nilfs_sufile_do_cancel_free: segment 1 must be clean These arise from a duplicate call to nilfs_segctor_cancel_freev in an error path of log writer. This will fix the issue. Reported-by: Andreas Beckmann <debian@abeckmann.de> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-03-22 14:41:07 +09:00
Sage Weil	5b3dbb44ab	ceph: release old ticket_blob buffer Release the old ticket_blob buffer when we get an updated service ticket from the monitor. Previously these were getting leaked. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-20 21:33:11 -07:00
Sage Weil	807c86e2ce	ceph: fix authenticator buffer size calculation The buffer size was incorrectly calculated for the ceph_x_encrypt() encapsulated ticket blob. Use a helper (with correct arithmetic) and BUG out if we were wrong. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-20 21:33:10 -07:00
Sage Weil	63733a0fc5	ceph: fix authenticator timeout We were failing to reconnect to services due to an old authenticator, even though we had the new ticket, because we weren't properly retrying the connect handshake, because we were calling an old/incorrect helper that left in_base_pos incorrect. The result was a failure to reconnect to the OSD or MDS (with an authentication error) if the MDS restarted after the service had been up a few hours (long enough for the original authenticator to be invalid). This was only a problem if the AUTH_X authentication was enabled. Now that the 'negotiate' and 'connect' stages are fully separated, we should use the prepare_read_connect() helper instead, and remove the obsolete one. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-20 21:33:09 -07:00
Sage Weil	8b218b8a4a	ceph: fix inode removal from snap realm when racing with migration When an inode was dropped while being migrated between two MDSs, i_cap_exporting_issued was non-zero such that issue caps were non-zero and __ceph_is_any_caps(ci) was true. This prevented the inode from being removed from the snap realm, even as it was dropped from the cache. Fix this by dropping any residual i_snap_realm ref in destroy_inode. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-20 21:33:08 -07:00
Sage Weil	052bb34af3	ceph: add missing locking to protect i_snap_realm_item during split All ci->i_snap_realm_item/realm->inodes_with_caps manipulation should be protected by realm->inodes_with_caps_lock. This bug would have only bit us in a rare race with a realm split (during some snap creations). Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-20 21:33:07 -07:00
Sage Weil	978097c907	ceph: implemented caps should always be superset of issued caps Added assertion, and cleared one case where the implemented caps were not following the issued caps. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-20 21:33:06 -07:00
Tao Ma	b23179681c	ocfs2: Init meta_ac properly in ocfs2_create_empty_xattr_block. You can't store a pointer that you haven't filled in yet and expect it to work. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-03-19 14:53:52 -07:00
Tao Ma	dfe4d3d6a6	ocfs2: Fix the update of name_offset when removing xattrs When replacing a xattr's value, in some case we wipe its name/value first and then re-add it. The wipe is done by ocfs2_xa_block_wipe_namevalue() when the xattr is in the inode or block. We currently adjust name_offset for all the entries which have (offset < name_offset). This does not adjust the entrie we're replacing. Since we are replacing the entry, we don't adjust the total entry count. When we calculate a new namevalue location, we trust the entries now-wrong offset in ocfs2_xa_get_free_start(). The solution is to also adjust the name_offset for the replaced entry, allowing ocfs2_xa_get_free_start() to calculate the new namevalue location correctly. The following script can trigger a kernel panic easily. echo 'y'\|mkfs.ocfs2 --fs-features=local,xattr -b 4K $DEVICE mount -t ocfs2 $DEVICE $MNT_DIR FILE=$MNT_DIR/$RANDOM for((i=0;i<76;i++)) do string_76="a$string_76" done string_78="aa$string_76" string_82="aaaa$string_78" touch $FILE setfattr -n 'user.test1234567890' -v $string_76 $FILE setfattr -n 'user.test1234567890' -v $string_78 $FILE setfattr -n 'user.test1234567890' -v $string_82 $FILE Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-03-19 14:53:51 -07:00
Trond Myklebust	d812e57582	NFS: Prevent another deadlock in nfs_release_page() We should not attempt to free the page if __GFP_FS is not set. Otherwise we can deadlock as per http://bugzilla.kernel.org/show_bug.cgi?id=15578 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2010-03-19 13:55:17 -04:00
Linus Torvalds	fc7f99cf36	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (205 commits) ceph: update for write_inode API change ceph: reset osd after relevant messages timed out ceph: fix flush_dirty_caps race with caps migration ceph: include migrating caps in issued set ceph: fix osdmap decoding when pools include (removed) snaps ceph: return EBADF if waiting for caps on closed file ceph: set osd request message front length correctly ceph: reset front len on return to msgpool; BUG on mismatched front iov ceph: fix snaptrace decoding on cap migration between mds ceph: use single osd op reply msg ceph: reset bits on connection close ceph: remove bogus mds forward warning ceph: remove fragile __map_osds optimization ceph: fix connection fault STANDBY check ceph: invalidate_authorizer without con->mutex held ceph: don't clobber write return value when using O_SYNC ceph: fix client_request_forward decoding ceph: drop messages on unregistered mds sessions; cleanup ceph: fix comments, locking in destroy_inode ceph: move dereference after NULL test ... Fix trivial conflicts in Documentation/ioctl/ioctl-number.txt	2010-03-19 09:43:06 -07:00
Linus Torvalds	0a492fdef8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: trivial white space [CIFS] checkpatch cleanup cifs: add cifs_revalidate_file cifs: add a CIFSSMBUnixQFileInfo function cifs: add a CIFSSMBQFileInfo function cifs: overhaul cifs_revalidate and rename to cifs_revalidate_dentry	2010-03-19 09:36:18 -07:00
Jens Axboe	b4b7a4ef09	Merge branch 'master' into for-linus Conflicts: block/Kconfig Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-03-19 08:05:10 +01:00
Linus Torvalds	441f4058a0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (30 commits) Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree Btrfs: allow treeid==0 in the inode lookup ioctl Btrfs: return keys for large items to the search ioctl Btrfs: fix key checks and advance in the search ioctl Btrfs: buffer results in the space_info ioctl Btrfs: use __u64 types in ioctl.h Btrfs: fix search_ioctl key advance Btrfs: fix gfp flags masking in the compression code Btrfs: don't look at bio flags after submit_bio btrfs: using btrfs_stack_device_id() get devid btrfs: use memparse Btrfs: add a "df" ioctl for btrfs Btrfs: cache the extent state everywhere we possibly can V2 Btrfs: cache ordered extent when completing io Btrfs: cache extent state in find_delalloc_range Btrfs: change the ordered tree to use a spinlock instead of a mutex Btrfs: finish read pages in the order they are submitted btrfs: fix btrfs_mkdir goto for no free objectids Btrfs: flush data on snapshot creation Btrfs: make df be a little bit more understandable ...	2010-03-18 16:50:55 -07:00
Linus Torvalds	7c34691abe	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: ensure bdi_unregister is called on mount failure. NFS: Avoid a deadlock in nfs_release_page NFSv4: Don't ignore the NFS_INO_REVAL_FORCED flag in nfs_revalidate_inode() nfs4: Make the v4 callback service hidden nfs: fix unlikely memory leak rpc client can not deal with ENOSOCK, so translate it into ENOCONN	2010-03-18 16:50:09 -07:00
Linus Torvalds	01d61d0d64	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: don't warn about page discards on shutdown xfs: use scalable vmap API xfs: remove old vmap cache	2010-03-18 16:46:05 -07:00
Mark Fasheh	b22b63ebaf	ocfs2: Always try for maximum bits with new local alloc windows What we were doing before was to ask for the current window size as the maximum allocation. This had the effect of limiting the amount of allocation we could get for the local alloc during times when the window size was shrunk due to fragmentation. In some cases, that could actually increase fragmentation by artificially limiting the number of bits we can accept. So while we still want to ask for a minimum number of bits equal to window size, there is no reason why we should limit the number of bits the local alloc should accept. Hence always allow the maximum number of local alloc bits. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-03-18 13:22:42 -07:00
Chris Mason	8ad6fcab56	Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree This is used by the inode lookup ioctl to follow all the backrefs up to the subvol root. But the search being done would sometimes land one past the last item in the leaf instead of finding the backref. This changes the search to look for the highest possible backref and hop back one item. It also fixes a leaked path on failure to find the root. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-18 12:23:10 -04:00
Chris Mason	1b53ac4d1b	Btrfs: allow treeid==0 in the inode lookup ioctl When a root id of 0 is sent to the inode lookup ioctl, it will use the root of the file we're ioctling and pass the root id back to userland along with the results. This allows userland to do searches based on that root later on. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-18 12:17:05 -04:00
Chris Mason	90fdde147f	Btrfs: return keys for large items to the search ioctl The search ioctl was skipping large items entirely (ones that are too big for the results buffer). This changes things to at least copy the item header so that we can send information about the item back to userland. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-18 12:14:54 -04:00
Chris Mason	abc6e1341b	Btrfs: fix key checks and advance in the search ioctl The search ioctl was working well for finding tree roots, but using it for generic searches requires a few changes to how the keys are advanced. This treats the search control min fields for objectid, type and offset more like a key, where we drop the offset to zero once we bump the type, etc. The downside of this is that we are changing the min_type and min_offset fields during the search, and so the ioctl caller needs extra checks to make sure the keys in the result are the ones it wanted. This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make things more readable. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-18 12:10:08 -04:00
Akinobu Mita	c4af96449e	ntfs: use bitmap_weight Use bitmap_weight() instead of doing hweight32() for each u32 element in the page. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-17 18:43:47 -07:00
Venkatesh Pallipadi	bcc54e2a6d	jffs2: fix up rb_root initializations to use RB_ROOT jffs2 uses rb_node = NULL; to zero rb_root. The problem with this is that `17d9ddc72f` ("rbtree: Add support for augmented rbtrees") in the linux-next tree adds a new field to that struct which needs to be NULL as well. This patch uses RB_ROOT as the intializer so all of the relevant fields will be NULL'd. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Eric Paris <eparis@redhat.com> Acked-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-17 18:43:47 -07:00
Mark Fasheh	fcefd25ac8	ocfs2: set i_mode on disk during acl operations ocfs2_set_acl() and ocfs2_init_acl() were setting i_mode on the in-memory inode, but never setting it on the disk copy. Thus, acls were some times not getting propagated between nodes. This patch fixes the issue by adding a helper function ocfs2_acl_set_mode() which does this the right way. ocfs2_set_acl() and ocfs2_init_acl() are then updated to call ocfs2_acl_set_mode(). Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-03-17 12:28:22 -07:00
Tao Ma	6527f8f848	ocfs2: Update i_blocks in reflink operations. In reflink, we need to upate i_blocks for the target inode. Reported-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-03-17 12:28:00 -07:00
Tao Ma	78c37eb0d5	ocfs2: Change bg_chain check for ocfs2_validate_gd_parent. In ocfs2_validate_gd_parent, we check bg_chain against the cl_next_free_rec of the dinode. Actually in resize, we have the chance of bg_chain == cl_next_free_rec. So add some additional condition check for it. I also rename paramter "clean_error" to "resize", since the old one is not clearly enough to indicate that we should only meet with this case in resize. btw, the correpsonding bug is http://oss.oracle.com/bugzilla/show_bug.cgi?id=1230. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-03-17 12:07:21 -07:00
Sachin Prabhu	ee860b6a65	[PATCH] Skip check for mandatory locks when unlocking ocfs2_lock() will skip locks on file which has mode set to 02666. This is a problem in cases where the mode of the file is changed after a process has obtained a lock on the file. ocfs2_lock() should skip the check for mandatory locks when unlocking a file. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2010-03-17 12:07:16 -07:00
Dave Chinner	e8c3753ce4	xfs: don't warn about page discards on shutdown If we are doing a forced shutdown, we can get lots of noise about delalloc pages being discarded. This is happens by design during a forced shutdown, so don't spam the logs with these messages. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-16 15:40:53 -05:00
Alex Elder	8a262e573d	xfs: use scalable vmap API Re-apply a commit that had been reverted due to regressions that have since been fixed. From `95f8e302c0` Mon Sep 17 00:00:00 2001 From: Nick Piggin <npiggin@suse.de> Date: Tue, 6 Jan 2009 14:43:09 +1100 Implement XFS's large buffer support with the new vmap APIs. See the vmap rewrite (`db64fe02`) for some numbers. The biggest improvement that comes from using the new APIs is avoiding the global KVA allocation lock on every call. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Only modifications here were a minor reformat, plus making the patch apply given the new use of xfs_buf_is_vmapped(). Modified-by: Alex Elder <aelder@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-16 15:40:36 -05:00
Alex Elder	cd9640a70d	xfs: remove old vmap cache Re-apply a commit that had been reverted due to regressions that have since been fixed. Original commit: `d2859751cd` Author: Nick Piggin <npiggin@suse.de> Date: Tue, 6 Jan 2009 14:40:44 +1100 XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps track of them in a list. To purge the batch, it just goes through the list and calls vunamp on each one. This is pretty poor: a global TLB flush is generally still performed on each vunmap, with the most expensive parts of the operation being the broadcast IPIs and locking involved in the SMP callouts, and the locking involved in the vmap management -- none of these are avoided by just batching up the calls. I'm actually surprised it ever made much difference. (Now that the lazy vmap allocator is upstream, this description is not quite right, but the vunmap batching still doesn't seem to do much). Rip all this logic out of XFS completely. I will improve vmap performance and scalability directly in subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> The only change I made was to use the "new" xfs_buf_is_vmapped() function in a place it had been open-coded in the original. Modified-by: Alex Elder <aelder@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-16 15:40:19 -05:00
Chris Mason	7fde62bffb	Btrfs: buffer results in the space_info ioctl The space_info ioctl was using copy_to_user inside rcu_read_lock. This commit changes things to copy into a buffer first and then dump the result down to userland. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-16 15:40:10 -04:00
Sage Weil	ce769a2904	Btrfs: use __u64 types in ioctl.h Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-16 14:24:27 -04:00
Sage Weil	854d2c3531	Btrfs: fix search_ioctl key advance key->type is u8, not u64. fs/btrfs/ioctl.c: In function 'copy_to_sk': fs/btrfs/ioctl.c:1024: warning: comparison is always true due to limited range of data type Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-16 14:24:27 -04:00
NeilBrown	cfbc0683af	NFS: ensure bdi_unregister is called on mount failure. bdi_unregister is called by nfs_put_super which is only called by generic_shutdown_super if ->s_root is not NULL. So if we error out in a circumstance where we called nfs_bdi_register (i.e. server != NULL) but have not set s_root, then we need to call bdi_unregister explicitly in nfs_get_sb and various other *_get_sb() functions. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-03-15 15:37:45 -04:00
Dan Carpenter	8212cf7583	cifs: trivial white space I fixed the indent level. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-03-15 15:19:47 +00:00
Nick Piggin	ef5780c018	Btrfs: fix gfp flags masking in the compression code GFP_FS must be masked out, NOFS can't be or'd in. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:05:57 -04:00
Chris Mason	5ff7ba3a79	Btrfs: don't look at bio flags after submit_bio After callling submit_bio, the bio can be freed at any time. The btrfs submission thread helper was checking the bio flags too late, which might not give the correct answer. When CONFIG_DEBUG_PAGE_ALLOC is turned on, it can lead to oopsen. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:15 -04:00
Xiao Guangrong	a343832f1a	btrfs: using btrfs_stack_device_id() get devid We can use btrfs_stack_device_id() to get dev_item->devid Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:14 -04:00
Akinobu Mita	91748467a5	btrfs: use memparse Use memparse() instead of its own private implementation. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:14 -04:00
Josef Bacik	1406e4327b	Btrfs: add a "df" ioctl for btrfs df is a very loaded question in btrfs. This gives us a way to get the per-space usage information so we can tell exactly what is in use where. This will help us figure out ENOSPC problems, and help users better understand where their disk space is going. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:14 -04:00
Josef Bacik	2ac55d41b5	Btrfs: cache the extent state everywhere we possibly can V2 This patch just goes through and fixes everybody that does lock_extent() blah unlock_extent() to use lock_extent_bits() blah unlock_extent_cached() and pass around a extent_state so we only have to do the searches once per function. This gives me about a 3 mb/s boots on my random write test. I have not converted some things, like the relocation and ioctl's, since they aren't heavily used and the relocation stuff is in the middle of being re-written. I also changed the clear_extent_bit() to only unset the cached state if we are clearing EXTENT_LOCKED and related stuff, so we can do things like this lock_extent_bits() clear delalloc bits unlock_extent_cached() without losing our cached state. I tested this thoroughly and turned on LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out fine. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:13 -04:00
Josef Bacik	5a1a3df1f6	Btrfs: cache ordered extent when completing io When finishing io we run btrfs_dec_test_ordered_pending, and then immediately run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does that already, so we're searching twice when we don't have to. This patch lets us pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending so if we do complete io on that ordered extent we can just use the one we found then instead of having to do another btrfs_lookup_ordered_extent. This made my fio job with the other patch go from 24 mb/s to 29 mb/s. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:13 -04:00
Josef Bacik	c2a128d28a	Btrfs: cache extent state in find_delalloc_range This patch makes us cache the extent state we find in find_delalloc_range since we'll have to lock the extent later on in the function. This will keep us from re-searching for the rang when we try to lock the extent. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:13 -04:00
Josef Bacik	49958fd7db	Btrfs: change the ordered tree to use a spinlock instead of a mutex The ordered tree used to need a mutex, but currently all we use it for is to protect the rb_tree, and a spin_lock is just fine for that. Using a spin_lock instead makes dbench run a little faster, 58 mb/s instead of 51 mb/s, and have less latency, 3445.138 ms instead of 3820.633 ms. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:12 -04:00
Chris Mason	4125bf761c	Btrfs: finish read pages in the order they are submitted The endio is done at reverse order of bio vectors. That means for a sequential read, the page first submitted will finish last in a bio. Considering we will do checksum (making cache hot) for every page, this does introduce delay (and chance to squeeze cache used soon) for pages submitted at the begining. I don't observe obvious performance difference with below patch at my simple test, but seems more natural to finish read in the order they are submitted. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:12 -04:00
Miao Xie	0be2e98173	btrfs: fix btrfs_mkdir goto for no free objectids btrfs_mkdir() must jump to the place of ending transaction after btrfs_find_free_objectid() failed. Or this transaction can't end. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:11 -04:00
Sage Weil	0bdb1db297	Btrfs: flush data on snapshot creation Flush any delalloc extents when we create a snapshot, so that recently written file data is always included in the snapshot. A later commit will add the ability to snapshot without the flush, but most people expect flushing. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:11 -04:00
Josef Bacik	bd4d108889	Btrfs: make df be a little bit more understandable The way we report df usage is way confusing for everybody, including some other utilities (bacula for one). So this patch makes df a little bit more understandable. First we make used actually count the total amount of used space in all space info's. This will give us a real view of how much disk space is in use. Second, for blocks available, only count data space. This makes things like bacula work because it says 0 when you can no longer write anymore data to the disk. I think this is a nice compromise, since you will end up with something like the following [root@alpha ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/VolGroup-lv_root 148G 30G 111G 21% / /dev/sda1 194M 116M 68M 64% /boot tmpfs 985M 12K 985M 1% /dev/shm /dev/mapper/VolGroup-LogVol02 145G 140G 0 100% /mnt/btrfs-test Compare this with btrfsctl -i output [root@alpha btrfs-progs-unstable]# ./btrfsctl -i /mnt/btrfs-test/ Metadata, DUP: total=4.62GB, used=2.46GB System, DUP: total=8.00MB, used=24.00KB Data: total=134.80GB, used=134.80GB Metadata: total=8.00MB, used=0.00 System: total=4.00MB, used=0.00 operation complete This way we show that there is no more data space to be used, but we have another 5GB of space left for metadata. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:11 -04:00
TARUISI Hiroaki	3a0524dc05	btrfs: Update existing btrfs_device for renaming device When we scan devices in a multi-device filesystem, we memorize the original name. If the device gets a new name, later scans don't update the in-kernel structures related to it, and we're not able to mount the filesystem. This patch updates device name during scaning. Signed-off-by: TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:10 -04:00
Chris Mason	1e701a3292	Btrfs: add new defrag-range ioctl. The btrfs defrag ioctl was limited to doing the entire file. This commit adds a new interface that can defrag a specific range inside the file. It can also force compression on the file, allowing you to selectively compress individual files after they were created, even when mount -o compress isn't turned on. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:10 -04:00
Chris Mason	940100a4a7	Btrfs: be more selective in the defrag ioctl The btrfs defrag ioctl had some bugs around delalloc accounting, and it wasn't properly skipping pages that were not in the mapping. It wasn't properly clearing the page checked flag, which could make the writeback code ignore the page forever while pinning it as dirty. This commit fixes those problems and makes defrag a little smarter. It skips holes and it doesn't waste time defragging large extents. If a tiny extent comes before a very large extent, it will defrag both of them to make sure the tiny extent ends up next to something big. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:10 -04:00
Chris Mason	51684082b1	Btrfs: run the backing dev more often in the submit_bio helper The submit_bio helper thread can decide to loop back around to service more bios. This commit forces it to unplug first, which helps reduce the latency seen by submitters. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:09 -04:00
Josef Bacik	4849f01d15	Btrfs: make subvolid=0 mount the original default root Since theres not a good way to make sure the user sees the original default root tree id, and not to mention it's 5 so is way different than any other volume, just make subvol=0 mount the original default root. This makes it a bit easier for users to handle in the long run. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:09 -04:00
Josef Bacik	6ef5ed0d38	Btrfs: add ioctl and incompat flag to set the default mount subvol This patch needs to go along with my previous patch. This lets us set the default dir item's location to whatever root we want to use as our default mounting subvol. With this we don't have to use mount -o subvol=<tree id> anymore to mount a different subvol, we can just set the new one and it will just magically work. I've done some moderate testing with this, mostly just switching the default mount around, mounting subvols and the default mount at the same time and such, everything seems to work. Thanks, Older kernels would generally be able to still mount the filesystem with the default subvolume set, but it would result in a different volume being mounted, which could be an even more unpleasant suprise for users. So if you set your default subvolume, you can't go back to older kernels. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 11:00:08 -04:00
Josef Bacik	73f73415ca	Btrfs: change how we mount subvolumes This work is in preperation for being able to set a different root as the default mounting root. There is currently a problem with how we mount subvolumes. We cannot currently mount a subvolume of a subvolume, you can only mount subvolumes/snapshots of the default subvolume. So say you take a snapshot of the default subvolume and call it snap1, and then take a snapshot of snap1 and call it snap2, so now you have / /snap1 /snap1/snap2 as your available volumes. Currently you can only mount / and /snap1, you cannot mount /snap1/snap2. To fix this problem instead of passing subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is the tree id that gets spit out via the subvolume listing you get from the subvolume listing patches (btrfs filesystem list). This allows us to mount /, /snap1 and /snap1/snap2 as the root volume. In addition to the above, we also now read the default dir item in the tree root to get the root key that it points to. For now this just points at what has always been the default subvolme, but later on I plan to change it to point at whatever root you want to be the new default root, so you can just set the default mount and not have to mount with -o subvolid=<treeid>. I tested this out with the above scenario and it worked perfectly. Thanks, mount -o subvol operates inside the selected subvolid. For example: mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt /mnt will have the snap1 directory for the subvolume with id 256. mount -o subvol=snap /dev/xxx /mnt /mnt will be the snap directory of whatever the default subvolume is. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 10:58:13 -04:00
Josef Bacik	12534832cb	Btrfs: make set/get functions for the super compat_ro flags use compat_ro Our set/get functions for compat_ro_flags actually look at compat_flags. This will mess any attempt to use compat flags up. The fix is obvious. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 10:55:10 -04:00
Chris Mason	ac8e9819d7	Btrfs: add search and inode lookup ioctls The search ioctl is a generic tool for doing btree searches from userland applications. The first user of the search ioctl is a subvolume listing feature, but we'll also use it to find new files in a subvolume. The search ioctl allows you to specify min and max keys to search for, along with min and max transid. It returns the items along with a header that includes the item key. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 10:55:10 -04:00
TARUISI Hiroaki	98d377a089	Btrfs: add a function to lookup a directory path by following backrefs This will be used by the inode lookup ioctl. Signed-off-by: TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-15 10:55:09 -04:00
Jan Kara	d330a5befb	ext4: Fix estimate of # of blocks needed to write indirect-mapped files http://bugzilla.kernel.org/show_bug.cgi?id=15420 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-03-14 18:17:54 -04:00
Linus Torvalds	f901e75392	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: remove whitespaces before quoted newlines nilfs2: remove spaces before tabs nilfs2: fix various typos in comments nilfs2: fix typo "cout" -> "count" in error message nilfs2: fix function name typos in docbook comments nilfs2: fix discrepancy in use of static specifier	2010-03-14 11:13:24 -07:00
Linus Torvalds	ceb804cd0f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: 9p: Skip check for mandatory locks when unlocking 9p: Fixes a simple bug enabling writes beyond 2GB. 9p: Change the name of new protocol from 9p2010.L to 9p2000.L fs/9p: re-init the wstat in readdir loop net/9p: Add sysfs mount_tag file for virtio 9P device net/9p: Use the tag name in the config space for identifying mount point	2010-03-14 11:11:08 -07:00
Ryusuke Konishi	c91cea11df	nilfs2: remove whitespaces before quoted newlines This kills the following checkpatch warnings: WARNING: unnecessary whitespace before a quoted newline #869: FILE: super.c:869: + "remount to a different snapshot. \n", WARNING: unnecessary whitespace before a quoted newline #389: FILE: the_nilfs.c:389: + printk(KERN_ERR "NILFS: too short segment. \n"); Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-03-14 10:29:51 +09:00
Ryusuke Konishi	55480a06e9	nilfs2: remove spaces before tabs This kills the following checkpatch warnings: WARNING: please, no space before tabs #74: FILE: segment.h:74: +^Iunsigned ^I^Iflags;$ WARNING: please, no space before tabs #35: FILE: segbuf.c:35: +^Iint ^I^I^Istart, end; /* The region to be submitted */$ Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-03-14 10:29:51 +09:00
Ryusuke Konishi	7a65004bba	nilfs2: fix various typos in comments This fixes various typos I found in comments of nilfs2. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-03-14 10:29:51 +09:00
Ryusuke Konishi	1621562b6a	nilfs2: fix typo "cout" -> "count" in error message Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-03-14 10:29:50 +09:00
Ryusuke Konishi	9ccf56c138	nilfs2: fix function name typos in docbook comments Fixes the following typos in docbook comments: nilfs_detroy_transaction_cache -> nilfs_destroy_transaction_cache nilfs_secgtor_start_timer -> nilfs_segctor_start_timer Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-03-14 10:29:50 +09:00
Ryusuke Konishi	6c477d44a7	nilfs2: fix discrepancy in use of static specifier Two segbuf functions, nilfs_segbuf_write and nilfs_segbuf_wait, are declared with the static storage class specifier, but their implementations are not. This fixes the discrepancy. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-03-14 10:27:27 +09:00
Linus Torvalds	8cea4eb642	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Skip check for mandatory locks when unlocking GFS2: Allow the number of committed revokes to temporarily be negative GFS2: do not select QUOTA	2010-03-13 14:38:53 -08:00
Sachin Prabhu	f78233dd44	9p: Skip check for mandatory locks when unlocking While investigating a bug, I came across a possible bug in v9fs. The problem is similar to the one reported for NFS by ASANO Masahiro in http://lkml.org/lkml/2005/12/21/334. v9fs_file_lock() will skip locks on file which has mode set to 02666. This is a problem in cases where the mode of the file is changed after a process has obtained a lock on the file. Such a lock will be skipped during unlock and the machine will end up with a BUG in locks_remove_flock(). v9fs_file_lock() should skip the check for mandatory locks when unlocking a file. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-03-13 09:05:37 -06:00
jvrao	fc0f296126	9p: Fixes a simple bug enabling writes beyond 2GB. Fixes a simple bug so that large files beyond 2GB can be created. Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-03-13 08:59:54 -06:00
Sripathi Kodi	45bc21edb5	9p: Change the name of new protocol from 9p2010.L to 9p2000.L This patch changes the name of the new 9P protocol from 9p2010.L to 9p2000.u. This is because we learnt that the name 9p2010 is already being used by others. Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-03-13 08:57:29 -06:00
Aneesh Kumar K.V	fae4528b23	fs/9p: re-init the wstat in readdir loop This ensure that on failure when we free the stat buf we don't end up freeing an already freed pointer in the earlier loop Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-03-13 08:57:29 -06:00
Linus Torvalds	9d85929fef	Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6: fat: Fix stat->f_namelen fat: Fix vfat_lookup()	2010-03-12 16:35:21 -08:00
Eric Paris	3836a03d97	anon_inodes: mark the anon inode private Inotify was switched to use anon_inode instead of its own private filesystem which only had one inode in commit `c44dcc56d2` "switch inotify_user to anon_inode" The problem with this is that now the inotify inode is not a distinct inode which can be managed by LSMs. userspace tools which use inotify were allowed to use the inotify inode but may not have had permission to do read/write type operations on the anon_inode. After looking at the anon_inode and its users it looks like the best solution is to just mark the anon_inode as S_PRIVATE so the security system will ignore it. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 16:25:23 -08:00
Linus Torvalds	83c0fb6500	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6: udf: use ext2_find_next_bit udf: Do not read inode before writing it udf: Fix unalloc space handling in udf_update_inode	2010-03-12 16:22:50 -08:00
Linus Torvalds	c32da02342	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits) doc: fix typo in comment explaining rb_tree usage Remove fs/ntfs/ChangeLog doc: fix console doc typo doc: cpuset: Update the cpuset flag file Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed Remove drivers/parport/ChangeLog Remove drivers/char/ChangeLog doc: typo - Table 1-2 should refer to "status", not "statm" tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h devres/irq: Fix devm_irq_match comment Remove reference to kthread_create_on_cpu tree-wide: Assorted spelling fixes tree-wide: fix 'lenght' typo in comments and code drm/kms: fix spelling in error message doc: capitalization and other minor fixes in pnp doc devres: typo fix s/dev/devm/ Remove redundant trailing semicolons from macros fix typo "definetly" -> "definitely" in comment tree-wide: s/widht/width/g typo in comments ... Fix trivial conflict in Documentation/laptops/00-INDEX	2010-03-12 16:04:50 -08:00
Evgeniy Dushistov	ad25ad979a	ufs: make solaris fsck happy Alex Viskovatoff let me know that after copying data to solaris's ufs from linux, solaris's fsck sees some errors in cylinder summary information. This is because of solaris expects find some data on another places, then curernt implementation save it. This patch fixes this issue. It is tested by me, and also Alex reported that it works for him. Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru> Reported-by: Alex Viskovatoff <viskovatoff@imap.cc> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:35 -08:00
Alex Viskovatoff	b3a0fd4d87	fs/ufs: recognize Solaris-specific file system state Recent releases of Solaris set the fs_clean state of an unmounted UFS file system as FSLOG ("logging fs"). However, the Linux kernel currently does not recognize the value which represents this state. Thus, attempting to mount such a file system rw produces the message kernel: ufs_read_super: can't grok fs_clean 0xfffffffd and the file system is mounted read-only. This patch makes the kernel recognize that value. Signed-off-by: Alex Viskovatoff <viskovatoff@imap.cc> Cc: Evgeniy Dushistov <dushistov@mail.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:35 -08:00
Christoph Hellwig	5d0e52830e	Add generic sys_old_select() Add a generic implementation of the old select() syscall, which expects its argument in a memory block and switch all architectures over to use it. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Morris <jmorris@namei.org> Acked-by: Andreas Schwab <schwab@linux-m68k.org> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Greg Ungerer <gerg@uclinux.org> Acked-by: David Howells <dhowells@redhat.com> Cc: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:32 -08:00
Richard Kennedy	019b4d123a	fs: buffer_head: remove kmem_cache constructor to reduce memory usage under slub When using slub, having a kmem_cache constructor forces slub to add a free pointer to the size of the cached object, which can have a significant impact to the number of small objects that can fit into a slab. As buffer_head is relatively small and we can have large numbers of them, removing the constructor is a definite win. On x86_64 removing the constructor gives me 39 objects/slab, 3 more than without the patch. And on x86_32 73 objects/slab, which is 9 more. As alloc_buffer_head() already initializes each new object there is very little difference in actual code run. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <jens.axboe@oracle.com> Acked-by: Nick Piggin <npiggin@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:27 -08:00
Joe Perches	03affdef4f	fs/ocfs2/cluster/tcp.c: remove use of NIPQUAD, use %pI4 Signed-off-by: Joe Perches <joe@perches.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:27 -08:00
Edward Shishkin	f11c9c5c25	vfs: improve writeback_inodes_wb() Do not pin/unpin superblock for every inode in writeback_inodes_wb(), pin it for the whole group of inodes which belong to the same superblock and call writeback_sb_inodes() handler for them. Signed-off-by: Edward Shishkin <edward.shishkin@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-03-12 10:03:42 +01:00
Sachin Prabhu	720e774927	GFS2: Skip check for mandatory locks when unlocking gfs2_lock() will skip locks on file which have mode set to 02666. This is a problem in cases where the mode of the file is changed after a process has obtained a lock on the file. Such a lock will be skipped and will result in a BUG in locks_remove_flock(). gfs2_lock() should skip the check for mandatory locks when unlocking a file. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-03-11 17:17:57 +00:00
Trond Myklebust	bb6fbc4548	NFS: Avoid a deadlock in nfs_release_page J.R. Okajima reports the following deadlock: INFO: task kswapd0:305 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kswapd0 D 0000000000000001 0 305 2 0x00000000 ffff88001f21d4f0 0000000000000046 ffff88001fdea680 ffff88001f21c000 ffff88001f21dfd8 ffff88001f21c000 ffff88001f21dfd8 ffff88001f21dfd8 ffff88001fdea040 0000000000014c00 0000000000000001 ffff88001fdea040 Call Trace: [<ffffffff8146155d>] io_schedule+0x4d/0x70 [<ffffffff810d2be5>] sync_page+0x65/0xa0 [<ffffffff81461b12>] __wait_on_bit_lock+0x52/0xb0 [<ffffffff810d2b80>] ? sync_page+0x0/0xa0 [<ffffffff810d2b64>] __lock_page+0x64/0x70 [<ffffffff81070ce0>] ? wake_bit_function+0x0/0x40 [<ffffffff810df1d4>] truncate_inode_pages_range+0x344/0x4a0 [<ffffffff810df340>] truncate_inode_pages+0x10/0x20 [<ffffffff8112cbfe>] generic_delete_inode+0x15e/0x190 [<ffffffff8112cc8d>] generic_drop_inode+0x5d/0x80 [<ffffffff8112bb88>] iput+0x78/0x80 [<ffffffff811bc908>] nfs_dentry_iput+0x38/0x50 [<ffffffff811285f4>] dentry_iput+0x84/0x110 [<ffffffff811286ae>] d_kill+0x2e/0x60 [<ffffffff8112912a>] dput+0x7a/0x170 [<ffffffff8111e925>] path_put+0x15/0x40 [<ffffffff811c3a44>] __put_nfs_open_context+0xa4/0xb0 [<ffffffff811cb5d0>] ? nfs_free_request+0x0/0x50 [<ffffffff811c3b0b>] put_nfs_open_context+0xb/0x10 [<ffffffff811cb5f9>] nfs_free_request+0x29/0x50 [<ffffffff81234b7e>] kref_put+0x8e/0xe0 [<ffffffff811cb594>] nfs_release_request+0x14/0x20 [<ffffffff811cf769>] nfs_find_and_lock_request+0x89/0xa0 [<ffffffff811d1180>] nfs_wb_page+0x80/0x110 [<ffffffff811c0770>] nfs_release_page+0x70/0x90 [<ffffffff810d18ee>] try_to_release_page+0x5e/0x80 [<ffffffff810e1178>] shrink_page_list+0x638/0x860 [<ffffffff810e19de>] shrink_zone+0x63e/0xc40 We can fix this by making the call to put_nfs_open_context() happen when we actually remove the write request from the inode (which is done by the nfsiod thread in this case). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2010-03-11 09:19:35 -05:00
Benjamin Marzinski	2e95e3f668	GFS2: Allow the number of committed revokes to temporarily be negative GFS2 tracks the number of revokes and unrevokes that are part of committed transactions via sd_log_commited_revoke. It is possible for one process to add revokes during its transaction, while another process unrevokes them during its transaction. If the second process finishes its transaction first, sd_log_commited_revoke will be decremented by the number of unrevokes that the second process did, without first being incremented by the number of revokes the first process did. This is fine, since all started transactions must be completed before the journal can be flushed. However, sd_log_commited_revoke is an unsigned integer, and log_refund() causes an assertion failure if it would go negative at the end of a transaction. This patch makes sd_log_commited_revoke a signed integer and allows it to go negative. __gfs2_log_flush() still checks that it mataches the actual number of revokes. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-03-11 09:50:46 +00:00
Trond Myklebust	b4d2314bb8	NFSv4: Don't ignore the NFS_INO_REVAL_FORCED flag in nfs_revalidate_inode() If the NFS_INO_REVAL_FORCED flag is set, that means that we don't yet have an up to date attribute cache. Even if we hold a delegation, we must put a GETATTR on the wire. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2010-03-10 15:21:44 -05:00
Steve French	ff215713eb	[CIFS] checkpatch cleanup Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-03-09 20:30:42 +00:00
Jeff Layton	abab095d1f	cifs: add cifs_revalidate_file ...to allow updating inode attributes on an existing inode by filehandle. Change mmap and llseek codepaths to use that instead of cifs_revalidate_dentry since they have a filehandle readily available. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-03-09 20:22:53 +00:00
Akinobu Mita	3a065fcf9e	udf: use ext2_find_next_bit Use ext2_find_next_bit (generic_find_next_le_bit) to find the set bit in little endian bitmap region. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2010-03-09 17:15:18 +01:00
Jan Kara	5833ded9b6	udf: Do not read inode before writing it We needlessly read inode in udf_update_inode just before zeroing out the contents of the buffer. Fix it. Signed-off-by: Jan Kara <jack@suse.cz>	2010-03-09 17:15:17 +01:00
Jan Kara	aae917cd18	udf: Fix unalloc space handling in udf_update_inode Writing of inode holding unallocated space info was broken because we first cleared the buffer and after that checked whether it contains a tag meaning the block holds unallocated space information. Fix the problem by checking appropriate in memory flag instead. Also cleanup the function a bit along the way - most importantly lock buffer when modifying its contents, check for buffer_write_io_error instead of !buffer_uptodate, etc.. Signed-off-by: Jan Kara <jack@suse.cz>	2010-03-09 17:15:17 +01:00
Christoph Hellwig	e9edb1d8a3	GFS2: do not select QUOTA gfs2 only needs the quotactl code, not the generic quota implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-03-09 10:08:36 +00:00
Linus Torvalds	51d0f6d1f5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: kfree correct pointer during mount option parsing Btrfs: use RB_ROOT to intialize rb_trees instead of setting rb_node to NULL	2010-03-08 14:07:53 -08:00
Josef Bacik	da495ecc0f	Btrfs: kfree correct pointer during mount option parsing We kstrdup the options string, but then strsep screws with the pointer, so when we kfree() it, we're not giving it the right pointer. Tested-by: Andy Lutomirski <luto@mit.edu> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-08 16:26:50 -05:00
Eric Paris	6bef4d3171	Btrfs: use RB_ROOT to intialize rb_trees instead of setting rb_node to NULL btrfs inialize rb trees in quite a number of places by settin rb_node = NULL; The problem with this is that `17d9ddc72f` in the linux-next tree adds a new field to that struct which needs to be NULL for the new rbtree library code to work properly. This patch uses RB_ROOT as the intializer so all of the relevant fields will be NULL'd. Without the patch I get a panic. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-03-08 16:26:50 -05:00
Steve Dickson	49697ee792	nfs4: Make the v4 callback service hidden To avoid hangs in the svc_unregister(), on version 4 mounts (and unmounts), when rpcbind is not running, make the nfs4 callback program an 'hidden' service by setting the 'vs_hidden' flag in the nfs4_callback_version structure. Signed-off-by: Steve Dickson <steved@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-03-08 14:56:42 -05:00
Dan Carpenter	7dd08a570d	nfs: fix unlikely memory leak I'll admit that it's unlikely for the first allocation to fail and the second one to succeed. I won't be offended if you ignore this patch. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-03-08 14:10:00 -05:00
Linus Torvalds	e10154189f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (62 commits) msi-laptop: depends on RFKILL msi-laptop: Detect 3G device exists by standard ec command msi-laptop: Add resume method for set the SCM load again msi-laptop: Support some MSI 3G netbook that is need load SCM msi-laptop: Add threeg sysfs file for support query 3G state by standard 66/62 ec command msi-laptop: Support standard ec 66/62 command on MSI notebook and nebook Driver core: create lock/unlock functions for struct device sysfs: fix for thinko with sysfs_bin_attr_init() sysfs: Kill unused sysfs_sb variable. sysfs: Pass super_block to sysfs_get_inode driver core: Use sysfs_rename_link in device_rename sysfs: Implement sysfs_rename_link sysfs: Pack sysfs_dirent more tightly. sysfs: Serialize updates to the vfs inode sysfs: windfarm: init sysfs attributes sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on module dynamic attributes sysfs: Document sysfs_attr_init and sysfs_bin_attr_init sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on dynamic attributes sysfs: Use one lockdep class per sysfs attribute. sysfs: Only take active references on attributes. ...	2010-03-08 10:17:20 -08:00
Jiri Kosina	318ae2edc3	Merge branch 'for-next' into for-linus Conflicts: Documentation/filesystems/proc.txt arch/arm/mach-u300/include/mach/debug-macro.S drivers/net/qlge/qlge_ethtool.c drivers/net/qlge/qlge_main.c drivers/net/typhoon.c	2010-03-08 16:55:37 +01:00
Christian Kujau	d4014030d2	FS-Cache: Remove the EXPERIMENTAL flag Remove the EXPERIMENTAL flag from FS-Cache so that Ubuntu can make use of the facility. Signed-off-by: Christian Kujau <lists@nerdbynature.de> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-08 07:32:34 -08:00
Dmitry Monakhov	8bf8c376ab	blkdev: fix merge_bvec_fn return value checks v2 merge_bvec_fn() returns bvec->bv_len on success. So we have to check against this value. But in case of fs_optimization merge we compare with wrong value. This patch must be included in b428cd6da7e6559aca69aa2e3a526037d3f20403 But accidentally i've forgot to add this in the initial patch. To make things straight let's replace all such checks. In fact this makes code easy to understand. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-03-08 09:10:38 +01:00
Eric W. Biederman	0f4288ec6f	sysfs: Kill unused sysfs_sb variable. Now that there are no more users we can remove the sysfs_sb variable. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:52 -08:00
Eric W. Biederman	fac2622bba	sysfs: Pass super_block to sysfs_get_inode Currently sysfs_get_inode magically returns an inode on sysfs_sb. Make the super_block parameter explicit and the code becomes clearer. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:52 -08:00
Eric W. Biederman	7cb32942d9	sysfs: Implement sysfs_rename_link Because of rename ordering problems we occassionally give false warnings about invalid sysfs operations. So using sysfs_rename create a sysfs_rename_link function that doesn't need strange workarounds. Cc: Benjamin Thery <benjamin.thery@bull.net> Cc: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:52 -08:00
Eric W. Biederman	19c38b632d	sysfs: Pack sysfs_dirent more tightly. Placing the 16bit s_mode between a pointer and a long doesn't pack well especailly on 64bit where we wast 48 bits. So move s_mode and declare it as a unsigned short. This is the sysfs backing store after all we don't need fields extra large just in case someday we want userspace to be able to use a larger value. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:52 -08:00
Eric W. Biederman	f8d4f618fe	sysfs: Serialize updates to the vfs inode The vfs depends upon filesystem methods to update the vfs inode. Sysfs adds to the normal number of places where the vfs inode is updated by also updatng the vfs inode in sysfs_refresh_inode. Typically the inode mutex is used to serialize updates to the vfs inode, but grabbing the inode mutex in sysfs_permission and sysfs_getattr causes deadlocks, because sometimes the vfs calls those operations with the inode mutex held. Therefore sysfs can not use the inode mutex to serial updates to the vfs inode. The sysfs_mutex is acquired in all of the routines where sysfs updates the vfs inode, and with a small change we can consistently protext sysfs vfs inode updates with the sysfs_mutex. To protect the sysfs vfs inode updates with the sysfs_mutex simply requires extending the scope of sysfs_mutex in sysfs_setattr over inode_setattr, and over inode_change_ok (so we have an unchanging inode when we perform the check). Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:52 -08:00
Eric W. Biederman	6992f53349	sysfs: Use one lockdep class per sysfs attribute. Acknowledge that the logical sysfs rwsem has one instance per sysfs attribute with different locking depencencies for different attributes. There is a sysfs idiom where writing to one sysfs file causes the addition or removal of other sysfs files. Lumping all of the sysfs attributes together in one lock class causes lockdep to generate lots of false positives. This introduces the requirement that non-static sysfs attributes need to be initialized with sysfs_attr_init or sysfs_bin_attr_init. Strictly speaking this requirement only exists when lockdep is enabled, and when lockdep is enabled we get a bit fat warning if this requirement is not met. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:51 -08:00
Eric W. Biederman	a2db684287	sysfs: Only take active references on attributes. If we exclude directories and symlinks from the set of sysfs dirents where we need active references we are left with sysfs attributes (binary or not). - Tweak sysfs_deactivate to only do something on attributes - Move lockdep initialization into sysfs_file_add_mode to limit it to just attributes. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:51 -08:00
Eric W. Biederman	e72ceb8cca	sysfs: Remove sysfs_get/put_active_two It turns out that holding an active reference on a directory is pointless. The purpose of the active references are to allows us to block when removing sysfs entries that have custom methods so we don't remove modules while running modular code and to keep those custom methods from accessing data structures after the files have been removed. Further sysfs_remove_dir remove all elements in the directory before removing the directory itself, so there is no chance we will remove a directory with active children. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:51 -08:00
Emese Revfy	52cf25d0ab	Driver core: Constify struct sysfs_ops in struct kobj_type Constify struct sysfs_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al. Benefits of this constification: * prevents modification of data that is shared (referenced) by many other structure instances at runtime * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime * potentially better optimized code as the compiler can assume that the const data cannot be changed * the compiler/linker move const data into .rodata and therefore exclude them from false sharing Signed-off-by: Emese Revfy <re.emese@gmail.com> Acked-by: David Teigland <teigland@redhat.com> Acked-by: Matt Domsch <Matt_Domsch@dell.com> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Acked-by: Hans J. Koch <hjk@linutronix.de> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Jens Axboe <jens.axboe@oracle.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:49 -08:00
Emese Revfy	9cd43611cc	kobject: Constify struct kset_uevent_ops Constify struct kset_uevent_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al. Benefits of this constification: * prevents modification of data that is shared (referenced) by many other structure instances at runtime * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime * potentially better optimized code as the compiler can assume that the const data cannot be changed * the compiler/linker move const data into .rodata and therefore exclude them from false sharing Signed-off-by: Emese Revfy <re.emese@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:49 -08:00
Eric W. Biederman	1e5289c97b	sysfs: Cache the last sysfs_dirent to improve readdir scalability v2 When sysfs_readdir stops short we now cache the next sysfs_dirent to return to user space in filp->private_data. There is no impact on the rest of sysfs by doing this and in the common case it allows us to pick up exactly where we left off with no seeking. Additionally I drop and regrab the sysfs_mutex around filldir to avoid a page fault abritrarily increasing the hold time on the sysfs_mutex. v2: Returned to using INT_MAX as the EOF condition. seekdir is ambiguous unless all directory entries have a unique f_pos value. Fixes http://bugzilla.kernel.org/show_bug.cgi?id=14949 Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:48 -08:00
Andi Kleen	1c205ae18d	sysfs: Add sysfs_add/remove_files utility functions Adding/Removing a whole array of attributes is very common. Add a standard utility function to do this with a simple function call, instead of requiring drivers to open code this. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:47 -08:00
Randy Dunlap	138860b953	seq_file: fix new kernel-doc warnings Fix kernel-doc notation in new seq-file functions and correct spelling. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-07 15:48:26 -08:00
Linus Torvalds	b8fa05719b	Revert "lib: build list_sort() only if needed" This reverts commit `a069c266ae`. It turns ou that not only was it missing a case (XFS) that needed it, but perhaps more importantly, people sometimes want to enable new modules that they hadn't had enabled before, and if such a module uses list_sort(), it can't easily be inserted any more. So rather than add a "select LIST_SORT" to the XFS case, just leave it compiled in. It's not all _that_ big, after all, and the inconvenience isn't worth it. Requested-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Don Mullis <don.mullis@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-07 09:54:44 -08:00
Linus Torvalds	66b89159c2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/joern/logfs * git://git.kernel.org/pub/scm/linux/kernel/git/joern/logfs: [LogFS] Change magic number [LogFS] Remove h_version field [LogFS] Check feature flags [LogFS] Only write journal if dirty [LogFS] Fix bdev erases [LogFS] Silence gcc [LogFS] Prevent 64bit divisions in hash_index [LogFS] Plug memory leak on error paths [LogFS] Add MAINTAINERS entry [LogFS] add new flash file system Fixed up trivial conflict in lib/Kconfig, and a semantic conflict in fs/logfs/inode.c introduced by write_inode() being changed to use writeback_control' by commit `a9185b41a4` ("pass writeback_control to ->write_inode")	2010-03-06 13:18:03 -08:00
Linus Torvalds	66ce3cf84d	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (21 commits) xfs: return inode fork offset in bulkstat for fsr xfs: Increase the default size of the reserved blocks pool xfs: truncate delalloc extents when IO fails in writeback xfs: check for more work before sleeping in xfssyncd xfs: Fix a build warning in xfs_aops.c xfs: fix locking for inode cache radix tree tag updates xfs: remove xfs_ipin/xfs_iunpin xfs: cleanup xfs_iunpin_wait/xfs_iunpin_nowait xfs: kill xfs_lrw.h xfs: factor common xfs_trans_bjoin code xfs: stop passing opaque handles to xfs_log.c routines xfs: split xfs_bmap_btalloc xfs: fix xfs_fsblock_t tracing xfs: fix inode pincount check in fsync xfs: Non-blocking inode locking in IO completion xfs: implement optimized fdatasync xfs: remove wrapper for the fsync file operation xfs: remove wrappers for read/write file operations xfs: merge xfs_lrw.c into xfs_file.c xfs: fix dquota trace format ...	2010-03-06 11:32:21 -08:00
Linus Torvalds	05c5cb31ec	Merge branch 'for-2.6.34' of git://linux-nfs.org/~bfields/linux * 'for-2.6.34' of git://linux-nfs.org/~bfields/linux: (22 commits) nfsd4: fix minor memory leak svcrpc: treat uid's as unsigned nfsd: ensure sockets are closed on error Revert "sunrpc: move the close processing after do recvfrom method" Revert "sunrpc: fix peername failed on closed listener" sunrpc: remove unnecessary svc_xprt_put NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN xfs_export_operations.commit_metadata commit_metadata export operation replacing nfsd_sync_dir lockd: don't clear sm_monitored on nsm_reboot_lookup lockd: release reference to nsm_handle in nlm_host_rebooted nfsd: Use vfs_fsync_range() in nfsd_commit NFSD: Create PF_INET6 listener in write_ports SUNRPC: NFS kernel APIs shouldn't return ENOENT for "transport not found" SUNRPC: Bury "#ifdef IPV6" in svc_create_xprt() NFSD: Support AF_INET6 in svc_addsock() function SUNRPC: Use rpc_pton() in ip_map_parse() nfsd: 4.1 has an rfc number nfsd41: Create the recovery entry for the NFSv4.1 client nfsd: use vfs_fsync for non-directories ...	2010-03-06 11:31:38 -08:00
Neil Horman	76595f79d7	coredump: suppress uid comparison test if core output files are pipes Modify uid check in do_coredump so as to not apply it in the case of pipes. This just got noticed in testing. The end of do_coredump validates the uid of the inode for the created file against the uid of the crashing process to ensure that no one can pre-create a core file with different ownership and grab the information contained in the core when they shouldn' tbe able to. This causes failures when using pipes for a core dumps if the crashing process is not root, which is the uid of the pipe when it is created. The fix is simple. Since the check for matching uid's isn't relevant for pipes (a process can't create a pipe that the uermodehelper code will open anyway), we can just just skip it in the event ispipe is non-zero Reverts a pipe-affecting change which was accidentally made in : commit `c46f739dd3` : Author: Ingo Molnar <mingo@elte.hu> : AuthorDate: Wed Nov 28 13:59:18 2007 +0100 : Commit: Linus Torvalds <torvalds@woody.linux-foundation.org> : CommitDate: Wed Nov 28 10:58:01 2007 -0800 : : vfs: coredumping fix Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:46 -08:00
Oleg Nesterov	5c99cbf49a	coredump: set ->group_exit_code for other CLONE_VM tasks too User visible change. do_coredump() kills all threads which share the same ->mm but only the coredumping process gets the proper exit_code. Other tasks which share the same ->mm die "silently" and return status == 0 to parent. This is historical behaviour, not actually a bug. But I think Frank Heckenbach rightly dislikes the current behaviour. Simple test-case: #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/wait.h> int main(void) { int stat; if (!fork()) { if (!vfork()) kill(getpid(), SIGQUIT); } wait(&stat); printf("stat=%x\n", stat); return 0; } Before this patch it prints "stat=0" despite the fact the child was killed by SIGQUIT. After this patch the output is "stat=3" which obviously makes more sense. Even with this patch, only the task which originates the coredumping gets "\|= 0x80" if the core was actually dumped, but at least the coredumping signal is visible to do_wait/etc. Reported-by: Frank Heckenbach <f.heckenbach@fh-soft.de> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Roland McGrath <roland@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:46 -08:00
Masami Hiramatsu	30736a4d43	coredump: pass mm->flags as a coredump parameter for consistency Pass mm->flags as a coredump parameter for consistency. --- 1787 if (mm->core_state \|\| !get_dumpable(mm)) { <- (1) 1788 up_write(&mm->mmap_sem); 1789 put_cred(cred); 1790 goto fail; 1791 } 1792 [...] 1798 if (get_dumpable(mm) == 2) { /* Setuid core dump mode / <-(2) 1799 flag = O_EXCL; / Stop rewrite attacks / 1800 cred->fsuid = 0; / Dump root private / 1801 } --- Since dumpable bits are not protected by lock, there is a chance to change these bits between (1) and (2). To solve this issue, this patch copies mm->flags to coredump_params.mm_flags at the beginning of do_coredump() and uses it instead of get_dumpable() while dumping core. This copy is also passed to binfmt->core_dump, since elf_core_dump() uses dump_filter bits in mm->flags. [akpm@linux-foundation.org: fix merge] Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:46 -08:00
Daisuke HATAYAMA	8d9032bbe4	elf coredump: add extended numbering support The current ELF dumper implementation can produce broken corefiles if program headers exceed 65535. This number is determined by the number of vmas which the process have. In particular, some extreme programs may use more than 65535 vmas. (If you google max_map_count, you can find some users facing this problem.) This kind of program never be able to generate correct coredumps. This patch implements ``extended numbering'' that uses sh_info field of the first section header instead of e_phnum field in order to represent upto 4294967295 vmas. This is supported by AMD64-ABI(http://www.x86-64.org/documentation.html) and Solaris(http://docs.sun.com/app/docs/doc/817-1984/). Of course, we are preparing patches for gdb and binutils. Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:46 -08:00
Daisuke HATAYAMA	93eb211e6c	elf coredump: make offset calculation process and writing process explicit By the next patch, elf_core_dump() and elf_fdpic_core_dump() will support extended numbering and so will produce the corefiles with section header table in a special case. The problem is the process of writing a file header offset of the section header table into e_shoff field of the ELF header. ELF header is positioned at the beginning of the corefile, while section header at the end. So, we need to take which of the following ways: 1. Seek backward to retry writing operation for ELF header after writing process for a whole part 2. Make offset calculation process and writing process totally sequential The clause 1. is not always possible: one cannot assume that file system supports seek function. Consider the no_llseek case. Therefore, this patch adopts the clause 2. Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:46 -08:00
Daisuke HATAYAMA	1fcccbac89	elf coredump: replace ELF_CORE_EXTRA_* macros by functions elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding macro for hiding _multiline_ logics in functions. This patch removes #ifdef and replaces ELF_CORE_EXTRA_* by corresponding functions. For architectures not implemeonting ELF_CORE_EXTRA_*, we use weak functions in order to reduce a range of modification. This cleanup is for my next patches, but I think this cleanup itself is worth doing regardless of my firnal purpose. Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:45 -08:00
Daisuke HATAYAMA	088e7af73a	coredump: move dump_write() and dump_seek() into a header file My next patch will replace ELF_CORE_EXTRA_* macros by functions, putting them into other newly created .c files. Then, each files will contain dump_write(), where each pair of binfmt_.c and elfcore.c should be the same. So, this patch moves them into a header file with dump_seek(). Also, the patch deletes confusing DUMP_WRITE macros in each files. Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:45 -08:00
Daisuke HATAYAMA	05f47fda9f	coredump: unify dump_seek() implementations for each binfmt_.c The current ELF dumper can produce broken corefiles if program headers exceed 65535. In particular, the program in 64-bit environment often demands more than 65535 mmaps. If you google max_map_count, then you can find many users facing this problem. Solaris has already dealt with this issue, and other OSes have also adopted the same method as in Solaris. Currently, Sun's document and AMD 64 ABI include the description for the extension, where they call the extension Extended Numbering. See Reference for further information. I believe that linux kernel should adopt the same way as they did, so I've written this patch. I am also preparing for patches of GDB and binutils. How to fix ========== In new dumping process, there are two cases according to weather or not the number of program headers is equal to or more than 65535. - if less than 65535, the produced corefile format is exactly the same as the ordinary one. - if equal to or more than 65535, then e_phnum field is set to newly introduced constant PN_XNUM(0xffff) and the actual number of program headers is set to sh_info field of the section header at index 0. Compatibility Concern ===================== As already mentioned in Summary, Sun and AMD64 has already adopted this. See Reference. * There are four combinations according to whether kernel and userland tools are respectively modified or not. The next table summarizes shortly for each combination. --------------------------------------------- Original Kernel \| Modified Kernel --------------------------------------------- < 65535 \| >= 65535 \| < 65535 \| >= 65535 ------------------------------------------------------------- Original Tools \| OK \| broken \| OK \| broken (#) ------------------------------------------------------------- Modified Tools \| OK \| broken \| OK \| OK ------------------------------------------------------------- Note that there is no case that `OK' changes to `broken'. (#) Although this case remains broken, O-M behaves better than O-O. That is, while in O-O case e_phnum field would be extremely small due to integer overflow, in O-M case it is guaranteed to be at least 65535 by being set to PN_XNUM(0xFFFF), much closer to the actual correct value than the O-O case. Test Program ============ Here is a test program mkmmaps.c that is useful to produce the corefile with many mmaps. To use this, please take the following steps: $ ulimit -c unlimited $ sysctl vm.max_map_count=70000 # default 65530 is too small $ sysctl fs.file-max=70000 $ mkmmaps 65535 Then, the program will abort and a corefile will be generated. If failed, there are two cases according to the error message displayed. * ``out of memory'' means vm.max_map_count is still smaller * ``too many open files'' means fs.file-max is still smaller So, please change it to a larger value, and then retry it. mkmmaps.c == #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #include <fcntl.h> #include <unistd.h> int main(int argc, char *argv) { int maps_num; if (argc < 2) { fprintf(stderr, "mkmmaps [number of maps to be created]\n"); exit(1); } if (sscanf(argv[1], "%d", &maps_num) == EOF) { perror("sscanf"); exit(2); } if (maps_num < 0) { fprintf(stderr, "%d is invalid\n", maps_num); exit(3); } for (; maps_num > 0; --maps_num) { if (MAP_FAILED == mmap((void )NULL, (size_t) 1, PROT_READ, MAP_SHARED \| MAP_ANONYMOUS, (int) -1, (off_t) NULL)) { perror("mmap"); exit(4); } } abort(); { char buffer[128]; sprintf(buffer, "wc -l /proc/%u/maps", getpid()); system(buffer); } return 0; } Tested on i386, ia64 and um/sys-i386. Built on sh4 (which covers fs/binfmt_elf_fdpic.c) References ========== - Sun microsystems: Linker and Libraries. Part No: 817-1984-17, September 2008. URL: http://docs.sun.com/app/docs/doc/817-1984 - System V ABI AMD64 Architecture Processor Supplement Draft Version 0.99., May 11, 2009. URL: http://www.x86-64.org/ This patch: There are three different definitions for dump_seek() functions in binfmt_aout.c, binfmt_elf.c and binfmt_elf_fdpic.c, respectively. The only for binfmt_elf.c. My next patch will move dump_seek() into a header file in order to share the same implementations for dump_write() and dump_seek(). As the first step, this patch unify these three definitions for dump_seek() by applying the past commits that have been applied only for binfmt_elf.c. Specifically, the modification made here is part of the following commits: * `d025c9db7f` * `7f14daa19e` This patch does not change a shape of corefiles. Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:45 -08:00
Alexey Dobriyan	12bac0d9f4	proc: warn on non-existing proc entries * warn if creation goes on to non-existent directory * warn if removal goes on from non-existing directory * warn if non-existing proc entry is removed Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:45 -08:00
Alexey Dobriyan	e17a5765f2	proc: do translation + unlink atomically at remove_proc_entry() remove_proc_entry() does lock lookup parent unlock lock unlink proc entry from lists unlock which can be made bit more correct by doing parent translation + unlink without dropping lock. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:45 -08:00
Andrew Morton	45bf5cd7be	fs/compat_ioctl.c: suppress two warnings fs/compat_ioctl.c: In function 'do_ioctl_trans': fs/compat_ioctl.c:534: warning: 'karg' may be used uninitialized in this function fs/compat_ioctl.c:533: warning: 'kcmd' may be used uninitialized in this function fs/compat_ioctl.c:656: warning: 'ret' may be used uninitialized in this function Reduces text size by 44 bytes. If someone calls one of these functions with an unexpected argument, the code's buggy as-is. Amerigo Wang <amwang@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:35 -08:00
Don Mullis	a069c266ae	lib: build list_sort() only if needed Build list_sort() only for configs that need it -- those that don't save ~581 bytes (i386). Signed-off-by: Don Mullis <don.mullis@gmail.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:35 -08:00
Michael Neuling	5ef097dd7b	exec: create initial stack independent of PAGE_SIZE Currently we create the initial stack based on the PAGE_SIZE. This is unnecessary. This creates this initial stack independent of the PAGE_SIZE. It also bumps up the number of 4k pages allocated from 20 to 32, to align with 64K page systems. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: Helge Deller <deller@gmx.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Cc: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:33 -08:00
Jiri Slaby	d554ed895d	fs: use rlimit helpers Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in commit `3e10e716ab` ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:29 -08:00
Rik van Riel	5beb493052	mm: change anon_vma linking to fix multi-process server scalability issue The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [akpm@linux-foundation.org: cleanups] Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:26 -08:00
Wu Fengguang	42e4960868	vfs: take f_lock on modifying f_mode after open time We'll introduce FMODE_RANDOM which will be runtime modified. So protect all runtime modification to f_mode with f_lock to avoid races. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@kernel.org> [2.6.33.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:25 -08:00
KAMEZAWA Hiroyuki	b084d4353f	mm: count swap usage A frequent questions from users about memory management is what numbers of swap ents are user for processes. And this information will give some hints to oom-killer. Besides we can count the number of swapents per a process by scanning /proc/<pid>/smaps, this is very slow and not good for usual process information handler which works like 'ps' or 'top'. (ps or top is now enough slow..) This patch adds a counter of swapents to mm_counter and update is at each swap events. Information is exported via /proc/<pid>/status file as [kamezawa@bluextal memory]$ cat /proc/self/status Name: cat State: R (running) Tgid: 2910 Pid: 2910 PPid: 2823 TracerPid: 0 Uid: 500 500 500 500 Gid: 500 500 500 500 FDSize: 256 Groups: 500 VmPeak: 82696 kB VmSize: 82696 kB VmLck: 0 kB VmHWM: 432 kB VmRSS: 432 kB VmData: 172 kB VmStk: 84 kB VmExe: 48 kB VmLib: 1568 kB VmPTE: 40 kB VmSwap: 0 kB <=============== this. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:24 -08:00
KAMEZAWA Hiroyuki	34e55232e5	mm: avoid false sharing of mm_counter Considering the nature of per mm stats, it's the shared object among threads and can be a cache-miss point in the page fault path. This patch adds per-thread cache for mm_counter. RSS value will be counted into a struct in task_struct and synchronized with mm's one at events. Now, in this patch, the event is the number of calls to handle_mm_fault. Per-thread value is added to mm at each 64 calls. rough estimation with small benchmark on parallel thread (2threads) shows [before] 4.5 cache-miss/faults [after] 4.0 cache-miss/faults Anyway, the most contended object is mmap_sem if the number of threads grows. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:24 -08:00
KAMEZAWA Hiroyuki	d559db086f	mm: clean up mm_counter Presently, per-mm statistics counter is defined by macro in sched.h This patch modifies it to - defined in mm.h as inlinf functions - use array instead of macro's name creation. This patch is for reducing patch size in future patch to modify implementation of per-mm counter. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:23 -08:00
Akinobu Mita	984b3f5746	bitops: rename for_each_bit() to for_each_set_bit() Rename for_each_bit to for_each_set_bit in the kernel source tree. To permit for_each_clear_bit(), should that ever be added. The patch includes a macro to map the old for_each_bit() onto the new for_each_set_bit(). This is a (very) temporary thing to ease the migration. [akpm@linux-foundation.org: add temporary for_each_bit()] Suggested-by: Alexey Dobriyan <adobriyan@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:23 -08:00
Al Viro	781b16775b	Fix a dumb typo - use of & instead of && We managed to lose O_DIRECTORY testing due to a stupid typo in commit `1f36f774b2` ("Switch !O_CREAT case to use of do_last()") Reported-by: Walter Sheets <w41ter@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 10:54:48 -08:00
Joern Engel	c2f843f03d	[LogFS] Change magic number Many changes were made during development that could result in old versions of mklogfs and the kernel code being subtly incompatible. Not being a friend of subtleties, I hereby change the magic number. Any old version of mklogfs is now guaranteed to fail.	2010-03-06 10:03:11 +01:00
Joern Engel	9cf05b416d	[LogFS] Remove h_version field Incompatible change: h_compr is moved up so the padding is all in one chunk.	2010-03-06 10:01:46 +01:00
Jeff Layton	c8634fd311	cifs: add a CIFSSMBUnixQFileInfo function ...to allow us to get unix attrs via filehandle. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-03-06 04:38:25 +00:00
Jeff Layton	bcd5357f43	cifs: add a CIFSSMBQFileInfo function ...to get inode attributes via filehandle instead of by path. In some places, we need to revalidate an inode on an open filehandle, but we can't necessarily guarantee that the dentry associated with it will still be valid. When we have an open filehandle already, it makes more sense to do a filehandle based operation anyway. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-03-06 04:38:02 +00:00
Jeff Layton	df2cf170c8	cifs: overhaul cifs_revalidate and rename to cifs_revalidate_dentry cifs_revalidate is renamed to cifs_revalidate_dentry as a later patch will add a by-filehandle variant. Add a new "invalid_mapping" flag to the cifsInodeInfo that indicates that the pagecache is considered invalid. Add a new routine to check inode attributes whenever they're updated and set that flag if the inode has changed on the server. cifs_revalidate_dentry is then changed to just update the attrcache if needed and then to zap the pagecache if it's not valid. There are some other behavior changes in here as well. Open files are now allowed to have their caches invalidated. I see no reason why we'd want to keep stale data around just because a file is open. Also, cifs_revalidate_cache uses the server_eof for revalidating the file size since that should more closely match the size of the file on the server. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-03-06 04:37:05 +00:00
Stephen Rothwell	f1a3d57213	ceph: update for write_inode API change Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-05 14:49:41 -08:00
Linus Torvalds	cc7889ff5e	Merge branch 'nfs-for-2.6.34' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'nfs-for-2.6.34' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (44 commits) NFS: Remove requirement for inode->i_mutex from nfs_invalidate_mapping NFS: Clean up nfs_sync_mapping NFS: Simplify nfs_wb_page() NFS: Replace __nfs_write_mapping with sync_inode() NFS: Simplify nfs_wb_page_cancel() NFS: Ensure inode is always marked I_DIRTY_DATASYNC, if it has unstable pages NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set NFS: Reduce the number of unnecessary COMMIT calls NFS: Add a count of the number of unstable writes carried by an inode NFS: Cleanup - move nfs_write_inode() into fs/nfs/write.c nfs41 fix NFS4ERR_CLID_INUSE for exchange id NFS: Fix an allocation-under-spinlock bug SUNRPC: Handle EINVAL error returns from the TCP connect operation NFSv4.1: Various fixes to the sequence flag error handling nfs4: renewd renew operations should take/put a client reference nfs41: renewd sequence operations should take/put client reference nfs: prevent backlogging of renewd requests nfs: kill renewd before clearing client minor version NFS: Make close(2) asynchronous when closing NFS O_DIRECT files NFS: Improve NFS iostat byte count accuracy for writes ...	2010-03-05 13:25:45 -08:00
Linus Torvalds	b13d3c6e8a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: fs/9p: Add hardlink support to .u extension 9P2010.L handshake: .L protocol negotiation 9P2010.L handshake: Remove "dotu" variable 9P2010.L handshake: Add mount option 9P2010.L handshake: Add VFS flags net/9p: Handle mount errors correctly. net/9p: Remove MAX_9P_CHAN limit net/9p: Add multi channel support.	2010-03-05 13:25:24 -08:00

1 2 3 4 5 ...

17466 Коммитов