WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Josef Bacik	27a377db74	Btrfs: don't loop forever if we can't run because of the tree mod log A user reported a 100% cpu hang with my new delayed ref code. Turns out I forgot to increase the count check when we can't run a delayed ref because of the tree mod log. If we can't run any delayed refs during this there is no point in continuing to look, and we need to break out. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-08 17:57:15 -08:00
David Sterba	8051aa1a3d	btrfs: reserve no transaction units in btrfs_ioctl_set_features Added in patch "btrfs: add ioctls to query/change feature bits online" modifications to superblock don't need to reserve metadata blocks when starting a transaction. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-08 17:57:15 -08:00
Jeff Mahoney	d0270aca88	btrfs: commit transaction after setting label and features The set_fslabel ioctl uses btrfs_end_transaction, which means it's possible that the change will be lost if the system crashes, same for the newly set features. Let's use btrfs_commit_transaction instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-08 17:57:15 -08:00
Josef Bacik	6cc98d90f8	Btrfs: fix assert screwup for the pending move stuff Wang noticed that he was failing btrfs/030 even though me and Filipe couldn't reproduce. Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set, which meant that a key part of Filipe's original patch was not being built in. This appears to be a mess up with merging Filipe's patch as it does not exist in his original patch. Fix this by changing how we make sure del_waiting_dir_move asserts that it did not error and take the function out of the ifdef check. This makes btrfs/030 pass with the assert on or off. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-08 17:57:15 -08:00
Linus Torvalds	ec2e6cb24a	Fix regression -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAABAgAGBQJS9mGFAAoJEDaohF61QIxkMhMP/0657lE8Wn+thff6kX7mNs1v Ctxc6V2723t2D4xGkIuOrYMqsziRkUeXKZOQTKvRYes95RkA1q3GUvv+xDwP1kVI f6Mqg3TZOiCTm/3z8uJUlYq1q6Hjp8E8QWg3Qs9sFEYPh7WFo2BWiDLjIA2iKkr8 YJx7jiRE2lzWw2DPBSY7cv9IUuJkgaIRMaMI477adU0m/BvmG73waG9NAo0sjmdK RPTC4/j2p8+PooPzXS7dHUkbxbTnqSIGuXEu3XED1Hsj2vqzIbi3zw0cvRuK02D4 LpsKIcyaEG6ApxK7RdZJeauSmpok4s/sPWwJIqn+1X4foQF28AwCZY0HIe1q/W25 +i99soVoX04Uby3l02PKKlHZLAfQo0GTl/3itpG2nV84SB7i2T2Quwi8X7q6iqrK +2r2Ro7upTGmRZqyCTyZX+/rEFLMGCdSVUmPnz+NWISmTTIvaXg82A7eTJtvWdXv TFjPij++s4fJEH8D4RmpgDlCBlCoqrZvl2tR1YWdKyDOpFMcBEuVJ8mNGSadJT8v ekLejc3CVDyLj6xscwFmWXTQEwVOKtoSHBlpFqSIBORJnOcN+59Vb3EKv9IhD/O5 oLdcDuBSPUsOw+PMe5n4xjjr7bMrIuf4PiIWiOPJ8Y1dlOmGZKiEWyS4devZM4KG PyhJKEO/JUw3FLT9Ni3M =QFjr -----END PGP SIGNATURE----- Merge tag 'jfs-3.14-rc2' of git://github.com/kleikamp/linux-shaggy Pull jfs fix from David Kleikamp: "Fix regression" * tag 'jfs-3.14-rc2' of git://github.com/kleikamp/linux-shaggy: jfs: fix generic posix ACL regression	2014-02-08 10:13:47 -08:00
Dave Kleikamp	c18f7b5120	jfs: fix generic posix ACL regression I missed a couple errors in reviewing the patches converting jfs to use the generic posix ACL function. Setting ACL's currently fails with -EOPNOTSUPP. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Reported-by: Michael L. Semon <mlsemon35@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2014-02-08 10:50:58 -06:00
Steve French	4a5c80d7b5	[CIFS] clean up page array when uncached write send fails In the event that a send fails in an uncached write, or we end up needing to reissue it (-EAGAIN case), we'll kfree the wdata but the pages currently leak. Fix this by adding a new kref release routine for uncached writedata that releases the pages, and have the uncached codepaths use that. [original patch by Jeff modified to fix minor formatting problems] Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2014-02-07 20:47:00 -06:00
Jeff Layton	26c8f0d601	cifs: use a flexarray in cifs_writedata The cifs_writedata code uses a single element trailing array, which just adds unneeded complexity. Use a flexarray instead. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2014-02-07 20:38:29 -06:00
Tejun Heo	ba341d55a4	kernfs: add CONFIG_KERNFS As sysfs was kernfs's only user, kernfs has been piggybacking on CONFIG_SYSFS; however, kernfs is scheduled to grow a new user very soon. Introduce a separate config option CONFIG_KERNFS which is to be selected by kernfs users. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 16:08:57 -08:00
Tejun Heo	3eef34ad7d	kernfs: implement kernfs_get_parent(), kernfs_name/path() and friends kernfs_node->parent and ->name are currently marked as "published" indicating that kernfs users may access them directly; however, those fields may get updated by kernfs_rename[_ns]() and unrestricted access may lead to erroneous values or oops. Protect ->parent and ->name updates with a irq-safe spinlock kernfs_rename_lock and implement the following accessors for these fields. * kernfs_name() - format the node's name into the specified buffer * kernfs_path() - format the node's path into the specified buffer * pr_cont_kernfs_name() - pr_cont a node's name (doesn't need buffer) * pr_cont_kernfs_path() - pr_cont a node's path (doesn't need buffer) * kernfs_get_parent() - pin and return a node's parent All can be called under any context. The recursive sysfs_pathname() in fs/sysfs/dir.c is replaced with kernfs_path() and sysfs_rename_dir_ns() is updated to use kernfs_get_parent() instead of dereferencing parent directly. v2: Dummy definition of kernfs_path() for !CONFIG_KERNFS was missing static inline making it cause a lot of build warnings. Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 16:05:35 -08:00
Tejun Heo	0c23b2259a	kernfs: implement kernfs_node_from_dentry(), kernfs_root_from_sb() and kernfs_rename() Implement helpers to determine node from dentry and root from super_block. Also add a kernfs_rename_ns() wrapper which assumes NULL namespace. These generally make sense and will be used by cgroup. v2: Some dummy implementations for !CONFIG_SYSFS was missing. Fixed. Reported by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 16:00:41 -08:00
Tejun Heo	4d3773c4bb	kernfs: implement kernfs_ops->atomic_write_len A write to a kernfs_node is buffered through a kernel buffer. Writes <= PAGE_SIZE are performed atomically, while larger ones are executed in PAGE_SIZE chunks. While this is enough for sysfs, cgroup which is scheduled to be converted to use kernfs needs a bit more control over it. This patch adds kernfs_ops->atomic_write_len. If not set (zero), the behavior stays the same. If set, writes upto the size are executed atomically and larger writes are rejected with -E2BIG. A different implementation strategy would be allowing configuring chunking size while making the original write size available to the write method; however, such strategy, while being more complicated, doesn't really buy anything. If the write implementation has to handle chunking, the specific chunk size shouldn't matter all that much. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:52:48 -08:00
Tejun Heo	d35258ef70	kernfs: allow nodes to be created in the deactivated state Currently, kernfs_nodes are made visible to userland on creation, which makes it difficult for kernfs users to atomically succeed or fail creation of multiple nodes. In addition, if something fails after creating some nodes, the created nodes might already be in use and their active refs need to be drained for removal, which has the potential to introduce tricky reverse locking dependency on active_ref depending on how the error path is synchronized. This patch introduces per-root flag KERNFS_ROOT_CREATE_DEACTIVATED. If set, all nodes under the root are created in the deactivated state and stay invisible to userland until explicitly enabled by the new kernfs_activate() API. Also, nodes which have never been activated are guaranteed to bypass draining on removal thus allowing error paths to not worry about lockding dependency on active_ref draining. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:52:48 -08:00
Tejun Heo	b9c9dad0c4	kernfs: add missing kernfs_active() checks in directory operations kernfs_iop_lookup(), kernfs_dir_pos() and kernfs_dir_next_pos() were missing kernfs_active() tests before using the found kernfs_node. As deactivated state is currently visible only while a node is being removed, this doesn't pose an actual problem. e.g. lookup succeeding on a deactivated node doesn't harm anything as the eventual file operations are gonna fail and those failures are indistinguishible from the cases in which the lookups had happened before the node was deactivated. However, we're gonna allow new nodes to be created deactivated and then activated explicitly by the kernfs user when it sees fit. This is to support atomically making multiple nodes visible to userland and thus those nodes must not be visible to userland before activated. Let's plug the lookup and readdir holes so that deactivated nodes are invisible to userland. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:52:48 -08:00
Tejun Heo	6a7fed4eef	kernfs: implement kernfs_syscall_ops->remount_fs() and ->show_options() Add two super_block related syscall callbacks ->remount_fs() and ->show_options() to kernfs_syscall_ops. These simply forward the matching super_operations. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:52:48 -08:00
Tejun Heo	90c07c895c	kernfs: rename kernfs_dir_ops to kernfs_syscall_ops We're gonna need non-dir syscall callbacks, which will make dir_ops a misnomer. Let's rename kernfs_dir_ops to kernfs_syscall_ops. This is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:52:48 -08:00
Tejun Heo	07c7530dd4	kernfs: invoke dir_ops while holding active ref of the target node kernfs_dir_ops are currently being invoked without any active reference, which makes it tricky for the invoked operations to determine whether the objects associated those nodes are safe to access and will remain that way for the duration of such operations. kernfs already has active_ref mechanism to deal with this which makes the removal of a given node the synchronization point for gating the file operations. There's no reason for dir_ops to be any different. Update the dir_ops handling so that active_ref is held while the dir_ops are executing. This guarantees that while a dir_ops is executing the target nodes stay alive. As kernfs_dir_ops doesn't have any in-kernel user at this point, this doesn't affect anybody. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:52:48 -08:00
Tejun Heo	ce8b04aa6c	sysfs, driver-core: remove unused {sysfs\|device}_schedule_callback_owner() All device_schedule_callback_owner() users are converted to use device_remove_file_self(). Remove now unused {sysfs\|device}_schedule_callback_owner(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:42:41 -08:00
Tejun Heo	6b0afc2a21	kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers Sometimes it's necessary to implement a node which wants to delete nodes including itself. This isn't straightforward because of kernfs active reference. While a file operation is in progress, an active reference is held and kernfs_remove() waits for all such references to drain before completing. For a self-deleting node, this is a deadlock as kernfs_remove() ends up waiting for an active reference that itself is sitting on top of. This currently is worked around in the sysfs layer using sysfs_schedule_callback() which makes such removals asynchronous. While it works, it's rather cumbersome and inherently breaks synchronicity of the operation - the file operation which triggered the operation may complete before the removal is finished (or even started) and the removal may fail asynchronously. If a removal operation is immmediately followed by another operation which expects the specific name to be available (e.g. removal followed by rename onto the same name), there's no way to make the latter operation reliable. The thing is there's no inherent reason for this to be asynchrnous. All that's necessary to do this synchronous is a dedicated operation which drops its own active ref and deactivates self. This patch implements kernfs_remove_self() and its wrappers in sysfs and driver core. kernfs_remove_self() is to be called from one of the file operations, drops the active ref the task is holding, removes the self node, and restores active ref to the dead node so that the ref is balanced afterwards. __kernfs_remove() is updated so that it takes an early exit if the target node is already fully removed so that the active ref restored by kernfs_remove_self() after removal doesn't confuse the deactivation path. This makes implementing self-deleting nodes very easy. The normal removal path doesn't even need to be changed to use kernfs_remove_self() for the self-deleting node. The method can invoke kernfs_remove_self() on itself before proceeding the normal removal path. kernfs_remove() invoked on the node by the normal deletion path will simply be ignored. This will replace sysfs_schedule_callback(). A subtle feature of sysfs_schedule_callback() is that it collapses multiple invocations - even if multiple removals are triggered, the removal callback is run only once. An equivalent effect can be achieved by testing the return value of kernfs_remove_self() - only the one which gets %true return value should proceed with actual deletion. All other instances of kernfs_remove_self() will wait till the enclosing kernfs operation which invoked the winning instance of kernfs_remove_self() finishes and then return %false. This trivially makes all users of kernfs_remove_self() automatically show correct synchronous behavior even when there are multiple concurrent operations - all "echo 1 > delete" instances will finish only after the whole operation is completed by one of the instances. Note that manipulation of active ref is implemented in separate public functions - kernfs_[un]break_active_protection(). kernfs_remove_self() is the only user at the moment but this will be used to cater to more complex cases. v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing and sysfs_remove_file_self() had incorrect return type. Fix it. Reported by kbuild test bot. v3: kernfs_[un]break_active_protection() separated out from kernfs_remove_self() and exposed as public API. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:42:41 -08:00
Tejun Heo	81c173cb5e	kernfs: remove KERNFS_REMOVED KERNFS_REMOVED is used to mark half-initialized and dying nodes so that they don't show up in lookups and deny adding new nodes under or renaming it; however, its role overlaps that of deactivation. It's necessary to deny addition of new children while removal is in progress; however, this role considerably intersects with deactivation - KERNFS_REMOVED prevents new children while deactivation prevents new file operations. There's no reason to have them separate making things more complex than necessary. This patch removes KERNFS_REMOVED. * Instead of KERNFS_REMOVED, each node now starts its life deactivated. This means that we now use both atomic_add() and atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler generates an overflow warnings when negating INT_MIN as the negation can't be represented as a positive number. Nothing is actually broken but let's bump BIAS by one to avoid the warnings for archs which negates the subtrahend.. * A new helper kernfs_active() which tests whether kn->active >= 0 is added for convenience and lockdep annotation. All KERNFS_REMOVED tests are replaced with negated kernfs_active() tests. * __kernfs_remove() is updated to deactivate, but not drain, all nodes in the subtree instead of setting KERNFS_REMOVED. This removes deactivation from kernfs_deactivate(), which is now renamed to kernfs_drain(). * Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with checks on the active ref. * Some comment style updates in the affected area. v2: Reordered before removal path restructuring. kernfs_active() dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE() used in the lookup paths. v3: Reverted most of v2 except for creating a new node with KN_DEACTIVATED_BIAS. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:42:41 -08:00
Tejun Heo	182fd64b66	kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep() There currently are two mechanisms gating active ref lockdep annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask. The former disables lockdep annotations in kernfs_get/put_active() while the latter disables all of kernfs_deactivate(). While KERNFS_ACTIVE_REF also behaves as an optimization to skip the deactivation step for non-file nodes, the benefit is marginal and it needlessly diverges code paths. Let's drop KERNFS_ACTIVE_REF. While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP flag so that it's more convenient and the related code can be compiled out when not enabled. v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag"). As the earlier patch already added KERNFS_LOCKDEP tests to kernfs_deactivate(), those additions are dropped from this patch and the existing ones are simply converted to kernfs_lockdep(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:42:40 -08:00
Tejun Heo	988cd7afb3	kernfs: remove kernfs_addrm_cxt kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were added because there were operations which should be performed outside kernfs_mutex after adding and removing kernfs_nodes. The necessary operations were recorded in kernfs_addrm_cxt and performed by kernfs_addrm_finish(); however, after the recent changes which relocated deactivation and unmapping so that they're performed directly during removal, the only operation kernfs_addrm_finish() performs is kernfs_put(), which can be moved inside the removal path too. This patch moves the kernfs_put() of the base ref to __kernfs_remove() and remove kernfs_addrm_cxt and kernfs_addrm_start/finish(). * kernfs_add_one() is updated to grab and release kernfs_mutex itself. sysfs_addrm_start/finish() invocations around it are removed from all users. * __kernfs_remove() puts an unlinked node directly instead of chaining it to kernfs_addrm_cxt. Its callers are updated to grab and release kernfs_mutex instead of calling kernfs_addrm_start/finish() around it. v2: Rebased on top of "kernfs: associate a new kernfs_node with its parent on creation" which dropped @parent from kernfs_add_one(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:42:40 -08:00
Tejun Heo	ccf02aaf81	kernfs: invoke kernfs_unmap_bin_file() directly from kernfs_deactivate() kernfs_unmap_bin_file() is supposed to unmap all memory mappings of the target file before kernfs_remove() finishes; however, it currently is being called from kernfs_addrm_finish() and has the same race problem as the original implementation of deactivation when there are multiple removers - only the remover which snatches the node to its addrm_cxt->removed list is guaranteed to wait for its completion before returning. It can be easily fixed by moving kernfs_unmap_bin_file() invocation from kernfs_addrm_finish() to kernfs_deactivated(). The function may be called multiple times but that shouldn't do any harm. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:42:40 -08:00
Tejun Heo	35beab0635	kernfs: restructure removal path to fix possible premature return The recursive nature of kernfs_remove() means that, even if kernfs_remove() is not allowed to be called multiple times on the same node, there may be race conditions between removal of parent and its descendants. While we can claim that kernfs_remove() shouldn't be called on one of the descendants while the removal of an ancestor is in progress, such rule is unnecessarily restrictive and very difficult to enforce. It's better to simply allow invoking kernfs_remove() as the caller sees fit as long as the caller ensures that the node is accessible. The current behavior in such situations is broken. Whoever enters removal path first takes the node off the hierarchy and then deactivates. Following removers either return as soon as it notices that it's not the first one or can't even find the target node as it has already been removed from the hierarchy. In both cases, the following removers may finish prematurely while the nodes which should be removed and drained are still being processed by the first one. This patch restructures so that multiple removers, whether through recursion or direction invocation, always follow the following rules. * When there are multiple concurrent removers, only one puts the base ref. * Regardless of which one puts the base ref, all removers are blocked until the target node is fully deactivated and removed. To achieve the above, removal path now first marks all descendants including self REMOVED and then deactivates and unlinks leftmost descendant one-by-one. kernfs_deactivate() is called directly from __kernfs_removal() and drops and regrabs kernfs_mutex for each descendant to drain active refs. As this means that multiple removers can enter kernfs_deactivate() for the same node, the function is updated so that it can handle multiple deactivators of the same node - only one actually deactivates but all wait till drain completion. The restructured removal path guarantees that a removed node gets unlinked only after the node is deactivated and drained. Combined with proper multiple deactivator handling, this guarantees that any invocation of kernfs_remove() returns only after the node itself and all its descendants are deactivated, drained and removed. v2: Draining separated into a separate loop (used to be in the same loop as unlink) and done from __kernfs_deactivate(). This is to allow exposing deactivation as a separate interface later. Root node removal was broken in v1 patch. Fixed. v3: Revert most of v2 except for root node removal fix and simplification of KERNFS_REMOVED setting loop. v4: Refreshed on top of ("kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag"). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:42:40 -08:00
Tejun Heo	abd54f028e	kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq kernfs_node->u.completion is used to notify deactivation completion from kernfs_put_active() to kernfs_deactivate(). We now allow multiple racing removals of the same node and the current removal scheme is no longer correct - kernfs_remove() invocation may return before the node is properly deactivated if it races against another removal. The removal path will be restructured to address the issue. To help such restructure which requires supporting multiple waiters, this patch replaces kernfs_node->u.completion with kernfs_root->deactivate_waitq. This makes deactivation event notifications share a per-root waitqueue_head; however, the wait path is quite cold and this will also allow shaving one pointer off kernfs_node. v2: Refreshed on top of ("kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag"). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:42:40 -08:00
Tejun Heo	a6607930b6	kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag kernfs_deactivate() forgot to check whether KERNFS_LOCKDEP is set before performing lockdep annotations and ends up feeding uninitialized lockdep_map to lockdep triggering warning like the following on USB stick hotunplug. usb 1-2: USB disconnect, device number 2 INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 1 PID: 62 Comm: khubd Not tainted 3.13.0-work+ #82 Hardware name: empty empty/S3992, BIOS 080011 10/26/2007 ffff880065ca7f60 ffff88013a4ffa08 ffffffff81cfb6bd 0000000000000002 ffff88013a4ffac8 ffffffff810f8530 ffff88013a4fc710 0000000000000002 ffff880100000000 ffffffff82a3db50 0000000000000001 ffff88013a4fc710 Call Trace: [<ffffffff81cfb6bd>] dump_stack+0x4e/0x7a [<ffffffff810f8530>] __lock_acquire+0x1910/0x1e70 [<ffffffff810f931a>] lock_acquire+0x9a/0x1d0 [<ffffffff8127c75e>] kernfs_deactivate+0xee/0x130 [<ffffffff8127d4c8>] kernfs_addrm_finish+0x38/0x60 [<ffffffff8127d701>] kernfs_remove_by_name_ns+0x51/0xa0 [<ffffffff8127b4f1>] remove_files.isra.1+0x41/0x80 [<ffffffff8127b7e7>] sysfs_remove_group+0x47/0xa0 [<ffffffff8127b873>] sysfs_remove_groups+0x33/0x50 [<ffffffff8177d66d>] device_remove_attrs+0x4d/0x80 [<ffffffff8177e25e>] device_del+0x12e/0x1d0 [<ffffffff819722c2>] usb_disconnect+0x122/0x1a0 [<ffffffff819749b5>] hub_thread+0x3c5/0x1290 [<ffffffff810c6a6d>] kthread+0xed/0x110 [<ffffffff81d0a56c>] ret_from_fork+0x7c/0xb0 Fix it by making kernfs_deactivate() perform lockdep annotations only if KERNFS_LOCKDEP is set. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fabio Estevam <festevam@gmail.com> Reported-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:42:40 -08:00
Linus Torvalds	34a9bff4ab	Driver core fix for 3.14-rc2 Here is a single kernfs fix to resolve a much-reported lockdep issue with the removal of entries in sysfs. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iEYEABECAAYFAlL1WqIACgkQMUfUDdst+ykauACeMlmM0Ro0nCjU5e9Dq9qFw0ZJ u2oAn3qxgNIRKIjZTxDfXmXgFr0sTfTW =UyO3 -----END PGP SIGNATURE----- Merge tag 'driver-core-3.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fix from Greg KH: "Here is a single kernfs fix to resolve a much-reported lockdep issue with the removal of entries in sysfs" * tag 'driver-core-3.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag	2014-02-07 14:17:18 -08:00
Martin K. Petersen	087787959c	block: Fix nr_vecs for inline integrity vectors Commit `9f060e2231` changed the way we handle allocations for the integrity vectors. When the vectors are inline there is no associated slab and consequently bvec_nr_vecs() returns 0. Ensure that we check against BIP_INLINE_VECS in that case. Reported-by: David Milburn <dmilburn@redhat.com> Tested-by: David Milburn <dmilburn@redhat.com> Cc: stable@vger.kernel.org # v3.10+ Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-02-07 13:49:48 -07:00
Steve French	83e3bc23ef	retrieving CIFS ACLs when mounted with SMB2 fails dropping session The get/set ACL xattr support for CIFS ACLs attempts to send old cifs dialect protocol requests even when mounted with SMB2 or later dialects. Sending cifs requests on an smb2 session causes problems - the server drops the session due to the illegal request. This patch makes CIFS ACL operations protocol specific to fix that. Attempting to query/set CIFS ACLs for SMB2 will now return EOPNOTSUPP (until we add worker routines for sending query ACL requests via SMB2) instead of sending invalid (cifs) requests. A separate followon patch will be needed to fix cifs_acl_to_fattr (which takes a cifs specific u16 fid so can't be abstracted to work with SMB2 until that is changed) and will be needed to fix mount problems when "cifsacl" is specified on mount with e.g. vers=2.1 Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com> CC: Stable <stable@kernel.org>	2014-02-07 11:08:17 -06:00
Steve French	d979f3b0a1	Add protocol specific operation for CIFS xattrs Changeset `666753c3ef` added protocol operations for get/setxattr to avoid calling cifs operations on smb2/smb3 mounts for xattr operations and this changeset adds the calls to cifs specific protocol operations for xattrs (in order to reenable cifs support for xattrs which was temporarily disabled by the previous changeset. We do not have SMB2/SMB3 worker function for setting xattrs yet so this only enables it for cifs. CCing stable since without these two small changsets (its small coreq `666753c3ef` is also needed) calling getfattr/setfattr on smb2/smb3 mounts causes problems. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com> CC: Stable <stable@kernel.org>	2014-02-07 11:08:15 -06:00
KOSAKI Motohiro	227d53b397	mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq To use spin_{un}lock_irq is dangerous if caller disabled interrupt. During aio buffer migration, we have a possibility to see the following call stack. aio_migratepage [disable interrupt] migrate_page_copy clear_page_dirty_for_io set_page_dirty __set_page_dirty_buffers __set_page_dirty spin_lock_irq This mean, current aio migration is a deadlockable. spin_lock_irqsave is a safer alternative and we should use it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: David Rientjes rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-02-06 13:48:51 -08:00
Zongxun Wang	fb951eb5e1	ocfs2: free allocated clusters if error occurs after ocfs2_claim_clusters Even if using the same jbd2 handle, we cannot rollback a transaction. So once some error occurs after successfully allocating clusters, the allocated clusters will never be used and it means they are lost. For example, call ocfs2_claim_clusters successfully when expanding a file, but failed in ocfs2_insert_extent. So we need free the allocated clusters if they are not used indeed. Signed-off-by: Zongxun Wang <wangzongxun@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-02-06 13:48:51 -08:00
Linus Torvalds	c4ad8f98be	execve: use 'struct filename *' for executable name passing This changes 'do_execve()' to get the executable name as a 'struct filename', and to free it when it is done. This is what the normal users want, and it simplifies and streamlines their error handling. The controlled lifetime of the executable name also fixes a use-after-free problem with the trace_sched_process_exec tracepoint: the lifetime of the passed-in string for kernel users was not at all obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize the pathname allocation lifetime with the execve() having finished, which in turn meant that the trace point that happened after mm_release() of the old process VM ended up using already free'd memory. To solve the kernel string lifetime issue, this simply introduces "getname_kernel()" that works like the normal user-space getname() function, except with the source coming from kernel memory. As Oleg points out, this also means that we could drop the tcomm[] array from 'struct linux_binprm', since the pathname lifetime now covers setup_new_exec(). That would be a separate cleanup. Reported-by: Igor Zhbanov <i.zhbanov@samsung.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-02-05 12:54:53 -08:00
Tejun Heo	da9846ae15	kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag kernfs_deactivate() forgot to check whether KERNFS_LOCKDEP is set before performing lockdep annotations and ends up feeding uninitialized lockdep_map to lockdep triggering warning like the following on USB stick hotunplug. usb 1-2: USB disconnect, device number 2 INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 1 PID: 62 Comm: khubd Not tainted 3.13.0-work+ #82 Hardware name: empty empty/S3992, BIOS 080011 10/26/2007 ffff880065ca7f60 ffff88013a4ffa08 ffffffff81cfb6bd 0000000000000002 ffff88013a4ffac8 ffffffff810f8530 ffff88013a4fc710 0000000000000002 ffff880100000000 ffffffff82a3db50 0000000000000001 ffff88013a4fc710 Call Trace: [<ffffffff81cfb6bd>] dump_stack+0x4e/0x7a [<ffffffff810f8530>] __lock_acquire+0x1910/0x1e70 [<ffffffff810f931a>] lock_acquire+0x9a/0x1d0 [<ffffffff8127c75e>] kernfs_deactivate+0xee/0x130 [<ffffffff8127d4c8>] kernfs_addrm_finish+0x38/0x60 [<ffffffff8127d701>] kernfs_remove_by_name_ns+0x51/0xa0 [<ffffffff8127b4f1>] remove_files.isra.1+0x41/0x80 [<ffffffff8127b7e7>] sysfs_remove_group+0x47/0xa0 [<ffffffff8127b873>] sysfs_remove_groups+0x33/0x50 [<ffffffff8177d66d>] device_remove_attrs+0x4d/0x80 [<ffffffff8177e25e>] device_del+0x12e/0x1d0 [<ffffffff819722c2>] usb_disconnect+0x122/0x1a0 [<ffffffff819749b5>] hub_thread+0x3c5/0x1290 [<ffffffff810c6a6d>] kthread+0xed/0x110 [<ffffffff81d0a56c>] ret_from_fork+0x7c/0xb0 Fix it by making kernfs_deactivate() perform lockdep annotations only if KERNFS_LOCKDEP is set. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fabio Estevam <festevam@gmail.com> Reported-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Jiri Kosina <jkosina@suse.cz> Reported-by: Dave Jones <davej@redhat.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Tested-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-05 11:44:04 -08:00
Linus Torvalds	878a876b2e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe is fixing compile and boot problems with our crc32c rework, and Josef has disabled snapshot aware defrag for now. As the number of snapshots increases, we're hitting OOM. For the short term we're disabling things until a bigger fix is ready" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: use late_initcall instead of module_init Btrfs: use btrfs_crc32c everywhere instead of libcrc32c Btrfs: disable snapshot aware defrag for now	2014-02-04 12:26:56 -08:00
Linus Torvalds	d7512f79fd	NFS client bugfixes for Linux 3.14 Highlights: - Fix NFSv3 acl regressions - Fix NFSv4 memory corruption due to slot table abuse in nfs4_proc_open_confirm - nfs4_destroy_session must call rpc_destroy_waitqueue -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJS8RmYAAoJEGcL54qWCgDynYEP/1Ip6/IoaqOfT5kAYo05tw30 Wqk0vNZWqdkuK/T8kUQmcbaN6Gt29QRhhJzDBhIrQ+7n8LsWwiZ3zoi9m3p2Ry7U M+xWcE6P7Cc44w3XLfll/xKb3ktfhXFQUeTHxsQ3X7uBsBP8TcfYabWHFrrkSe2B aubj/JqebLm2Uz7Zb03T4JF6LKd+3uHBr7SDK03zTUaPtn+bcVGBwS9MlFTt17Ma 2zMFMM4TNNPDGJWJMvI460BYB4nySINCA4I+78sSLOQbiDc6yZd1Zq/Ngtvs+AlW /+m3eEY3ljcawCSAHMkoef4G5ljHV8cndtr3WBtAUjGq/H6V8eGdW4PB/A2UGYRH Vdcxrm6Bn7IDRJ6baq72DuX1ZyTDPBllRz0m0l08w8p2cuK7lFDSF7/HS2MNOb2i 8Ai02CG9+aWZCz1uuF0vngKViP+hfmJqrLr1OgLuSggPd1kbqU/+HnNGtLmDfi/0 K0ceuMvIuMlahlYgb7XbGFN7RIcbIYXXhBM2dUQo+/TtGvfjHVI8EmmlC7CtEXdW TeVc9gxWgKIlF1p06aoWFzdbBkK3CtNCXgh8vQpFRSqvoAQC/mBVQW9Iank77hRA 94wiNZVLKaEO+l2WmfuGLhZ9kkLb/Iq5NVA5IUGWgjyT1yJsR5faTFfRhM+yjABE IwQF5ntXqxz82nWCRbSl =kKEV -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Highlights: - Fix NFSv3 acl regressions - Fix NFSv4 memory corruption due to slot table abuse in nfs4_proc_open_confirm - nfs4_destroy_session must call rpc_destroy_waitqueue" * tag 'nfs-for-3.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: fs: get_acl() must be allowed to return EOPNOTSUPP NFSv3: Fix return value of nfs3_proc_setacls NFSv3: Remove unused function nfs3_proc_set_default_acl NFSv4.1: nfs4_destroy_session must call rpc_destroy_waitqueue NFSv4: Fix memory corruption in nfs4_proc_open_confirm nfs: fix setting of ACLs on file creation.	2014-02-04 12:26:16 -08:00
Trond Myklebust	88a78a912e	Merge branch 'acl_fixes' into linux-next	2014-02-03 17:13:45 -05:00
Trond Myklebust	789b663ae3	fs: get_acl() must be allowed to return EOPNOTSUPP posix_acl_xattr_get requires get_acl() to return EOPNOTSUPP if the filesystem cannot support acls. This is needed for NFS, which can't know whether or not the server supports acls until it tries to get/set one. This patch converts posix_acl_chmod and posix_acl_create to deal with EOPNOTSUPP return values from get_acl(). Reported-by: Russell King <linux@arm.linux.org.uk> Link: http://lkml.kernel.org/r/20140130140834.GW15937@n2100.arm.linux.org.uk Cc: Al Viro viro@zeniv.linux.org.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-02-03 17:12:37 -05:00
Tejun Heo	0a6be65553	nfs: include xattr.h from fs/nfs/nfs3proc.c fs/nfs/nfs3proc.c is making use of xattr but was getting linux/xattr.h indirectly through linux/cgroup.h, which will soon drop the inclusion of xattr.h. Explicitly include linux/xattr.h from nfs3proc.c so that compilation doesn't fail when linux/cgroup.h drops linux/xattr.h. As the following cgroup changes will depend on these changes, it probably would be easier to route this through cgroup branch. Would that be okay? Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: linux-nfs@vger.kernel.org	2014-02-03 15:43:59 -05:00
Trond Myklebust	8f493b9cfc	NFSv3: Fix return value of nfs3_proc_setacls nfs3_proc_setacls is used internally by the NFSv3 create operations to set the acl after the file has been created. If the operation fails because the server doesn't support acls, then it must return '0', not -EOPNOTSUPP. Reported-by: Russell King <linux@arm.linux.org.uk> Link: http://lkml.kernel.org/r/20140201010328.GI15937@n2100.arm.linux.org.uk Cc: Christoph Hellwig <hch@lst.de> Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-02-03 13:14:23 -05:00
Trond Myklebust	d4c42fb493	NFSv3: Remove unused function nfs3_proc_set_default_acl Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-02-03 13:13:50 -05:00
Filipe David Borba Manana	60efa5eb2e	Btrfs: use late_initcall instead of module_init It seems that when init_btrfs_fs() is called, crc32c/crc32c-intel might not always be already initialized, which results in the call to crypto_alloc_shash() returning -ENOENT, as experienced by Ahmet who reported this. Therefore make sure init_btrfs_fs() is called after crc32c is initialized (which is at initialization level 6, module_init), by using late_initcall (which is at initialization level 7) instead of module_init for btrfs. Reported-and-Tested-by: Ahmet Inan <ainan@mathematik.uni-freiburg.de> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-03 09:01:28 -08:00
Filipe David Borba Manana	0b947aff15	Btrfs: use btrfs_crc32c everywhere instead of libcrc32c After the commit titled "Btrfs: fix btrfs boot when compiled as built-in", LIBCRC32C requirement was removed from btrfs' Kconfig. This made it not possible to build a kernel with btrfs enabled (either as module or built-in) if libcrc32c is not enabled as well. So just replace all uses of libcrc32c with the equivalent function in btrfs hash.h - btrfs_crc32c. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-03 09:01:27 -08:00
Josef Bacik	8101c8dbf6	Btrfs: disable snapshot aware defrag for now It's just broken and it's taking a lot of effort to fix it, so for now just disable it so people can defrag in peace. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-03 09:01:27 -08:00
Mikulas Patocka	1c0b8a7a62	hpfs: optimize quad buffer loading HPFS needs to load 4 consecutive 512-byte sectors when accessing the directory nodes or bitmaps. We can't switch to 2048-byte block size because files are allocated in the units of 512-byte sectors. Previously, the driver would allocate a 2048-byte area using kmalloc, copy the data from four buffers to this area and eventually copy them back if they were modified. In the current implementation of the buffer cache, buffers are allocated in the pagecache. That means that 4 consecutive 512-byte buffers are stored in consecutive areas in the kernel address space. So, we don't need to allocate extra memory and copy the content of the buffers there. This patch optimizes the code to avoid copying the buffers. It checks if the four buffers are stored in contiguous memory - if they are not, it falls back to allocating a 2048-byte area and copying data there. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-02-02 16:24:07 -08:00
Mikulas Patocka	2cbe5c76fc	hpfs: remember free space Previously, hpfs scanned all bitmaps each time the user asked for free space using statfs. This patch changes it so that hpfs scans the bitmaps only once, remembes the free space and on next invocation of statfs it returns the value instantly. New versions of wine are hammering on the statfs syscall very heavily, making some games unplayable when they're stored on hpfs, with load times in minutes. This should be backported to the stable kernels because it fixes user-visible problem (excessive level load times in wine). Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-02-02 16:24:07 -08:00
H. Peter Anvin	81993e81a9	compat: Get rid of (get\|put)_compat_time(val\|spec) We have two APIs for compatiblity timespec/val, with confusingly similar names. compat_(get\|put)_time(val\|spec) do handle the case where COMPAT_USE_64BIT_TIME is set, whereas (get\|put)_compat_time(val\|spec) do not. This is an accident waiting to happen. Clean it up by favoring the full-service version; the limited version is replaced with double-underscore versions static to kernel/compat.c. A common pattern is to convert a struct timespec to kernel format in an allocation on the user stack. Unfortunately it is open-coded in several places. Since this allocation isn't actually needed if COMPAT_USE_64BIT_TIME is true (since user format == kernel format) encapsulate that whole pattern into the function compat_convert_timespec(). An equivalent function should be written for struct timeval if it is needed in the future. Finally, get rid of compat_(get\|put)_timeval_convert(): each was only used once, and the latter was not even doing what the function said (no conversion actually was being done.) Moving the conversion into compat_sys_settimeofday() itself makes the code much more similar to sys_settimeofday() itself. v3: Remove unused compat_convert_timeval(). v2: Drop bogus "const" in the destination argument for compat_convert_time*(). Cc: Mauro Carvalho Chehab <m.chehab@samsung.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Hans Verkuil <hans.verkuil@cisco.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2014-02-02 14:09:12 -08:00
Trond Myklebust	20b9a90245	NFSv4.1: nfs4_destroy_session must call rpc_destroy_waitqueue There may still be timers active on the session waitqueues. Make sure that we kill them before freeing the memory. Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-02-01 15:13:39 -05:00
Trond Myklebust	17ead6c85c	NFSv4: Fix memory corruption in nfs4_proc_open_confirm nfs41_wake_and_assign_slot() relies on the task->tk_msg.rpc_argp and task->tk_msg.rpc_resp always pointing to the session sequence arguments. nfs4_proc_open_confirm tries to pull a fast one by reusing the open sequence structure, thus causing corruption of the NFSv4 slot table. Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-02-01 15:13:39 -05:00
Pali Rohár	1bda2ac071	afs: proc cells and rootcell are writeable Both proc files are writeable and used for configuring cells. But there is missing correct mode flag for writeable files. Without this patch both proc files are read only. [ It turns out they aren't really read-only, since root can write to them even if the write bit isn't set due to CAP_DAC_OVERRIDE ] Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-02-01 10:59:39 -08:00
Linus Torvalds	0f44bc36ba	Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "A set of cifs fixes (mostly for symlinks, and SMB2 xattrs) and cleanups" * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix check for regular file in couldbe_mf_symlink() [CIFS] Fix SMB2 mounts so they don't try to set or get xattrs via cifs CIFS: Cleanup cifs open codepath CIFS: Remove extra indentation in cifs_sfu_type CIFS: Cleanup cifs_mknod CIFS: Cleanup CIFSSMBOpen cifs: Add support for follow_link on dfs shares under posix extensions cifs: move unix extension call to cifs_query_symlink() cifs: Re-order M-F Symlink code cifs: Add create MFSymlinks to protocol ops struct cifs: use protocol specific call for query_mf_symlink() cifs: Rename MF symlink function names cifs: Rename and cleanup open_query_close_cifs_symlink() cifs: Fix memory leak in cifs_hardlink()	2014-02-01 10:52:45 -08:00
Linus Torvalds	efc518eb31	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Several obvious fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Fix mountpoint reference leakage in linkat hfsplus: use xattr handlers for removexattr Typo in compat_sys_lseek() declaration fs/super.c: sync ro remount after blocking writers vfs: unexport the getname() symbol	2014-02-01 10:43:45 -08:00
Noah Massey	718360c59f	nfs: fix setting of ACLs on file creation. nfs3_get_acl() tries to skip posix equivalent ACLs, but misinterprets the return value of posix_acl_equiv_mode(). Fix it. This is a regression introduced by "nfs: use generic posix ACL infrastructure for v3 Posix ACLs" CC: Christoph Hellwig <hch@infradead.org> CC: linux-nfs@vger.kernel.org CC: linux-fsdevel@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-01-31 20:36:07 -05:00
Linus Torvalds	8a1f006ad3	NFS client bugfixes for Linux 3.14 Highlights: - Fix several races in nfs_revalidate_mapping - NFSv4.1 slot leakage in the pNFS files driver - Stable fix for a slot leak in nfs40_sequence_done - Don't reject NFSv4 servers that support ACLs with only ALLOW aces -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJS7Bb+AAoJEGcL54qWCgDyDuQP/17nKR5e6MLhixcAbvlcH+pN 8CGolAM3HmRXDWUW/PkBH3UguG8Tzx1Ex26vIxipPeTSwZabf6194Twj6L97DEGZ 2SouD158BW1TkAbhEN/alKB/4ZCPos05iXjZkrL7MRff+8FD0UvWR2pBT1F2jQdY ZftG76Q72qhZHfH07ZMxM/v4Oy2Ge98RDD35gfuuqMSjHpmN9tiB55PeheW33LVY fu6I/JEwmlJpgy2qUcDv7v0V4mDpjC7XbcjjHpMHL8zp/C5Rx/rdgt9OQPlwmjdV FD8MWNXLc5TWxIouLDFPVUv3WZPjyu449QHS9Wc95fSqsHcdl4j4SwLAoSvUIdHt vDI5PtWhw3WAezbtiuCQnT0xcoIOn5bLjOVP13taDcV9vlZLcFlyOpZ5gVE4/Yju zm4sCW2+imDc74ERGa4fukF6QhzzAVmD8RlCJwuNzwCfXiZ36+xSanMYiPoUiwLL OVNgymrm0fe7GVFQKWN2D+Vr68OQEmARO+KfA3UzP5rQV+9CU8zSHjbcoRWZ59QG VahOS5WDLQSrMp8W37yAHH9IiAWveAAKJJTHlOniRqH90QYPgyW18fTo7YcpW313 AQGFgr/1n4t27MWRLu5rdoN5v8+kwNi0UV6oboNIPoP1v15NkEMvc7HKFj5M883R qEYfe5wqN/eRNj68NT/+ =B7f0 -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Highlights: - Fix several races in nfs_revalidate_mapping - NFSv4.1 slot leakage in the pNFS files driver - Stable fix for a slot leak in nfs40_sequence_done - Don't reject NFSv4 servers that support ACLs with only ALLOW aces" * tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: initialize the ACL support bits to zero. NFSv4.1: Cleanup NFSv4.1: Clean up nfs41_sequence_done NFSv4: Fix a slot leak in nfs40_sequence_done NFSv4.1 free slot before resending I/O to MDS nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATING NFS: Fix races in nfs_revalidate_mapping sunrpc: turn warn_gssd() log message into a dprintk() NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping nfs: handle servers that support only ALLOW ACE type.	2014-01-31 15:39:07 -08:00
Oleg Drokin	d22e6338db	Fix mountpoint reference leakage in linkat Recent changes to retry on ESTALE in linkat (commit `442e31ca5a`) introduced a mountpoint reference leak and a small memory leak in case a filesystem link operation returns ESTALE which is pretty normal for distributed filesystems like lustre, nfs and so on. Free old_path in such a case. [AV: there was another missing path_put() nearby - on the previous goto retry] Signed-off-by: Oleg Drokin: <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-01-31 17:33:13 -05:00
Christoph Hellwig	b168fff721	hfsplus: use xattr handlers for removexattr hfsplus was already using the handlers for get and set operations, and with the removal of can_set_xattr we've now allow operations that wouldn't otherwise be allowed. With this we can also centralize the special-casing of the osx. attrs that don't have prefixes on disk in the osx xattr handlers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-01-31 14:44:39 -05:00
Andrew Ruder	807612db2f	fs/super.c: sync ro remount after blocking writers Move sync_filesystem() after sb_prepare_remount_readonly(). If writers sneak in anywhere from sync_filesystem() to sb_prepare_remount_readonly() it can cause inodes to be dirtied and writeback to occur well after sys_mount() has completely successfully. This was spotted by corrupted ubifs filesystems on reboot, but appears that it can cause issues with any filesystem using writeback. Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> CC: Richard Weinberger <richard@nod.at> Co-authored-by: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Ruder <andrew.ruder@elecsyscorp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-01-31 14:29:36 -05:00
Jeff Layton	9115eac2c7	vfs: unexport the getname() symbol Leaving getname() exported when putname() isn't is a bad idea. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-01-31 14:28:56 -05:00
Linus Torvalds	e30b82bbe0	Minor bug fix for linux-3.14 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAABAgAGBQJS67pQAAoJEDaohF61QIxkfkcQAKAuNhBtaGkx8Dd9gPM98DrW F4NnitUNzcjHTSq23oRIe7rLanXTMHKMhq4S83YWISLTWCAmekOVF4xs8WTEm1Px 5/X3an65m3mvW/1V7oxixA+GLUk2LwPqPt7N+jWApkR9KB5LydQCucdbq4Vza9gd SkoK5RJe8bTE+43k962vsDJM7XCVeUOGOSXO7MyTaipRZ+fyuEPwwbQ0u7Bxl9JG HkfhCJUu3V79nruIgmP1Z2llN2XIv/fsCo3F42yBspSBUxYqnpnAdo6Zr5srnEQC kHm67VFk5sPnLPnm6AxaoonE9GX+TMzy5eUk6Lb5x5EwRlH694FFovlquFyr3RjN cmh6VWGv1jcYnEmRhKjT8Vs0GQ+Q5L2L6dEdDbck2WkVWPDdLb83wbPJRZ/g5kcF GAUI8EjjSR4ezuZqmv6tbWHpn/q7p6NekC8DVMmwtoBZgIuFmITyrqpFJUG4acnO yuGdZtYePnMHZlQMNTdzUjuOu/Wyqit5OrM/ztmrGlYCI8eiE17OInWikZzSFFcf +uDodCg5tokG8AlNgNZ+thcpjO7aAR9GCOD2INZYKlFRkJANpQ+OOK+AOafmWoqO UjmL0rH3FXRu0uhD8VLKRaUfuyMaBHJ/9+ndwmyOa+7x0BPv03hw3bGGhy6tsTHg XbxlExsTrtpKEXwznWqG =F/GX -----END PGP SIGNATURE----- Merge tag 'jfs-3.14' of git://github.com/kleikamp/linux-shaggy Pull jfs fix from David Kleikamp: "Minor bug fix for linux-3.14" * tag 'jfs-3.14' of git://github.com/kleikamp/linux-shaggy: jfs: fix xattr value size overflow in __jfs_setxattr	2014-01-31 08:14:35 -08:00
Sage Weil	77516dc92a	ceph: fix missing dput in ceph_set_acl Add matching dput() for d_find_alias(). Move d_find_alias() down a bit at Julia's suggestion. [ Introduced by commit 72466d0b92e0: "ceph: fix posix ACL hooks" ] Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-01-31 08:14:06 -08:00
Sachin Prabhu	a9a315d414	cifs: Fix check for regular file in couldbe_mf_symlink() MF Symlinks are regular files containing content in a specified format. The function couldbe_mf_symlink() checks the mode for a set S_IFREG bit as a test to confirm that it is a regular file. This bit is also set for other filetypes and simply checking for this bit being set may return false positives. We ensure that we are actually checking for a regular file by using the S_ISREG macro to test instead. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reported-by: Neil Brown <neilb@suse.de> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <smfrench@gmail.com>	2014-01-31 09:06:43 -06:00
Malahal Naineni	a1800acaf7	nfs: initialize the ACL support bits to zero. Avoid returning incorrect acl mask attributes when the server doesn't support ACLs. Signed-off-by: Malahal Naineni <malahal@us.ibm.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-01-31 08:28:16 -05:00
Linus Torvalds	e7651b819e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This is a pretty big pull, and most of these changes have been floating in btrfs-next for a long time. Filipe's properties work is a cool building block for inheriting attributes like compression down on a per inode basis. Jeff Mahoney kicked in code to export filesystem info into sysfs. Otherwise, lots of performance improvements, cleanups and bug fixes. Looks like there are still a few other small pending incrementals, but I wanted to get the bulk of this in first" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits) Btrfs: fix spin_unlock in check_ref_cleanup Btrfs: setup inode location during btrfs_init_inode_locked Btrfs: don't use ram_bytes for uncompressed inline items Btrfs: fix btrfs_search_slot_for_read backwards iteration Btrfs: do not export ulist functions Btrfs: rework ulist with list+rb_tree Btrfs: fix memory leaks on walking backrefs failure Btrfs: fix send file hole detection leading to data corruption Btrfs: add a reschedule point in btrfs_find_all_roots() Btrfs: make send's file extent item search more efficient Btrfs: fix to catch all errors when resolving indirect ref Btrfs: fix protection between walking backrefs and root deletion btrfs: fix warning while merging two adjacent extents Btrfs: fix infinite path build loops in incremental send btrfs: undo sysfs when open_ctree() fails Btrfs: fix snprintf usage by send's gen_unique_name btrfs: fix defrag 32-bit integer overflow btrfs: sysfs: list the NO_HOLES feature btrfs: sysfs: don't show reserved incompat feature btrfs: call permission checks earlier in ioctls and return EPERM ...	2014-01-30 20:08:20 -08:00
Linus Torvalds	271bf66d4c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull some further ceph acl cleanups from Sage Weil: "I do have a couple patches on top of what's in your tree, though, that clean up a couple duplicated lines in your fix and apply Christoph's cleanup" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: simplify ceph_{get,init}_acl ceph: remove duplicate declaration of ceph_setattr	2014-01-30 20:02:51 -08:00
Christoph Hellwig	7585823619	ceph: simplify ceph_{get,init}_acl - ->get_acl only gets called after we checked for a cached ACL, so no need to call get_cached_acl again. - no need to check IS_POSIXACL in ->get_acl, without that it should never get set as all the callers that set it already have the check. - you should be able to use the full posix_acl_create in CEPH Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Sage Weil <sage@inktank.com>	2014-01-30 19:26:17 -08:00
Linus Torvalds	53d8ab29f8	Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block Pull block IO driver changes from Jens Axboe: - bcache update from Kent Overstreet. - two bcache fixes from Nicholas Swenson. - cciss pci init error fix from Andrew. - underflow fix in the parallel IDE pg_write code from Dan Carpenter. I'm sure the 1 (or 0) users of that are now happy. - two PCI related fixes for sx8 from Jingoo Han. - floppy init fix for first block read from Jiri Kosina. - pktcdvd error return miss fix from Julia Lawall. - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from Michael Opdenacker. - comment typo fix for the loop driver from Olaf Hering. - potential oops fix for null_blk from Raghavendra K T. - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing an OOM problem and a problem with handling security locked conditions * 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits) mg_disk: Spelling s/finised/finished/ null_blk: Null pointer deference problem in alloc_page_buffers mtip32xx: Correctly handle security locked condition mtip32xx: Make SGL container per-command to eliminate high order dma allocation drivers/block/loop.c: fix comment typo in loop_config_discard drivers/block/cciss.c:cciss_init_one(): use proper errnos drivers/block/paride/pg.c: underflow bug in pg_write() drivers/block/sx8.c: remove unnecessary pci_set_drvdata() drivers/block/sx8.c: use module_pci_driver() floppy: bail out in open() if drive is not responding to block0 read bcache: Fix auxiliary search trees for key size > cacheline size bcache: Don't return -EINTR when insert finished bcache: Improve bucket_prio() calculation bcache: Add bch_bkey_equal_header() bcache: update bch_bkey_try_merge bcache: Move insert_fixup() to btree_keys_ops bcache: Convert sorting to btree_keys bcache: Convert debug code to btree_keys bcache: Convert btree_iter to struct btree_keys bcache: Refactor bset_tree sysfs stats ...	2014-01-30 11:40:10 -08:00
Linus Torvalds	f568849eda	Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...	2014-01-30 11:19:05 -08:00
Linus Torvalds	d9894c228b	Merge branch 'for-3.14' of git://linux-nfs.org/~bfields/linux Pull nfsd updates from Bruce Fields: - Handle some loose ends from the vfs read delegation support. (For example nfsd can stop breaking leases on its own in a fewer places where it can now depend on the vfs to.) - Make life a little easier for NFSv4-only configurations (thanks to Kinglong Mee). - Fix some gss-proxy problems (thanks Jeff Layton). - miscellaneous bug fixes and cleanup * 'for-3.14' of git://linux-nfs.org/~bfields/linux: (38 commits) nfsd: consider CLAIM_FH when handing out delegation nfsd4: fix delegation-unlink/rename race nfsd4: delay setting current_fh in open nfsd4: minor nfs4_setlease cleanup gss_krb5: use lcm from kernel lib nfsd4: decrease nfsd4_encode_fattr stack usage nfsd: fix encode_entryplus_baggage stack usage nfsd4: simplify xdr encoding of nfsv4 names nfsd4: encode_rdattr_error cleanup nfsd4: nfsd4_encode_fattr cleanup minor svcauth_gss.c cleanup nfsd4: better VERIFY comment nfsd4: break only delegations when appropriate NFSD: Fix a memory leak in nfsd4_create_session sunrpc: get rid of use_gssp_lock sunrpc: fix potential race between setting use_gss_proxy and the upcall rpc_clnt sunrpc: don't wait for write before allowing reads from use-gss-proxy file nfsd: get rid of unused function definition Define op_iattr for nfsd4_open instead using macro NFSD: fix compile warning without CONFIG_NFSD_V3 ...	2014-01-30 10:18:43 -08:00
Christoph Hellwig	5f13ee9c1c	nfs: fix xattr inode op pointers when disabled Chris Mason reported a NULL pointer derefernence in generic_getxattr() that was due to sb->s_xattr being NULL. The reason is that the nfs #ifdef's for ACL support were misplaced, and the nfs3 inode operations had the xattr operation pointers set up, even though xattrs were not actually supported. As a result, the xattr code was being called without the infrastructure having been set up. Move the #ifdef's appropriately. Reported-and-tested-by: Chris Mason <clm@fb.com> Acked-by: Al Viro viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-01-30 09:37:49 -08:00
Peter Rosin	32d35d44d0	ceph: remove duplicate declaration of ceph_setattr Signed-off-by: Peter Rosin <peda@lysator.liu.se> Signed-off-by: Sage Weil <sage@inktank.com>	2014-01-30 08:38:00 -08:00
Linus Torvalds	13293115d1	Merge branch 'akpm' (patches from Andrew Morton) Merge random fixes from Andrew Morton: "Random fixes. I have one batch remaining for -rc1, mainly zram changes which await a merge of Jens's trees" * emailed patches fron Andrew Morton akpm@linux-foundation.org>: MAINTAINERS: ADI Linux development mailing lists: change to the new server Documentation: fix multiple typo occurences s/KenelVersion/KernelVersion/ dma-debug: fix overlap detection memblock: add limit checking to memblock_virt_alloc mm/readahead.c: fix do_readahead() for no readpage(s) mm/slub.c: do not VM_BUG_ON_PAGE() for temporary on-stack pages slab: fix wrong retval on kmem_cache_create_memcg error path s390/compat: change parameter types from unsigned long to compat_ulong_t fs/compat: fix lookup_dcookie() parameter handling fs/compat: fix parameter handling for compat readv/writev syscalls mm/mempolicy.c: convert to pr_foo() mm: numa: initialise numa balancing after jump label initialisation mm/page-writeback.c: do not count anon pages as dirtyable memory mm/page-writeback.c: fix dirty_balance_reserve subtraction from dirtyable memory mm: document improved handling of swappiness==0 lib/genalloc.c: add check gen_pool_dma_alloc() if dma pointer is not NULL	2014-01-29 16:22:54 -08:00
Heiko Carstens	d8d14bd09c	fs/compat: fix lookup_dcookie() parameter handling Commit `d5dc77bfee` ("consolidate compat lookup_dcookie()") coverted all architectures to the new compat_sys_lookup_dcookie() syscall. The "len" paramater of the new compat syscall must have the type compat_size_t in order to enforce zero extension for architectures where the ABI requires that the caller of a function performed zero and/or sign extension to 64 bit of all parameters. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <stable@vger.kernel.org> [v3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-01-29 16:22:40 -08:00
Heiko Carstens	dfd948e32a	fs/compat: fix parameter handling for compat readv/writev syscalls We got a report that the pwritev syscall does not work correctly in compat mode on s390. It turned out that with commit `72ec35163f` ("switch compat readv/writev variants to COMPAT_SYSCALL_DEFINE") we lost the zero extension of a couple of syscall parameters because the some parameter types haven't been converted from unsigned long to compat_ulong_t. This is needed for architectures where the ABI requires that the caller of a function performed zero and/or sign extension to 64 bit of all parameters. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <stable@vger.kernel.org> [v3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-01-29 16:22:39 -08:00
Linus Torvalds	1ecd7450c0	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fanotify use-after-free fixes from Jan Kara: "Three fixes for the fanotify use after free problems guys were reporting. I have ended up with different lifetime rules for struct fanotify_event_info depending on whether it is for permission event or normal event which isn't ideal. My plan is to split these into two different structures (as permission events need larger struct anyway) which will make the rules trivial again. But that can wait for later I guess (but I can add the patch to the pile if you want), now I wanted to make -rc1 boot for these guys" [ "These guys" being Jiri Kosina and Dave Jones that reported the slab corruption issues due to incorrect object lifetimes ] * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: Fix use after free for permission events fsnotify: Do not return merged event from fsnotify_add_notify_event() fanotify: Fix use after free in mask checking	2014-01-29 16:09:34 -08:00
Sage Weil	72466d0b92	ceph: fix posix ACL hooks The merge of commit `7221fe4c2e` ("ceph: add acl for cephfs") raced with upstream changes in the generic POSIX ACL code (eg commit `2aeccbe957` "fs: add generic xattr_acl handlers" and others). Some of the fallout was fixed in commit `4db658ea0c` ("ceph: Fix up after semantic merge conflict"), but it was incomplete: the set_acl inode_operation wasn't getting set, and the prototype needed to be adjusted a bit (it doesn't take a dentry anymore). Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-01-29 16:05:28 -08:00
Trond Myklebust	905e7dafbe	NFSv4.1: Cleanup It is now completely safe to call nfs41_sequence_free_slot with a NULL slot. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-01-29 12:26:57 -05:00
Trond Myklebust	a13ce7c629	NFSv4.1: Clean up nfs41_sequence_done Move the test for res->sr_slot == NULL out of the nfs41_sequence_free_slot helper and into the main function for efficiency. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-01-29 12:24:03 -05:00
Trond Myklebust	cab92c1982	NFSv4: Fix a slot leak in nfs40_sequence_done The check for whether or not we sent an RPC call in nfs40_sequence_done is insufficient to decide whether or not we are holding a session slot, and thus should not be used to decide when to free that slot. This patch replaces the RPC_WAS_SENT() test with the correct test for whether or not slot == NULL. Cc: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-01-29 12:12:15 -05:00
Andy Adamson	f9c96fcc50	NFSv4.1 free slot before resending I/O to MDS Fix a dynamic session slot leak where a slot is preallocated and I/O is resent through the MDS. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-01-29 11:54:55 -05:00
Chris Mason	cf93da7bcf	Btrfs: fix spin_unlock in check_ref_cleanup Our goto out should have gone a little farther. Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:31 -08:00
Chris Mason	90d3e592e9	Btrfs: setup inode location during btrfs_init_inode_locked We have a race during inode init because the BTRFS_I(inode)->location is setup after the inode hash table lock is dropped. btrfs_find_actor uses the location field, so our search might not find an existing inode in the hash table if we race with the inode init code. This commit changes things to setup the location field sooner. Also the find actor now uses only the location objectid to match inodes. For inode hashing, we just need a unique and stable test, it doesn't have to reflect the inode numbers we show to userland. Signed-off-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org	2014-01-29 07:06:30 -08:00
Chris Mason	514ac8ad87	Btrfs: don't use ram_bytes for uncompressed inline items If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect the new size. The fixe uses the size directly from the item header when reading uncompressed inlines, and also fixes truncate to update the size as it goes. Reported-by: Jens Axboe <axboe@fb.com> Signed-off-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org	2014-01-29 07:06:29 -08:00
Filipe David Borba Manana	23c6bf6a91	Btrfs: fix btrfs_search_slot_for_read backwards iteration If the current path's leaf slot is 0, we do search for the previous leaf (via btrfs_prev_leaf) and set the new path's leaf slot to a value corresponding to the number of items - 1 of the former leaf. Fix this by using the slot set by btrfs_prev_leaf, decrementing it by 1 if it's equal to the leaf's number of items. Use of btrfs_search_slot_for_read() for backward iteration is used in particular by the send feature, which could miss items when the input leaf has less items than its previous leaf. This could be reproduced by running btrfs/007 from xfstests in a loop. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:28 -08:00
Wang Shilong	49fc647a2c	Btrfs: do not export ulist functions There are not any users that use ulist except Btrfs,don't export them. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:27 -08:00
Wang Shilong	4c7a6f74ce	Btrfs: rework ulist with list+rb_tree We are really suffering from now ulist's implementation, some developers gave their try, and i just gave some of my ideas for things: 1. use list+rb_tree instead of arrary+rb_tree 2. add cur_list to iterator rather than ulist structure. 3. add seqnum into every node when they are added, this is used to do selfcheck when iterating node. I noticed Zach Brown's comments before, long term is to kick off ulist implementation, however, for now, we need at least avoid arrary from ulist. Cc: Liu Bo <bo.li.liu@oracle.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Zach Brown <zab@redhat.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:27 -08:00
Wang Shilong	f05c474688	Btrfs: fix memory leaks on walking backrefs failure When walking backrefs, we may iterate every inode's extent and add/merge them into ulist, and the caller will free memory from ulist. However, if we fail to allocate inode's extents element memory or ulist_add() fail to allocate memory, we won't add allocated memory into ulist, and the caller won't free some allocated memory thus memory leaks happen. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:26 -08:00
Filipe David Borba Manana	bf54f412f0	Btrfs: fix send file hole detection leading to data corruption There was a case where file hole detection was incorrect and it would cause an incremental send to override a section of a file with zeroes. This happened in the case where between the last leaf we processed which contained a file extent item for our current inode and the leaf we're currently are at (and has a file extent item for our current inode) there are only leafs containing exclusively file extent items for our current inode, and none of them was updated since the previous send operation. The file hole detection code would incorrectly consider the file range covered by these leafs as a hole. A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:25 -08:00
Wang Shilong	bca1a29003	Btrfs: add a reschedule point in btrfs_find_all_roots() I can easily trigger the following warnings when enabling quota in my virtual machine(running Opensuse), Steps are firstly creating a subvolume full of fragment extents, and then create many snapshots (500 in my test case). [ 2362.808459] BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-qgroup-re:1970] [ 2362.809023] task: e4af8450 ti: e371c000 task.ti: e371c000 [ 2362.809026] EIP: 0060:[<fa38f4ae>] EFLAGS: 00000246 CPU: 0 [ 2362.809049] EIP is at __merge_refs+0x5e/0x100 [btrfs] [ 2362.809051] EAX: 00000000 EBX: cfadbcf0 ECX: 00000000 EDX: cfadbcb0 [ 2362.809052] ESI: dd8d3370 EDI: e371dde0 EBP: e371dd6c ESP: e371dd5c [ 2362.809054] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [ 2362.809055] CR0: 80050033 CR2: ac454d50 CR3: 009a9000 CR4: 001407d0 [ 2362.809099] Stack: [ 2362.809100] 00000001 e371dde0 dfcc6890 f29f8000 e371de28 fa39016d 00000011 00000001 [ 2362.809105] 99bfc000 00000000 93928000 00000000 00000001 00000050 e371dda8 00000001 [ 2362.809109] f3a31000 f3413000 00000001 e371ddb8 000040a8 00000202 00000000 00000023 [ 2362.809113] Call Trace: [ 2362.809136] [<fa39016d>] find_parent_nodes+0x34d/0x1280 [btrfs] [ 2362.809156] [<fa391172>] btrfs_find_all_roots+0xb2/0x110 [btrfs] [ 2362.809174] [<fa3934a8>] btrfs_qgroup_rescan_worker+0x358/0x7a0 [btrfs] [ 2362.809180] [<c024d0ce>] ? lock_timer_base.isra.39+0x1e/0x40 [ 2362.809199] [<fa3648df>] worker_loop+0xff/0x470 [btrfs] [ 2362.809204] [<c027a88a>] ? __wake_up_locked+0x1a/0x20 [ 2362.809221] [<fa3647e0>] ? btrfs_queue_worker+0x2b0/0x2b0 [btrfs] [ 2362.809225] [<c025ebbc>] kthread+0x9c/0xb0 [ 2362.809229] [<c06b487b>] ret_from_kernel_thread+0x1b/0x30 [ 2362.809233] [<c025eb20>] ? kthread_create_on_node+0x110/0x110 By adding a reschedule point at the end of btrfs_find_all_roots(), i no longer hit these warnings. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:25 -08:00
Filipe David Borba Manana	7fdd29d02e	Btrfs: make send's file extent item search more efficient Instead of looking for a file extent item, process it, release the path and do a btree search for the next file extent item, just process all file extent items in a leaf without intermediate btree searches. This way we save cpu and we're not blocking other tasks or affecting concurrency on the btree, because send's paths use the commit root and skip btree node/leaf locking. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:24 -08:00
Wang Shilong	95def2ede1	Btrfs: fix to catch all errors when resolving indirect ref We can only tolerate ENOENT here, for other errors, we should return directly. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:23 -08:00
Wang Shilong	538f72cdf0	Btrfs: fix protection between walking backrefs and root deletion There is a race condition between resolving indirect ref and root deletion, and we should gurantee that root can not be destroyed to avoid accessing broken tree here. Here we fix it by holding @subvol_srcu, and we will release it as soon as we have held root node lock. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:23 -08:00
Gui Hecheng	3c9665df0c	btrfs: fix warning while merging two adjacent extents When we have two adjacent extents in relink_extent_backref, we try to merge them. When we use btrfs_search_slot to locate the slot for the current extent, we shouldn't set "ins_len = 1", because we will merge it into the previous extent rather than insert a new item. Otherwise, we may happen to create a new leaf in btrfs_search_slot and path->slot[0] will be 0. Then we try to fetch the previous item using "path->slots[0]--", and it will cause a warning as follows: [ 145.713385] WARNING: CPU: 3 PID: 1796 at fs/btrfs/extent_io.c:5043 map_private_extent_buffer+0xd4/0xe0 [ 145.713387] btrfs bad mapping eb start `5337088` len 4096, wanted 167772306 8 ... [ 145.713462] [<ffffffffa034b1f4>] map_private_extent_buffer+0xd4/0xe0 [ 145.713476] [<ffffffffa030097a>] ? btrfs_free_path+0x2a/0x40 [ 145.713485] [<ffffffffa0340864>] btrfs_get_token_64+0x64/0xf0 [ 145.713498] [<ffffffffa033472c>] relink_extent_backref+0x41c/0x820 [ 145.713508] [<ffffffffa0334d69>] btrfs_finish_ordered_io+0x239/0xa80 I encounter this warning when running defrag having mkfs.btrfs with option -M. At the same time there are read/writes & snapshots running at background. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:22 -08:00
Filipe David Borba Manana	9f03740a95	Btrfs: fix infinite path build loops in incremental send The send operation processes inodes by their ascending number, and assumes that any rename/move operation can be successfully performed (sent to the caller) once all previous inodes (those with a smaller inode number than the one we're currently processing) were processed. This is not true when an incremental send had to process an hierarchical change between 2 snapshots where the parent-children relationship between directory inodes was reversed - that is, parents became children and children became parents. This situation made the path building code go into an infinite loop, which kept allocating more and more memory that eventually lead to a krealloc warning being displayed in dmesg: WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0() Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012 [ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007 [ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0 [ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0 [ 5381.660457] Call Trace: [ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68 [ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0 [ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20 [ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0 [ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60 [ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440 [ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60 [ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50 [ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0 [ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0 [ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50 [ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50 [ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100 [ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230 [ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs] [ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe [ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0 [ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs] [ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs] [ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs] [ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs] Even without this loop, the incremental send couldn't succeed, because it would attempt to send a rename/move operation for the lower inode before the highest inode number was renamed/move. This issue is easy to trigger with the following steps: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/d $ mkdir /mnt/btrfs/a/b/c2 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2 $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send The structure of the filesystem when the first snapshot is taken is: . (ino 256) \|-- a (ino 257) \|-- b (ino 258) \|-- c (ino 259) \| \|-- d (ino 260) \| \|-- c2 (ino 261) And its structure when the second snapshot is taken is: . (ino 256) \|-- a (ino 257) \|-- b (ino 258) \|-- c2 (ino 261) \|-- d2 (ino 260) \|-- cc (ino 259) Before the move/rename operation is performed for the inode 259, the move/rename for inode 260 must be performed, since 259 is now a child of 260. A test case for xfstests, with a more complex scenario, will follow soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:22 -08:00
Jan Kara	8581679424	fanotify: Fix use after free for permission events Currently struct fanotify_event_info has been destroyed immediately after reporting its contents to userspace. However that is wrong for permission events because those need to stay around until userspace provides response which is filled back in fanotify_event_info. So change to code to free permission events only after we have got the response from userspace. Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz> Reported-and-tested-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: Jan Kara <jack@suse.cz>	2014-01-29 13:57:17 +01:00
Jan Kara	83c0e1b442	fsnotify: Do not return merged event from fsnotify_add_notify_event() The event returned from fsnotify_add_notify_event() cannot ever be used safely as the event may be freed by the time the function returns (after dropping notification_mutex). So change the prototype to just return whether the event was added or merged into some existing event. Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz> Reported-and-tested-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: Jan Kara <jack@suse.cz>	2014-01-29 13:57:10 +01:00
Jan Kara	13116dfd13	fanotify: Fix use after free in mask checking We cannot use the event structure returned from fsnotify_add_notify_event() because that event can be freed by the time that function returns. Use the mask argument passed into the event handler directly instead. This also fixes a possible problem when we could unnecessarily wait for permission response for a normal fanotify event which got merged with a permission event. We also disallow merging of permission event with any other event so that we know the permission event which we just created is the one on which we should wait for permission response. Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz> Reported-and-tested-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: Jan Kara <jack@suse.cz>	2014-01-29 13:57:04 +01:00
Linus Torvalds	0e47c969c6	MTD updates for 3.14: - Add me (Brian Norris) as an additional MTD maintainer (it'd be nice to get David's "ack" for this; I'm sure he approves, but he's been pretty silent lately) - Add Ezequiel Garcie as maintainer for the pxa3xx NAND driver - Last (?) round of pxa3xx improvements for supporting Armada 370/XP - Typical churn in driver boilerplate (OOM messages, printk()'s, devm_, etc.) - Quad read mode support for SPI NOR driver (m25p80) - Update Davinci NAND driver to prepare for use on new platforms - Begin to kill off NAND_MAX_{PAGE,OOB}SIZE macros; more work is pending - Miscellaneous NAND device support (new IDs) - Add READ RETRY support for Micron MLC NAND - Support new GPMI NAND ECC layout device-tree binding - Avoid mapping stack/vmalloc() memory for GPMI NAND DMA -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) iQIcBAABAgAGBQJS50wMAAoJEFySrpd9RFgt1I0P/3fxzmHMmTRPjNRmtVqE5buW L3CjI1sPypA5PgCjjwABQdiw8WztBt6KVlMv5Gs6gaoi/WYAyAWz40PfwuBWFklD L8+cuKuuslsFfqh+gdkrBkNKnrFfMJpTDx57E5+dciInmWWB1w14sHMAYk5A/ZHK MBfXCcgfIECeQp1zSKmBtQxkhqn20hOS232gsCRWu2NWWu4UCzEcRLHGCBP1qqi3 6joQhrICiQJwH09D+I9YfyS/oisM+75Df+79xephcjMaJYAAE3tLVrKM84CBso+R R/wjOvX+xFVyvf3n+CftxbmbFVXtRAWqCEEHu2yKffiF9oipp9xOm6gqJb9SYX/7 n54fexx5cM1AE8OeDCxgbVCNfgoqDS6hk03d5oVMRmASShr0Ye4irG77Dy/KTMGC UdB2gb5HoKPC2N2IWWRlpwRoyLR9Xhf/+iCHjac0kB7oYQ0IdMyC0d4h7Prp5r/4 zaz/pBNNr1lL8DOFrIog02U1DL7jOBtMmlaCUnfrPkXOiwG04Inkfy7Gfug0uTIh o1JhwCWmrY/CLYmpMqKFK9V4gr72CZTGIjMeNpb2NUAM0XlRbf3AQ8x7P7Q9UjK6 ARZC/3lPrkSE8BDgSP8cT/oXB5zsIAU0jJbu3yIixD/v9WwMyKjoS4E8crl7qVB4 KDAzQNR3rjwmE9lOjtsx =3nT9 -----END PGP SIGNATURE----- Merge tag 'for-linus-20140127' of git://git.infradead.org/linux-mtd Pull MTD updates from Brian Norris: - Add me (Brian Norris) as an additional MTD maintainer (it'd be nice to get David's "ack" for this; I'm sure he approves, but he's been pretty silent lately) - Add Ezequiel Garcie as maintainer for the pxa3xx NAND driver - Last (?) round of pxa3xx improvements for supporting Armada 370/XP - Typical churn in driver boilerplate (OOM messages, printk()'s, devm_, etc.) - Quad read mode support for SPI NOR driver (m25p80) - Update Davinci NAND driver to prepare for use on new platforms - Begin to kill off NAND_MAX_{PAGE,OOB}SIZE macros; more work is pending - Miscellaneous NAND device support (new IDs) - Add READ RETRY support for Micron MLC NAND - Support new GPMI NAND ECC layout device-tree binding - Avoid mapping stack/vmalloc() memory for GPMI NAND DMA * tag 'for-linus-20140127' of git://git.infradead.org/linux-mtd: (151 commits) mtd: gpmi: add sanity check when mapping DMA for read_buf/write_buf mtd: gpmi: allocate a proper buffer for non ECC read/write mtd: m25p80: Set rx_nbits for Quad SPI transfers mtd: m25p80: Enable Quad SPI read transfers for s25fl512s mtd: s3c2410: Merge plat/regs-nand.h into s3c2410.c mtd: mtdram: add missing 'const' mtd: m25p80: assign default read command mtd: nuc900_nand: remove redundant return value check of platform_get_resource() mtd: plat_nand: remove redundant return value check of platform_get_resource() mtd: nand: add Intel manufacturer ID mtd: nand: add SanDisk manufacturer ID mtd: nand: add support for Samsung K9LCG08U0B mtd: nand: pxa3xx: Add support for 2048 bytes page size devices mtd: m25p80: Use OPCODE_QUAD_READ_4B for 4-byte addressing mtd: nand: don't use {read,write}_buf for 8-bit transfers mtd: nand: use __packed shorthand mtd: nand: support Micron READ RETRY mtd: nand: add generic READ RETRY support mtd: nand: add ONFI vendor block for Micron mtd: nand: localize ECC failures per page ...	2014-01-28 18:56:37 -08:00
Linus Torvalds	f1499382f1	xfs: update #2 for v3.14-rc1 - allow logical sector sized direct io on 'advanced format' 4k/512 disk. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJS6C5lAAoJENaLyazVq6ZOFFMP/Ag3IzyGv9zQ1I1ebqxdGNmU 8bNVvurN+Az1ISgXkMdcbqNgpVxcWaAcpbJwxWsRBaKBxtBmBWa0TF33XEmrFdmO Qqw3qLSsPzIDP5l518/bpn1Q0XhMWO2zss9HtHogw/7WBkB5qQYn94DAwIF1Zrga cimpuiVv7u73BnlMJeQl+mqONkuEUl9Oc9lOyDP7lqbAlp3mBDylTs0q99kug8fF LSQ+f+06PFKshPzTtcZHhqwMtK7Pd+wPp3B3i0iGvv1GcWWL1adM5HfWPLLQ/85l CvdTnKnIWpNvmaEbPtpB+kuz7V3CtnLFIBM1oRiYZoNRNEUG3SkkqH+xSkXY+Wfr Wiuplt50vHUvn1tiL6n2GE0QAyTNhcy2RiWzpkIim6S2LUXCmdOKvT9hEEs5ymVl Y+VSXwdPEbCHSfhqkg+PXUbWxk+WTkDe9rOMFK4VcTjDuxiWUAiFgVpEHrxjPV4A rFPwNgfUpaRMou5SzTZHLuM8YQ75DckDH+O3XpWD4boEfeTGeV3LhlRzLvIkL0I2 8VB24kV9TPUj+rphh9AckdU3kfN1wIb7ZOjwr008I5udfgPwjoj+kSgnd+k2yPtr r/ryHTV+RchbsiYEdriibSgUOtJNdNWnboba1DEDZDSvqALsq5eKWmpm3InE85Ix 2bN0SdjsTphp/Mc1Mllg =c0tN -----END PGP SIGNATURE----- Merge tag 'xfs-for-linus-v3.14-rc1-2' of git://oss.sgi.com/xfs/xfs Pull second xfs update from Ben Myers: "Allow logical sector sized direct io on 'advanced format' 4k/512 disk" * tag 'xfs-for-linus-v3.14-rc1-2' of git://oss.sgi.com/xfs/xfs: xfs: allow logical-sector sized O_DIRECT xfs: rename xfs_buftarg structure members xfs: clean up xfs_buftarg	2014-01-28 18:21:22 -08:00
Linus Torvalds	4db658ea0c	ceph: Fix up after semantic merge conflict The previous ceph-client merge resulted in ceph not even building, because there was a merge conflict that wasn't visible as an actual data conflict: commit `7221fe4c2e` ("ceph: add acl for cephfs") added support for POSIX ACL's into Ceph, but unluckily we also had the VFS tree change a lot of the POSIX ACL helper functions to be much more helpful to filesystems (see for example commits `2aeccbe957` "fs: add generic xattr_acl handlers", `5bf3258fd2` "fs: make posix_acl_chmod more useful" and `37bc15392a` "fs: make posix_acl_create more useful") The reason this conflict wasn't obvious was many-fold: because it was a semantic conflict rather than a data conflict, it wasn't visible in the git merge as a conflict. And because the VFS tree hadn't been in linux-next, people hadn't become aware of it that way. And because I was at jury duty this morning, I was using my laptop and as a result not doing constant "allmodconfig" builds. Anyway, this fixes the build and generally removes a fair chunk of the Ceph POSIX ACL support code, since the improved helpers seem to match really well for Ceph too. But I don't actually have any way to test the end result, and I was really hoping for some ACK's for this. Oh, well. Not compiling certainly doesn't make things easier to test, so I'm committing this without the acks after having waited for four hours... Plus it's what I would have done for the merge had I noticed the semantic conflict.. Reported-by: Dave Jones <davej@redhat.com> Cc: Sage Weil <sage@inktank.com> Cc: Guangliang Zhao <lucienchao@gmail.com> Cc: Li Wang <li.wang@ubuntykylin.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-01-28 18:06:18 -08:00
Anand Jain	2365dd3ca0	btrfs: undo sysfs when open_ctree() fails reproducer: mkfs.btrfs -f /dev/sdb &&\ mount /dev/sdb /btrfs &&\ btrfs dev add -f /dev/sdc /btrfs &&\ umount /btrfs &&\ wipefs -a /dev/sdc &&\ mount -o degraded /dev/sdb /btrfs //above mount fails so try with RO mount -o degraded,ro /dev/sdb /btrfs ------ sysfs: cannot create duplicate filename '/fs/btrfs/3f48c79e-5ed0-4e87-b189-86e749e503f4' :: dump_stack+0x49/0x5e warn_slowpath_common+0x87/0xb0 warn_slowpath_fmt+0x41/0x50 strlcat+0x69/0x80 sysfs_warn_dup+0x87/0xa0 sysfs_add_one+0x40/0x50 create_dir+0x76/0xc0 sysfs_create_dir_ns+0x7a/0xc0 kobject_add_internal+0xad/0x220 kobject_add_varg+0x38/0x60 kobject_init_and_add+0x53/0x70 mutex_lock+0x11/0x40 __free_pages+0x25/0x30 free_pages+0x41/0x50 selinux_sb_copy_data+0x14e/0x1e0 mount_fs+0x3e/0x1a0 vfs_kern_mount+0x71/0x120 do_mount+0x3f7/0x980 SyS_mount+0x8b/0xe0 system_call_fastpath+0x16/0x1b ------ further 'modprobe -r btrfs' fails as well Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:44 -08:00
Filipe David Borba Manana	f74b86d855	Btrfs: fix snprintf usage by send's gen_unique_name The buffer size argument passed to snprintf must account for the trailing null byte added by snprintf, and it returns a value >= then sizeof(buffer) when the string can't fit in the buffer. Since our buffer has a size of 64 characters, and the maximum orphan name we can generate is 63 characters wide, we must pass 64 as the buffer size to snprintf, and not 63. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:43 -08:00
Justin Maggard	c41570c9d2	btrfs: fix defrag 32-bit integer overflow When defragging a very large file, the cluster variable can wrap its 32-bit signed int type and become negative, which eventually gets passed to btrfs_force_ra() as a very large unsigned long value. On 32-bit platforms, this eventually results in an Oops from the SLAB allocator. Change the cluster and max_cluster signed int variables to unsigned long to match the readahead functions. This also allows the min() comparison in btrfs_defrag_file() to work as intended. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:43 -08:00
David Sterba	c736c095de	btrfs: sysfs: list the NO_HOLES feature Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:42 -08:00
David Sterba	66b4bbd4f5	btrfs: sysfs: don't show reserved incompat feature The COMPRESS_LZOv2 incompat featue is currently not implemented, the bit is only reserved, no point to list it in sysfs. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:41 -08:00
David Sterba	bd60ea0fe9	btrfs: call permission checks earlier in ioctls and return EPERM The owner and capability checks in IOC_SUBVOL_SETFLAGS and SET_RECEIVED_SUBVOL should be called before any other checks are done. Also unify the error code to EPERM. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:41 -08:00
David Sterba	d024206133	btrfs: restrict snapshotting to own subvolumes Currently, any user can snapshot any subvolume if the path is accessible and thus indirectly create and keep files he does not own under his direcotries. This is not possible with traditional directories. In security context, a user can snapshot root filesystem and pin any potentially buggy binaries, even if the updates are applied. All the snapshots are visible to the administrator, so it's possible to verify if there are suspicious snapshots. Another more practical problem is that any user can pin the space used by eg. root and cause ENOSPC. Original report: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786 CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:40 -08:00
Miao Xie	89d4346a36	Btrfs: fix wrong block group in trace during the free space allocation We allocate the free space from the former block group, not the current one, so should use the former one to output the trace information. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:40 -08:00
Miao Xie	215a63d139	Btrfs: cleanup the code of used_block_group in find_free_extent() used_block_group is just used for the space cluster which doesn't belong to the current block group, the other place needn't use it. Or the logic of code seems unclear. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:39 -08:00
Miao Xie	920e4a58d2	Btrfs: cleanup the redundant code for the block group allocation and init Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:38 -08:00
Miao Xie	26b47ff65b	Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss It is better that the position of the lock is close to the data which is protected by it, because they may be in the same cache line, we will load less cache lines when we access them. So we rearrange the members' position of btrfs_space_info structure to make the lock be closer to the its data. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:38 -08:00
Wang Shilong	ffcfaf8179	Btrfs: fix wrong search path initialization before searching tree root To search tree root without transaction protection, we should neither search commit root nor skip locking here, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:37 -08:00
Miao Xie	23c671a588	Btrfs: flush the dirty pages of the ordered extent aggressively during logging csum The performance of fsync dropped down suddenly sometimes, the main reason of this problem was that we might only flush part dirty pages in a ordered extent, then got that ordered extent, wait for the csum calcucation. But if no task flushed the left part, we would wait until the flusher flushed them, sometimes we need wait for several seconds, it made the performance drop down suddenly. (On my box, it drop down from 56MB/s to 4-10MB/s) This patch improves the above problem by flushing left dirty pages aggressively. Test Environment: CPU: 2CPU * 2Cores Memory: 4GB Partition: 20GB(HDD) Test Command: # sysbench --num-threads=8 --test=fileio --file-num=1 \ > --file-total-size=8G --file-block-size=32768 \ > --file-io-mode=sync --file-fsync-freq=100 \ > --file-fsync-end=no --max-requests=10000 \ > --file-test-mode=rndwr run Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:37 -08:00
Wang Shilong	2c21b4d733	Btrfs: fix transaction abortion when remounting btrfs from RW to RO Steps to reproduce: # mkfs.btrfs -f /dev/sda8 # mount /dev/sda8 /mnt -o flushoncommit # dd if=/dev/zero of=/mnt/data bs=4k count=102400 & # mount /dev/sda8 /mnt -o remount, ro When remounting RW to RO, the logic is to firstly set flag to RO and then commit transaction, however with option flushoncommit enabled,we will do RO check within committing transaction, so we get a transaction abortion here. Actually,here check is wrong, we should check if FS_STATE_ERROR is set, fix it. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Suggested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:36 -08:00
Filipe David Borba Manana	e4355f34ef	Btrfs: faster file extent item search in clone ioctl When we are looking for file extent items that intersect the cloning range, for each one that falls completely outside the range, don't release the path and do another full tree search - just move on to the next slot and copy the file extent item into our buffer only if the item intersects the cloning range. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:35 -08:00
Liu Bo	1a4319cc3c	Btrfs: fix extent state leak on transaction abortion When transaction is aborted, we fail to commit transaction, instead we do cleanup work. After that when we umount btrfs, we get to free fs roots' log trees respectively, but that happens after we unpin extents, so those extents pinned by freeing log trees will remain in memory and lead to the leak. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:35 -08:00
Qu Wenruo	078025347c	btrfs: Cleanup the btrfs_parse_options for remount. Since remount will pending the new mount options to the original mount options, which will make btrfs_parse_options check the old options then new options, causing some stupid output like "enabling XXX" following by "disable XXX". This patch will add extra check before every btrfs_info to skip the output from old options checking. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:34 -08:00
Qu Wenruo	3818aea275	btrfs: Add noinode_cache mount option Add noinode_cache mount option for btrfs. Since inode map cache involves all the btrfs_find_free_ino/return_ino things and if just trigger the mount_opt, an inode number get from inode map cache will not returned to inode map cache. To keep the find and return inode both in the same behavior, a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea. CHANGE_INODE_CACHE is set/cleared in remounting, and the original INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a success transaction. Since find/return inode is all done between btrfs_start_transaction and btrfs_commit_transaction, this will keep consistent behavior. Also noinode_cache mount option will not stop the caching_kthread. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:33 -08:00
Wang Shilong	ade2e0b3ee	Btrfs: fix to search previous metadata extent item since skinny metadata There is a bug that using btrfs_previous_item() to search metadata extent item. This is because in btrfs_previous_item(), we need type match, however, since skinny metada was introduced by josef, we may mix this two types. So just use btrfs_previous_item() is not working right. To keep btrfs_previous_item() like normal tree search, i introduce another function btrfs_previous_extent_item(). Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:33 -08:00
Wang Shilong	7c76edb77c	Btrfs: fix missing skinny metadata check in scrub_stripe() Check if we support skinny metadata firstly and fix to use right type to search. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:32 -08:00
Filipe David Borba Manana	28e5dd8f35	Btrfs: fix send to not send non-aligned clone operations It is possible for the send feature to send clone operations that request a cloning range (offset + length) that is not aligned with the block size. This makes the btrfs receive command send issue a clone ioctl call that will fail, as the ioctl will return an -EINVAL error because of the unaligned range. Fix this by not sending clone operations for non block aligned ranges, and instead send regular write operation for these (less common) cases. The following xfstest reproduces this issue, which fails on the second btrfs receive command without this change: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=`mktemp -d` status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -fr $tmp } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _need_to_be_root rm -f $seqres.full _scratch_mkfs >/dev/null 2>&1 _scratch_mount $XFS_IO_PROG -f -c "truncate 819200" $SCRATCH_MNT/foo \| _filter_xfs_io $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT \| _filter_scratch $XFS_IO_PROG -c "falloc -k 819200 667648" $SCRATCH_MNT/foo \| _filter_xfs_io $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT \| _filter_scratch $XFS_IO_PROG -f -c "pwrite 1482752 2978" $SCRATCH_MNT/foo \| _filter_xfs_io $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT \| _filter_scratch $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 \| \ _filter_scratch $XFS_IO_PROG -f -c "truncate 883305" $SCRATCH_MNT/foo \| _filter_xfs_io $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT \| _filter_scratch $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 \| \ _filter_scratch $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap 2>&1 \| _filter_scratch $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \ -f $tmp/2.snap 2>&1 \| _filter_scratch md5sum $SCRATCH_MNT/foo \| _filter_scratch md5sum $SCRATCH_MNT/mysnap1/foo \| _filter_scratch md5sum $SCRATCH_MNT/mysnap2/foo \| _filter_scratch _scratch_unmount _check_btrfs_filesystem $SCRATCH_DEV _scratch_mkfs >/dev/null 2>&1 _scratch_mount $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap md5sum $SCRATCH_MNT/mysnap1/foo \| _filter_scratch $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap md5sum $SCRATCH_MNT/mysnap2/foo \| _filter_scratch _scratch_unmount _check_btrfs_filesystem $SCRATCH_DEV status=0 exit The tests expected output is: QA output created by 025 FSSync 'SCRATCH_MNT' FSSync 'SCRATCH_MNT' wrote 2978/2978 bytes at offset 1482752 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) FSSync 'SCRATCH_MNT' Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1' FSSync 'SCRATCH_MNT' Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2' At subvol SCRATCH_MNT/mysnap1 At subvol SCRATCH_MNT/mysnap2 129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/foo 42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo 129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo At subvol mysnap1 42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo At snapshot mysnap2 129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:32 -08:00
Filipe David Borba Manana	14a958e678	Btrfs: fix btrfs boot when compiled as built-in After the change titled "Btrfs: add support for inode properties", if btrfs was built-in the kernel (i.e. not as a module), it would cause a kernel panic, as reported recently by Fengguang: [ 2.024722] BUG: unable to handle kernel NULL pointer dereference at (null) [ 2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b [ 2.028684] PGD 0 [ 2.028684] Oops: 0000 [#1] SMP [ 2.028684] Modules linked in: [ 2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1 [ 2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000 [ 2.028684] RIP: 0010:[<ffffffff81501594>] [<ffffffff81501594>] crc32c+0xc/0x6b [ 2.028684] RSP: 0000:ffff88000edd7e58 EFLAGS: 00010246 [ 2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000 [ 2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe [ 2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20 [ 2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff [ 2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000 [ 2.028684] FS: 0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000 [ 2.028684] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0 [ 2.028684] Stack: [ 2.028684] ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05 [ 2.028684] 0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05 [ 2.028684] ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404 [ 2.028684] Call Trace: [ 2.028684] [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96 [ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145 [ 2.028684] [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0 [ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145 [ 2.028684] [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a [ 2.028684] [<ffffffff810e2404>] ? parse_args+0x25f/0x33d [ 2.028684] [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230 [ 2.028684] [<ffffffff8234c785>] ? do_early_param+0x88/0x88 [ 2.028684] [<ffffffff819f61b5>] ? rest_init+0x89/0x89 [ 2.028684] [<ffffffff819f61c3>] kernel_init+0xe/0x109 The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs) started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as part of the properties initialization routine), the libcrc32c is not yet initialized, so crc32c derreferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel panic on boot. The approach to fix this is to use crypto component directly to use its crc32c (which is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is doing as well, it uses crypto directly to get crc32c functionality. Verified this works both when btrfs is built-in and when it's loadable kernel module. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:31 -08:00
Filipe David Borba Manana	c57c2b3ed2	Btrfs: unlock inodes in correct order in clone ioctl In the clone ioctl, when the source and target inodes are different, we can acquire their mutexes in 2 possible different orders. After we're done cloning, we were releasing the mutexes always in the same order - the most correct way of doing it is to release them by the reverse order they were acquired. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:30 -08:00
Wang Shilong	f499e40fd9	Btrfs: optimize to remove unnecessary removal with ulist reallocation Here we are not going to free memory, no need to remove every node one by one, just init root node here is ok. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:30 -08:00
Liu Bo	de6e820066	Btrfs: release subvolume's block_rsv before transaction commit We don't have to keep subvolume's block_rsv during transaction commit, and within transaction commit, we may also need the free space reclaimed from this block_rsv to process delayed refs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:29 -08:00
Miao Xie	f1de968376	Btrfs: fix the race between write back and nocow buffered write When we ran the 274th case of xfstests with nodatacow mount option, We met the following warning message: WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0 It is caused by the race between the write back and nocow buffered write: Task1 Task2 __btrfs_buffered_write() skip data reservation reserve the metadata space copy the data dirty the pages unlock the pages write back the pages release the data space becasue there is no noreserve flag set the noreserve flag This patch fixes this problem by unlocking the pages after the noreserve flag is set. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:28 -08:00
Josef Bacik	7ef81ac86c	Btrfs: only process as many file extents as there are refs The backref walking code will search down to the key it is looking for and then proceed to walk _all_ of the extents on the file until it hits the end. This is suboptimal with large files, we only need to look for as many extents as we have references for that inode. I have a testcase that creates a randomly written 4 gig file and before this patch it took 6min 30sec to do the initial send, with this patch it takes 2min 30sec to do the intial send. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:28 -08:00
Josef Bacik	3a6d75e846	Btrfs: fix qgroup rescan to work with skinny metadata Could have sworn I fixed this before but apparently not. This makes us pass btrfs/022 with skinny metadata enabled. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:27 -08:00
Josef Bacik	580f0a678e	Btrfs: fix extent_from_logical to deal with skinny metadata I don't think this is an issue and I've not seen it in practice but extent_from_logical will fail to find a skinny extent because it uses btrfs_previous_item and gives it the normal extent item type. This is just not a place to use btrfs_previous_item since we care about either normal extents or skinny extents, so open code btrfs_previous_item to properly check. This would only affect metadata and the only place this is used for metadata is scrub and I'm pretty sure it's just for printing stuff out, not actually doing any work so hopefully it was never a problem other than a cosmetic one. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:27 -08:00
Josef Bacik	0a2b2a844a	Btrfs: throttle delayed refs better On one of our gluster clusters we noticed some pretty big lag spikes. This turned out to be because our transaction commit was taking like 3 minutes to complete. This is because we have like 30 gigs of metadata, so our global reserve would end up being the max which is like 512 mb. So our throttling code would allow a ridiculous amount of delayed refs to build up and then they'd all get run at transaction commit time, and for a cold mounted file system that could take up to 3 minutes to run. So fix the throttling to be based on both the size of the global reserve and how long it takes us to run delayed refs. This patch tracks the time it takes to run delayed refs and then only allows 1 seconds worth of outstanding delayed refs at a time. This way it will auto-tune itself from cold cache up to when everything is in memory and it no longer has to go to disk. This makes our transaction commits take much less time to run. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:26 -08:00
Josef Bacik	d7df2c796d	Btrfs: attach delayed ref updates to delayed ref heads Currently we have two rb-trees, one for delayed ref heads and one for all of the delayed refs, including the delayed ref heads. When we process the delayed refs we have to hold onto the delayed ref lock for all of the selecting and merging and such, which results in quite a bit of lock contention. This was solved by having a waitqueue and only one flusher at a time, however this hurts if we get a lot of delayed refs queued up. So instead just have an rb tree for the delayed ref heads, and then attach the delayed ref updates to an rb tree that is per delayed ref head. Then we only need to take the delayed ref lock when adding new delayed refs and when selecting a delayed ref head to process, all the rest of the time we deal with a per delayed ref head lock which will be much less contentious. The locking rules for this get a little more complicated since we have to lock up to 3 things to properly process delayed refs, but I will address that problem later. For now this passes all of xfstests and my overnight stress tests. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:25 -08:00
Josef Bacik	5039eddc19	Btrfs: make fsync latency less sucky Looking into some performance related issues with large amounts of metadata revealed that we can have some pretty huge swings in fsync() performance. If we have a lot of delayed refs backed up (as you will tend to do with lots of metadata) fsync() will wander off and try to run some of those delayed refs which can result in reading from disk and such. Since the actual act of fsync() doesn't create any delayed refs there is no need to make it throttle on delayed ref stuff, that will be handled by other people. With this patch we get much smoother fsync performance with large amounts of metadata. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:25 -08:00
Filipe David Borba Manana	63541927c8	Btrfs: add support for inode properties This change adds infrastructure to allow for generic properties for inodes. Properties are name/value pairs that can be associated with inodes for different purposes. They are stored as xattrs with the prefix "btrfs." Properties can be inherited - this means when a directory inode has inheritable properties set, these are added to new inodes created under that directory. Further, subvolumes can also have properties associated with them, and they can be inherited from their parent subvolume. Naturally, directory properties have priority over subvolume properties (in practice a subvolume property is just a regular property associated with the root inode, objectid 256, of the subvolume's fs tree). This change also adds one specific property implementation, named "compression", whose values can be "lzo" or "zlib" and it's an inheritable property. The corresponding changes to btrfs-progs were also implemented. A patch with xfstests for this feature will follow once there's agreement on this change/feature. Further, the script at the bottom of this commit message was used to do some benchmarks to measure any performance penalties of this feature. Basically the tests correspond to: Test 1 - create a filesystem and mount it with compress-force=lzo, then sequentially create N files of 64Kb each, measure how long it took to create the files, unmount the filesystem, mount the filesystem and perform an 'ls -lha' against the test directory holding the N files, and report the time the command took. Test 2 - create a filesystem and don't use any compression option when mounting it - instead set the compression property of the subvolume's root to 'lzo'. Then create N files of 64Kb, and report the time it took. The unmount the filesystem, mount it again and perform an 'ls -lha' like in the former test. This means every single file ends up with a property (xattr) associated to it. Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the compression property, have no real effect other than adding more work when inheriting properties and taking more btree leaf space. Test 4 - same as test 3 but with 10 properties per file. Results (in seconds, and averages of 5 runs each), for different N numbers of files follow. * Without properties (test 1) file creation time ls -lha time 10 000 files 3.49 0.76 100 000 files 47.19 8.37 1 000 000 files 518.51 107.06 * With 1 property (compression property set to lzo - test 2) file creation time ls -lha time 10 000 files 3.63 0.93 100 000 files 48.56 9.74 1 000 000 files 537.72 125.11 * With 4 properties (test 3) file creation time ls -lha time 10 000 files 3.94 1.20 100 000 files 52.14 11.48 1 000 000 files 572.70 142.13 * With 10 properties (test 4) file creation time ls -lha time 10 000 files 4.61 1.35 100 000 files 58.86 13.83 1 000 000 files 656.01 177.61 The increased latencies with properties are essencialy because of: ) When creating an inode, we now synchronously write 1 more item (an xattr item) for each property inherited from the parent dir (or subvolume). This could be done in an asynchronous way such as we do for dir intex items (delayed-inode.c), which could help reduce the file creation latency; ) With properties, we now have larger fs trees. For this particular test each xattr item uses 75 bytes of leaf space in the fs tree. This could be less by using a new item for xattr items, instead of the current btrfs_dir_item, since we could cut the 'location' and 'type' fields (saving 18 bytes) and maybe 'transid' too (saving a total of 26 bytes per xattr item) from the btrfs_dir_item type. Also tried batching the xattr insertions (ignoring proper hash collision handling, since it didn't exist) when creating files that inherit properties from their parent inode/subvolume, but the end results were (surprisingly) essentially the same. Test script: $ cat test.pl #!/usr/bin/perl -w use strict; use Time::HiRes qw(time); use constant NUM_FILES => 10_000; use constant FILE_SIZES => (64 * 1024); use constant DEV => '/dev/sdb4'; use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev'; use constant TEST_DIR => (MNT_POINT . '/testdir'); system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!"; # following line for testing without properties #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!"; # following 2 lines for testing with properties system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!"; system("mkdir", TEST_DIR) == 0 or die "mkdir failed!"; my ($t1, $t2); $t1 = time(); for (my $i = 1; $i <= NUM_FILES; $i++) { my $p = TEST_DIR . '/file_' . $i; open(my $f, '>', $p) or die "Error opening file!"; $f->autoflush(1); for (my $j = 0; $j < FILE_SIZES; $j += 4096) { print $f ('A' x 4096) or die "Error writing to file!"; } close($f); } $t2 = time(); print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; $t1 = time(); system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!"; $t2 = time(); print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:24 -08:00
Filipe David Borba Manana	1acae57b16	Btrfs: faster file extent item replace operations When writing to a file we drop existing file extent items that cover the write range and then add a new file extent item that represents that write range. Before this change we were doing a tree lookup to remove the file extent items, and then after we did another tree lookup to insert the new file extent item. Most of the time all the file extent items we need to drop are located within a single leaf - this is the leaf where our new file extent item ends up at. Therefore, in this common case just combine these 2 operations into a single one. By avoiding the second btree navigation for insertion of the new file extent item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf COW operations, CPU time on btree node/leaf key binary searches, etc. Besides for file writes, this is an operation that happens for file fsync's as well. However log btrees are much less likely to big as big as regular fs btrees, therefore the impact of this change is smaller. The following benchmark was performed against an SSD drive and a HDD drive, both for random and sequential writes: sysbench --test=fileio --file-num=4096 --file-total-size=8G \ --file-test-mode=[rndwr\|seqwr] --num-threads=512 \ --file-block-size=8192 \ --max-requests=1000000 \ --file-fsync-freq=0 --file-io-mode=sync [prepare\|run] All results below are averages of 10 runs of the respective test. SSD sequential writes Before this change: 225.88 Mb/sec After this change: 277.26 Mb/sec SSD random writes Before this change: 49.91 Mb/sec After this change: 56.39 Mb/sec HDD sequential writes Before this change: 68.53 Mb/sec After this change: 69.87 Mb/sec HDD random writes Before this change: 13.04 Mb/sec After this change: 14.39 Mb/sec Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:23 -08:00
Wang Shilong	90515e7f5d	Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot() We may return early in btrfs_drop_snapshot(), we shouldn't call btrfs_std_err() for this case, fix it. Cc: stable@vger.kernel.org Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:23 -08:00
Wang Shilong	8e56338d7d	Btrfs: remove unnecessary transaction commit before send We will finish orphan cleanups during snapshot, so we don't have to commit transaction here. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:22 -08:00
Wang Shilong	18f687d538	Btrfs: fix protection between send and root deletion We should gurantee that parent and clone roots can not be destroyed during send, for this we have two ideas. 1.by holding @subvol_sem, this might be a nightmare, because it will block all subvolumes deletion for a long time. 2.Miao pointed out we can reuse @send_in_progress, that mean we will skip snapshot deletion if root sending is in progress. Here we adopt the second approach since it won't block other subvolumes deletion for a long time. Besides in btrfs_clean_one_deleted_snapshot(), we only check first root , if this root is involved in send, we return directly rather than continue to check.There are several reasons about it: 1.this case happen seldomly. 2.after sending,cleaner thread can continue to drop that root. 3.make code simple Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:21 -08:00
Wang Shilong	896c14f97f	Btrfs: fix wrong send_in_progress accounting Steps to reproduce: # mkfs.btrfs -f /dev/sda8 # mount /dev/sda8 /mnt # btrfs sub snapshot -r /mnt /mnt/snap1 # btrfs sub snapshot -r /mnt /mnt/snap2 # btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1 # dmesg The problem is that we will sort clone roots(include @send_root), it might push @send_root before thus @send_root's @send_in_progress will be decreased twice. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:21 -08:00
Qu Wenruo	a88998f291	btrfs: Add treelog mount option. Add treelog mount option to enable tree log with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:20 -08:00
Qu Wenruo	d399167d88	btrfs: Add datasum mount option. Add datasum mount option to enable checksum with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:20 -08:00
Qu Wenruo	a258af7a3e	btrfs: Add datacow mount option. Add datacow mount option to enable copy-on-write with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:19 -08:00
Qu Wenruo	bd0330ad21	btrfs: Add acl mount option. Add acl mount option to enable acl with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:19 -08:00
Qu Wenruo	2c9ee85671	btrfs: Add noflushoncommit mount option. Add noflushoncommit mount option to disable flush on commit with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:18 -08:00
Qu Wenruo	5303629343	btrfs: Add noenospc_debug mount option. Add noenospc_debug mount option to disable ENOSPC debug with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:17 -08:00
Qu Wenruo	e07a2ade44	btrfs: Add nodiscard mount option. Add nodiscard mount option to disable discard with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:17 -08:00
Qu Wenruo	fc0ca9af18	btrfs: Add noautodefrag mount option. Btrfs has autodefrag mount option but no pairing noautodefrag option, which makes it impossible to disable autodefrag without umount. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:16 -08:00
Qu Wenruo	842bef5891	btrfs: Add "barrier" option to support "-o remount,barrier" Btrfs can be remounted without barrier, but there is no "barrier" option so nobody can remount btrfs back with barrier on. Only umount and mount again can re-enable barrier.(Quite awkward) Also the mount options in the document is also changed slightly for the further pairing options changes. Reported-by: Daniel Blueman <daniel@quora.org> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:16 -08:00
Wang Shilong	e8117c26b2	Btrfs: only fua the first superblock when writting supers We only intent to fua the first superblock in every device from comments, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:15 -08:00
Liu Bo	17504584f5	Btrfs: return free space to global_rsv as much as possible @full is not protected within global_rsv.lock, so we may think global_rsv is already full but in fact it's not, so we miss the opportunity to return free space to global_rsv directly when we release other block_rsvs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:14 -08:00
Wang Shilong	1708cc5723	Btrfs: fix an oops when we fail to relocate tree blocks During balance test, we hit an oops: [ 2013.841551] kernel BUG at fs/btrfs/relocation.c:1174! The problem is that if we fail to relocate tree blocks, we should update backref cache, otherwise, some pending nodes are not updated while snapshot check @cache->last_trans is within one transaction and won't update it and then oops happen. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:14 -08:00
Miao Xie	e77751aad1	Btrfs: fix the wrong nocow range check The following warning message was outputed when running the 274th case of xfstests with nodatacow option: BUG: Bad page state in process kswapd0 pfn:1c66f page:ffffea0000636848 count:0 mapcount:0 mapping:(null) index:0x78000 page flags: 0x1000000000100a(error\|uptodate\|private_2) It is because the check of nocow range was wrong, we should compare the start and end position of the extent with the write position to verify if the write position was in the extent, but the current code just used the start postion to do the check, so we got the wrong extent and told the caller that it was a nocow write. And then when we write back the dirty pages, we found we should cow the extent, but at that time, there was no space in the fs, we had to the error flag for the page. When someone reclaimed that page, the above warning outputed. Fix it. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:13 -08:00
Wang Shilong	25e293c2a2	Btrfs: fix an oops when we fail to merge reloc roots Previously, we will free reloc root memory and then force filesystem to be readonly. The problem is that there may be another thread commiting transaction which will try to access freed reloc root during merging reloc roots process. To keep consistency snapshots shared space, we should allow snapshot finished if possible, so here we don't free reloc root memory. signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:13 -08:00
Wang Shilong	dc4103f933	Btrfs: remove unused argument from select_reloc_root() @nr is no longer used, remove it from select_reloc_root() Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:12 -08:00
Filipe David Borba Manana	eb653de159	Btrfs: reduce btree node locking duration on item update If we do a btree search with the goal of updating an existing item without changing its size (ins_len == 0 and cow == 1), then we never need to hold locks on upper level nodes (even when slot == 0) after we COW their child nodes/leaves, as we won't have node splits or merges in this scenario (that is, no key additions, removals or shifts on any nodes or leaves). Therefore release the locks immediately after COWing the child nodes/leaves while navigating the btree, even if their parent slot is 0, instead of returning a path to the caller with those nodes locked, which would get released only when the caller releases or frees the path (or if it calls btrfs_unlock_up_safe). This is a common scenario, for example when updating inode items in fs trees and block group items in the extent tree. The following benchmarks were performed on a quad core machine with 32Gb of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more quickly). sysbench --test=fileio --file-num=131072 --file-total-size=8G \ --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \ --max-requests=100000 --file-io-mode=sync [prepare\|run] Before this change: 49.85Mb/s (average of 5 runs) After this change: 50.38Mb/s (average of 5 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:11 -08:00
Wenliang Fan	eb8052e015	fs/btrfs: Integer overflow in btrfs_ioctl_resize() The local variable 'new_size' comes from userspace. If a large number was passed, there would be an integer overflow in the following line: new_size = old_size + new_size; Signed-off-by: Wenliang Fan <fanwlexca@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:11 -08:00
Josef Bacik	c9ea7b24ce	Btrfs: stop caching thread if extent_commit_sem is contended We can starve out the transaction commit with a bunch of caching threads all running at the same time. This is because we will only drop the extent_commit_sem if we need_resched(), which isn't likely to happen since we will be reading a lot from the disk so have already schedule()'ed plenty. Alex observed that he could starve out a transaction commit for up to a minute with 32 caching threads all running at once. This will allow us to drop the extent_commit_sem to allow the transaction commit to swap the commit_root out and then all the cachers will start back up. Here is an explanation provided by Igno So, just to fill in what happens in this loop: mutex_unlock(&caching_ctl->mutex); cond_resched(); goto again; where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem again: again: mutex_lock(&caching_ctl->mutex); /* need to make sure the commit_root doesn't disappear */ down_read(&fs_info->extent_commit_sem); So, if I'm reading the code correct, there can be a fair amount of concurrency here: there may be multiple 'caching kthreads' per filesystem active, while there's one fs_info->extent_commit_sem per filesystem AFAICS. So, what happens if there are a lot of CPUs all busy holding the ->extent_commit_sem rwsem read-locked and a writer arrives? They'd all rush to try to release the fs_info->extent_commit_sem, and they'd block in the down_read() because there's a writer waiting. So there's a guarantee of forward progress. This should answer akpm's concern I think. Thanks, Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:10 -08:00
Miao Xie	67de11769b	Btrfs: introduce the delayed inode ref deletion for the single link inode The inode reference item is close to inode item, so we insert it simultaneously with the inode item insertion when we create a file/directory.. In fact, we also can handle the inode reference deletion by the same way. So we made this patch to introduce the delayed inode reference deletion for the single link inode(At most case, the file doesn't has hard link, so we don't take the hard link into account). This function is based on the delayed inode mechanism. After applying this patch, we can reduce the time of the file/directory deletion by ~10%. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:09 -08:00
Miao Xie	7cf35d91b4	Btrfs: use flags instead of the bool variants in delayed node Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:08 -08:00
Miao Xie	a56dbd8940	Btrfs: remove btrfs_end_transaction_dmeta() Two reasons: - btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle() so it is unnecessary. - All the delayed items should be dealt in the current transaction, so the workers should not commit the transaction, instead, deal with the delayed items as many as possible. So we can remove btrfs_end_transaction_dmeta() Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:08 -08:00
Miao Xie	0353808cae	Btrfs: cleanup code of btrfs_balance_delayed_items() - move the condition check for wait into a function - use wait_event_interruptible instead of prepare-schedule-finish process Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:07 -08:00
Miao Xie	4dd466d36a	Btrfs: don't run delayed nodes again after all nodes flush If the number of the delayed items is greater than the upper limit, we will try to flush all the delayed items. After that, it is unnecessary to run them again because they are being dealt with by the wokers or the number of them is less than the lower limit. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:06 -08:00
Miao Xie	74c40f925e	Btrfs: remove residual code in delayed inode async helper Before applying the patch commit `de3cb945db` title: Btrfs: improve the delayed inode throttling We need requeue the async work after the current work was done, it introduced a deadlock problem. So we wrote the code that this patch removes to avoid the above problem. But after applying the above patch, the deadlock problem didn't exist. So we should remove that fix code. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:06 -08:00
Frank Holton	efe120a067	Btrfs: convert printk to btrfs_ and fix BTRFS prefix Convert all applicable cases of printk and pr_* to the btrfs_* macros. Fix all uses of the BTRFS prefix. Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:05 -08:00
Filipe David Borba Manana	5de865eebb	Btrfs: fix tree mod logging While running the test btrfs/004 from xfstests in a loop, it failed about 1 time out of 20 runs in my desktop. The failure happened in the backref walking part of the test, and the test's error message was like this: btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad) --- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000 +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000 @@ -1,3 +1,8 @@ QA output created by 004 * test backref walking -* done +unexpected output from + /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1 +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got: + ... (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff) Ran: btrfs/004 Failures: btrfs/004 Failed 1 of 1 tests But immediately after the test finished, the btrfs inspect-internal command returned the expected output: $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1 inode 405 offset 454656 root 258 inode 405 offset 454656 root 5 It turned out this was because the btrfs_search_old_slot() calls performed during backref walking (backref.c:__resolve_indirect_ref) were not finding anything. The reason for this turned out to be that the tree mod logging code was not logging some node multi-step operations atomically, therefore btrfs_search_old_slot() callers iterated often over an incomplete tree that wasn't fully consistent with any tree state from the past. Besides missing items, this often (but not always) resulted in -EIO errors during old slot searches, reported in dmesg like this: [ 4299.933936] ------------[ cut here ]------------ [ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]() [ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h [ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012 [ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007 [ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70 [ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000 [ 4299.933987] Call Trace: [ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76 [ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0 [ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20 [ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs] [ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50 [ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs] [ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs] [ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs] [ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs] [ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs] [ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50 [ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs] [ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs] [ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00 [ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40 [ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0 [ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570 [ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6 [ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0 [ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13 [ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0 [ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [ 4299.934104] ---[ end trace 48f0cfc902491414 ]--- [ 4299.934378] btrfs bad fsid on block 0 These tree mod log operations that must be performed atomically, tree_mod_log_free_eb, tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to be performed atomically before the following commit: `c8cc634165` (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations) That change removed the atomicity of such operations. This patch restores the atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem structures, so it has to do the allocations using GFP_NOFS before acquiring the mod log lock. This issue has been experienced by several users recently, such as for example: http://www.spinics.net/lists/linux-btrfs/msg28574.html After running the btrfs/004 test for 679 consecutive iterations with this patch applied, I didn't ran into the issue anymore. Cc: stable@vger.kernel.org Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:04 -08:00
David Sterba	66ef7d65c3	btrfs: check balance of send_in_progress Warn if the balance goes below zero, which appears to be unlikely though. Otherwise cleans up the code a bit. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:04 -08:00
Wang Shilong	41ce9970a8	Btrfs: remove transaction from btrfs send Since daivd did the work that force us to use readonly snapshot, we can safely remove transaction protection from btrfs send. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:03 -08:00
Miao Xie	536cd96401	Btrfs: fix double initialization of the raid kobject We met the following oops when doing space balance: kobject (ffff88081b590278): tried to init an initialized object, something is seriously wrong. ... Call Trace: [<ffffffff81937262>] dump_stack+0x49/0x5f [<ffffffff8137d259>] kobject_init+0x89/0xa0 [<ffffffff8137d36a>] kobject_init_and_add+0x2a/0x70 [<ffffffffa009bd79>] ? clear_extent_bit+0x199/0x470 [btrfs] [<ffffffffa005e82c>] __link_block_group+0xfc/0x120 [btrfs] [<ffffffffa006b9db>] btrfs_make_block_group+0x24b/0x370 [btrfs] [<ffffffffa00a899b>] __btrfs_alloc_chunk+0x54b/0x7e0 [btrfs] [<ffffffffa00a8c6f>] btrfs_alloc_chunk+0x3f/0x50 [btrfs] [<ffffffffa0060123>] do_chunk_alloc+0x363/0x440 [btrfs] [<ffffffffa00633d4>] btrfs_check_data_free_space+0x104/0x310 [btrfs] [<ffffffffa0069f4d>] btrfs_write_dirty_block_groups+0x48d/0x600 [btrfs] [<ffffffffa007aad4>] commit_cowonly_roots+0x184/0x250 [btrfs] ... Steps to reproduce: # mkfs.btrfs -f <dev> # mount -o nospace_cache <dev> <mnt> # btrfs balance start <mnt> # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1 The reason of this problem is that we initialized the raid kobject when we added a block group into a empty raid list. As we know, when we mounted a btrfs filesystem, the raid list was empty, we would initialize the raid kobject when we added the first block group. But if there was not data stored in the block group, the block group would be freed when doing balance, and the raid list would be empty. And then if we allocated a new block group and added it into the raid list, we would initialize the raid kobject again, the oops happened. Fix this problem by initializing the raid kobject just when mounting the fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reported-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:03 -08:00
Wang Shilong	180589efde	Btrfs: fix a warning when iput a file See the warning below: [ 1209.102076] [<ffffffffa04721b9>] remove_extent_mapping+0x69/0x70 [btrfs] [ 1209.102084] [<ffffffffa0466b06>] btrfs_evict_inode+0x96/0x4d0 [btrfs] [ 1209.102089] [<ffffffff81073010>] ? wake_atomic_t_function+0x40/0x40 [ 1209.102092] [<ffffffff8118ab2e>] evict+0x9e/0x190 [ 1209.102094] [<ffffffff8118b313>] iput+0xf3/0x180 [ 1209.102101] [<ffffffffa0461fd1>] btrfs_run_delayed_iputs+0xb1/0xd0 [btrfs] [ 1209.102107] [<ffffffffa045d358>] __btrfs_end_transaction+0x268/0x350 [btrfs] clear extent bit here to avoid triggering WARN_ON() in remove_extent_mapping() Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:02 -08:00
David Sterba	2c68653787	btrfs: Check read-only status of roots during send All the subvolues that are involved in send must be read-only during the whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the status to read-write and the result of send stream is undefined if the data change unexpectedly. Fix that by adding a refcount for all involved roots and verify that there's no send in progress during SUBVOL_SETFLAGS ioctl call that does read-only -> read-write transition. We need refcounts because there are no restrictions on number of send parallel operations currently run on a single subvolume, be it source, parent or one of the multiple clone sources. Kernel is silent when the RO checks fail and returns EPERM. The same set of checks is done already in userspace before send starts. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:01 -08:00
David Sterba	a8d89f5ba0	btrfs: remove unused mnt from send_ctx Unused since `ed2590953b` "Btrfs: stop using vfs_read in send". Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:01 -08:00
David Sterba	95bc79d50d	btrfs: send: clean up dead code Remove ifdefed code: - tlv_put for 8, 16 and 32, add a generic tempalte if needed in future - tlv_put_timespec - the btrfs_timespec fields are used - fs_path_remove obsoleted long ago Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:00 -08:00
Filipe David Borba Manana	3fe81ce206	Btrfs: fix deadlock when iterating inode refs and running delayed inodes While running btrfs/004 from xfstests, after 503 iterations, dmesg reported a deadlock between tasks iterating inode refs and tasks running delayed inodes (during a transaction commit). It turns out that iterating inode refs implies doing one tree search and release all nodes in the path except the leaf node, and then passing that leaf node to btrfs_ref_to_path(), which in turn does another tree search without releasing the lock on the leaf node it received as parameter. This is a problem when other task wants to write to the btree as well and ends up updating the leaf that is read locked - the writer task locks the parent of the leaf and then blocks waiting for the leaf's lock to be released - at the same time, the task executing btrfs_ref_to_path() does a second tree search, without releasing the lock on the first leaf, and wants to access a leaf (the same or another one) that is a child of the same parent, resulting in a deadlock. The trace reported by lockdep follows. [84314.936373] INFO: task fsstress:11930 blocked for more than 120 seconds. [84314.936381] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [84314.936383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84314.936386] fsstress D ffff8806e1bf8000 0 11930 11926 0x00000000 [84314.936393] ffff8804d6d89b78 0000000000000046 ffff8804d6d89b18 ffffffff810bd8bd [84314.936399] ffff8806e1bf8000 ffff8804d6d89fd8 ffff8804d6d89fd8 ffff8804d6d89fd8 [84314.936405] ffff880806308000 ffff8806e1bf8000 ffff8804d6d89c08 ffff8804deb8f190 [84314.936410] Call Trace: [84314.936421] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10 [84314.936428] [<ffffffff81774269>] schedule+0x29/0x70 [84314.936451] [<ffffffffa0715bf5>] btrfs_tree_lock+0x75/0x270 [btrfs] [84314.936457] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60 [84314.936470] [<ffffffffa06ba231>] btrfs_search_slot+0x7f1/0x930 [btrfs] [84314.936489] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs] [84314.936504] [<ffffffffa06d2e1f>] btrfs_lookup_inode+0x2f/0xa0 [btrfs] [84314.936510] [<ffffffff810bd6ef>] ? trace_hardirqs_on_caller+0x1f/0x1e0 [84314.936528] [<ffffffffa073173c>] __btrfs_update_delayed_inode+0x4c/0x1d0 [btrfs] [84314.936543] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs] [84314.936558] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs] [84314.936573] [<ffffffffa0731c82>] __btrfs_run_delayed_items+0x192/0x1e0 [btrfs] [84314.936589] [<ffffffffa0731d03>] btrfs_run_delayed_items+0x13/0x20 [btrfs] [84314.936604] [<ffffffffa06dbcd4>] btrfs_flush_all_pending_stuffs+0x24/0x80 [btrfs] [84314.936620] [<ffffffffa06ddc13>] btrfs_commit_transaction+0x223/0xa20 [btrfs] [84314.936630] [<ffffffffa06ae5ae>] btrfs_sync_fs+0x6e/0x110 [btrfs] [84314.936635] [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60 [84314.936639] [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60 [84314.936643] [<ffffffff811d0b70>] sync_fs_one_sb+0x20/0x30 [84314.936648] [<ffffffff811a3541>] iterate_supers+0xf1/0x100 [84314.936652] [<ffffffff811d0c45>] sys_sync+0x55/0x90 [84314.936658] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [84314.936660] INFO: lockdep is turned off. [84314.936663] INFO: task btrfs:11955 blocked for more than 120 seconds. [84314.936666] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [84314.936668] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84314.936670] btrfs D ffff880541729a88 0 11955 11608 0x00000000 [84314.936674] ffff880541729a38 0000000000000046 ffff8805417299d8 ffffffff810bd8bd [84314.936680] ffff88075430c8a0 ffff880541729fd8 ffff880541729fd8 ffff880541729fd8 [84314.936685] ffffffff81c104e0 ffff88075430c8a0 ffff8804de8b00b8 ffff8804de8b0000 [84314.936690] Call Trace: [84314.936695] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10 [84314.936700] [<ffffffff81774269>] schedule+0x29/0x70 [84314.936717] [<ffffffffa0715815>] btrfs_tree_read_lock+0xd5/0x140 [btrfs] [84314.936721] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60 [84314.936733] [<ffffffffa06ba201>] btrfs_search_slot+0x7c1/0x930 [btrfs] [84314.936746] [<ffffffffa06bd505>] btrfs_find_item+0x55/0x160 [btrfs] [84314.936763] [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs] [84314.936780] [<ffffffffa073c9ca>] btrfs_ref_to_path+0xba/0x1e0 [btrfs] [84314.936797] [<ffffffffa06f9719>] ? release_extent_buffer+0xb9/0xe0 [btrfs] [84314.936813] [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs] [84314.936830] [<ffffffffa073cb50>] inode_to_path+0x60/0xd0 [btrfs] [84314.936846] [<ffffffffa073d365>] paths_from_inode+0x115/0x3c0 [btrfs] [84314.936851] [<ffffffff8118dd44>] ? kmem_cache_alloc_trace+0x114/0x200 [84314.936868] [<ffffffffa0714494>] btrfs_ioctl+0xf14/0x2030 [btrfs] [84314.936873] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50 [84314.936877] [<ffffffff8116598f>] ? handle_mm_fault+0x34f/0xb00 [84314.936882] [<ffffffff81075563>] ? up_read+0x23/0x40 [84314.936886] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0 [84314.936892] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570 [84314.936896] [<ffffffff81776e23>] ? error_sti+0x5/0x6 [84314.936901] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0 [84314.936906] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13 [84314.936910] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0 [84314.936915] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f [84314.936920] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [84314.936922] INFO: lockdep is turned off. [84434.866873] INFO: task btrfs-transacti:11921 blocked for more than 120 seconds. [84434.866881] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [84434.866883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84434.866886] btrfs-transacti D ffff880755b6a478 0 11921 2 0x00000000 [84434.866893] ffff8800735b9ce8 0000000000000046 ffff8800735b9c88 ffffffff810bd8bd [84434.866899] ffff8805a1b848a0 ffff8800735b9fd8 ffff8800735b9fd8 ffff8800735b9fd8 [84434.866904] ffffffff81c104e0 ffff8805a1b848a0 ffff880755b6a478 ffff8804cece78f0 [84434.866910] Call Trace: [84434.866920] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10 [84434.866927] [<ffffffff81774269>] schedule+0x29/0x70 [84434.866948] [<ffffffffa06dd2ef>] wait_current_trans.isra.33+0xbf/0x120 [btrfs] [84434.866954] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60 [84434.866970] [<ffffffffa06dec18>] start_transaction+0x388/0x5a0 [btrfs] [84434.866985] [<ffffffffa06db9b5>] ? transaction_kthread+0xb5/0x280 [btrfs] [84434.866999] [<ffffffffa06dee97>] btrfs_attach_transaction+0x17/0x20 [btrfs] [84434.867012] [<ffffffffa06dba9e>] transaction_kthread+0x19e/0x280 [btrfs] [84434.867026] [<ffffffffa06db900>] ? open_ctree+0x2260/0x2260 [btrfs] [84434.867030] [<ffffffff81070dad>] kthread+0xed/0x100 [84434.867035] [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190 [84434.867040] [<ffffffff8177ee6c>] ret_from_fork+0x7c/0xb0 [84434.867044] [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190 Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:59 -08:00
Wang Shilong	663df05330	Btrfs: remove dead comments for read_csums() Chris introduced hleper function read_csums() and this function has been removed, but we forgot to remove its corresponding comments. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:59 -08:00
Filipe David Borba Manana	e223cfcd3e	Btrfs: remove field tree_mod_seq_elem from btrfs_fs_info struct It's not used anywhere, so just drop it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:58 -08:00
Filipe David Borba Manana	fc28b62d64	Btrfs: fix use of uninitialized err variable fs/btrfs/file.c: In function ‘prepare_pages.isra.18’: fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized] Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:57 -08:00
Wang Shilong	54eb72c05f	Btrfs: remove unnecessary filemap writting and waiting after block group relocation We have commited transaction before, remove redundant filemap writting and waiting here, it can speed up balance relocation process. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:57 -08:00
Tsutomu Itoh	5662344b3c	Btrfs: fix error check of btrfs_lookup_dentry() Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT) instead. This keeps the return value convention consistent. Callers who use btrfs_lookup_dentry() require a trivial update. create_snapshot() in particular looks like it can also lose a BUG_ON(!inode) which is not really needed - there seems less harm in returning ENOENT to userspace at that point in the stack than there is to crash the machine. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:56 -08:00
Filipe David Borba Manana	7835776635	Btrfs: return immediately if tree log mod is not necessary In ctree.c:tree_mod_log_set_node_key() we were calling __tree_mod_log_insert_key() even when the modification doesn't need to be logged. This would allocate a tree_mod_elem structure, fill it and pass it to __tree_mod_log_insert(), which would just acquire the tree mod log write lock and then free the tree_mod_elem structure and return (that is, a no-op). Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert() which just returns immediately if the modification doesn't need to be logged (without allocating the structure, fill it, acquire write lock, free structure). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:56 -08:00
Josef Bacik	f28491e0a6	Btrfs: move the extent buffer radix tree into the fs_info I need to create a fake tree to test qgroups and I don't want to have to setup a fake btree_inode. The fact is we only use the radix tree for the fs_info, so everybody else who allocates an extent_io_tree is just wasting the space anyway. This patch moves the radix tree and its lock into btrfs_fs_info so there is less stuff I have to fake to do qgroup sanity tests. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:55 -08:00
Josef Bacik	34b41acec1	Btrfs: use a bit to track if we're in the radix tree For creating a dummy in-memory btree I need to be able to use the radix tree to keep track of the buffers like normal extent buffers. With dummy buffers we skip the radix tree step, and we still want to do that for the tree mod log dummy buffers but for my test buffers we need to be able to remove them from the radix tree like normal. This will give me a way to do that. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:54 -08:00
Josef Bacik	a5dee37d39	Btrfs: deal with io_tree->mapping being NULL I need to add infrastructure to allocate dummy extent buffers for running sanity tests, and to do this I need to not have to worry about having an address_mapping for an io_tree, so just fix up the places where we assume that all io_tree's have a non-NULL ->mapping. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:54 -08:00
Filipe David Borba Manana	2ef1fed285	Btrfs: more efficient push_leaf_right Currently when finding the leaf to insert a key into a btree, if the leaf doesn't have enough space to store the item we attempt to move off some items from our leaf to its right neighbor leaf, and if this fails to create enough free space in our leaf, we try to move off more items to the left neighbor leaf as well. When trying to move off items to the right neighbor leaf, if it has enough room to store the new key but not not enough room to move off at least one item from our target leaf, __push_leaf_right returns 1 and we have to attempt to move items to the left neighbor (push_leaf_left function) without touching the right neighbor leaf. For the case where the right leaf has enough room to store at least 1 item from our leaf, we end up modifying (and dirtying) both our leaf and the right leaf. This is non-optimal for the case where the new key is greater than any key in our target leaf because it can be inserted at slot 0 of the right neighbor leaf and we don't need to touch our leaf at all nor to attempt to move off items to the left neighbor leaf. Therefore this change just selects the right neighbor leaf as our new target leaf if it has enough room for the new key without modifying our initial target leaf - we do this only if the new key is higher than any key in the initial target leaf. While running the following test, push_leaf_right was called by split_leaf 4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this special case (right leaf has enough room and new key is higher than any key in the initial target leaf). Test: sysbench --test=fileio --file-num=512 --file-total-size=5G \ --file-test-mode=[seqwr\|rndwr] --num-threads=512 --file-block-size=8192 \ --max-requests=100000 --file-io-mode=sync [prepare\|run] Results: sequential writes Throughput before this change: 65.71Mb/sec (average of 10 runs) Throughput after this change: 66.58Mb/sec (average of 10 runs) random writes Throughput before this change: 10.75Mb/sec (average of 10 runs) Throughput after this change: 11.56Mb/sec (average of 10 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:53 -08:00
Wang Shilong	cb7ab02156	Btrfs: wrap repeated code into scrub_blocked_if_needed() Just wrap same code into one function scrub_blocked_if_needed(). This make a change that we will move waiting (@workers_pending = 0) before we can wake up commiting transaction(atomic_inc(@scrub_paused)), we must take carefully to not deadlock here. Thread 1 Thread 2 \|->btrfs_commit_transaction() \|->set trans type(COMMIT_DOING) \|->btrfs_scrub_paused()(blocked) \|->join_transaction(blocked) Move btrfs_scrub_paused() before setting trans type which means we can still join a transaction when commiting_transaction is blocked. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Suggested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:53 -08:00
Wang Shilong	3cb0929ad2	Btrfs: fix wrong super generation mismatch when scrubbing supers We came a race condition when scrubbing superblocks, the story is: In commiting transaction, we will update @last_trans_commited after writting superblocks, if scrubber start after writting superblocks and before updating @last_trans_commited, generation mismatch happens! We fix this by checking @scrub_pause_req, and we won't start a srubber until commiting transaction is finished.(after btrfs_scrub_continue() finished.) Reported-by: Sebastian Ochmann <ochmann@informatik.uni-bonn.de> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:52 -08:00
Filipe David Borba Manana	5a0f4e2c2b	Btrfs: fix pass of transid with wrong endianness in send.c fs/btrfs/send.c:2190:9: warning: incorrect type in argument 3 (different base types) fs/btrfs/send.c:2190:9: expected unsigned long long [unsigned] [usertype] value fs/btrfs/send.c:2190:9: got restricted __le64 [usertype] ctransid fs/btrfs/send.c:2195:17: warning: incorrect type in argument 3 (different base types) fs/btrfs/send.c:2195:17: expected unsigned long long [unsigned] [usertype] value fs/btrfs/send.c:2195:17: got restricted __le64 [usertype] ctransid fs/btrfs/send.c:3716:9: warning: incorrect type in argument 3 (different base types) fs/btrfs/send.c:3716:9: expected unsigned long long [unsigned] [usertype] value fs/btrfs/send.c:3716:9: got restricted __le64 [usertype] ctransid Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:51 -08:00
Filipe David Borba Manana	d527afe1e5	Btrfs: fix extent_map block_len after merging When merging an extent_map with its right neighbor, increment its block_len with the neighbor's block_len. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:51 -08:00
Michal Nazarewicz	11850392ee	btrfs: remove dead code [commit 8185554d: fix incorrect inode acl reset] introduced a dead code by adding a condition which can never be true to an else branch. The condition can never be true because it is already checked by a previous if statement which causes function to return. Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-By: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:50 -08:00
Filipe David Borba Manana	878f2d2cb3	Btrfs: fix max dir item size calculation We were accounting for sizeof(struct btrfs_item) twice, once in the data_size variable and another time in the if statement below. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:49 -08:00
Filipe David Borba Manana	12cfbad90e	Btrfs: more efficient extent state insertions Currently we do 2 traversals of an inode's extent_io_tree before inserting an extent state structure: 1 to see if a matching extent state already exists and 1 to do the insertion if the fist traversal didn't found such extent state. This change just combines those tree traversals into a single one. While running sysbench tests (random writes) I captured the number of elements in extent_io_tree trees for a while (into a procfs file backed by a seq_list from seq_file module) and got this histogram: Count: 9310 Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688 Percentiles: 90th: 20985.000; 95th: 21155.000; 99th: 21369.000 51.000 - 93.933: 693 ######## 93.933 - 172.314: 938 ########## 172.314 - 315.408: 856 ######### 315.408 - 576.646: 95 # 576.646 - 6415.830: 888 ########## 6415.830 - 11713.809: 1024 ########### 11713.809 - 21386.000: 4816 ##################################################### So traversing such trees can take some significant time that can easily be avoided. Ran the following sysbench tests, 5 times each, for sequential and random writes, and got the following results: sysbench --test=fileio --file-num=1 --file-total-size=2G \ --file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \ --max-requests=0 --max-time=60 --file-io-mode=sync sysbench --test=fileio --file-num=1 --file-total-size=2G \ --file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \ --max-requests=0 --max-time=60 --file-io-mode=sync Before this change: sequential writes: 69.28Mb/sec (average of 5 runs) random writes: 4.14Mb/sec (average of 5 runs) After this change: sequential writes: 69.91Mb/sec (average of 5 runs) random writes: 5.69Mb/sec (average of 5 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:49 -08:00
Filipe David Borba Manana	c42ac0bc95	Btrfs: add missing extent state caching calls When we didn't find a matching extent state, we inserted a new one but didn't cache it in the **cached_state parameter, which makes a subsequent call do a tree lookup to get it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:48 -08:00
Filipe David Borba Manana	32193c147f	Btrfs: faster and more efficient extent map insertion Before this change, adding an extent map to the extent map tree of an inode required 2 tree nevigations: 1) doing a tree navigation to search for an existing extent map starting at the same offset or an extent map that overlaps the extent map we want to insert; 2) Another tree navigation to add the extent map to the tree (if the former tree search didn't found anything). This change just merges these 2 steps into a single one. While running first few btrfs xfstests I had noticed these trees easily had a few hundred elements, and then with the following sysbench test it reached over 1100 elements very often. Test: sysbench --test=fileio --file-num=32 --file-total-size=10G \ --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \ --max-requests=1000000 --file-io-mode=sync [prepare\|run] (fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench prepare phase) Before this patch: run 1 - 41.894Mb/sec run 2 - 40.527Mb/sec run 3 - 40.922Mb/sec run 4 - 49.433Mb/sec run 5 - 40.959Mb/sec average - 42.75Mb/sec After this patch: run 1 - 48.036Mb/sec run 2 - 50.21Mb/sec run 3 - 50.929Mb/sec run 4 - 46.881Mb/sec run 5 - 53.192Mb/sec average - 49.85Mb/sec Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:48 -08:00
Filipe David Borba Manana	68ba990f7d	Btrfs: fix extent boundary check in bio_readpage_error Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:47 -08:00
Filipe David Borba Manana	5a4267ca20	Btrfs: try harder to avoid btree node splits When attempting to move items from our target leaf to its neighbor leaves (right and left), we only need to free data_size - free_space bytes from our leaf in order to add the new item (which has size of data_size bytes). Therefore attempt to move items to the right and left leaves if they have at least data_size - free_space bytes free, instead of data_size bytes free. After 5 runs of the following test, I got a smaller number of btree node splits overall: sysbench --test=fileio --file-num=512 --file-total-size=5G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=8192 --max-requests=100000 --file-io-mode=sync Before this change: * 6171 splits (average of 5 test runs) * 61.508Mb/sec of throughput (average of 5 test runs) After this change: * 6036 splits (average of 5 test runs) * 63.533Mb/sec of throughput (average of 5 test runs) An ideal test would not just have multiple threads/processes writing to a file (insertion of file extent items) but also do other operations that result in insertion of items with varied sizes, like file/directory creations, creation of links, symlinks, xattrs, etc. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:46 -08:00
Filipe David Borba Manana	1b8e7e45e5	Btrfs: avoid unnecessary ordered extent cache resets After an ordered extent completes, don't blindly reset the inode's ordered tree last accessed ordered extent pointer. While running the xfstests I noticed that about 29% of the time the ordered extent to which tree->last pointed was not the same as our just completed ordered extent. After that I ran the following sysbench test (after a prepare phase) and noticed that about 68% of the time tree->last pointed to a different ordered extent too. sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=rndwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --max-requests=0 run Therefore reset tree->last on ordered extent removal only if it pointed to the ordered extent we're removing from the tree. Results from 4 runs of the following test before and after applying this patch: $ sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --file-io-mode=sync prepare $ sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --file-io-mode=sync run Before this path: run 1 - 64.049Mb/sec run 2 - 63.455Mb/sec run 3 - 64.656Mb/sec run 4 - 63.833Mb/sec After this patch: run 1 - 66.149Mb/sec run 2 - 68.459Mb/sec run 3 - 66.338Mb/sec run 4 - 66.176Mb/sec With random writes (--file-test-mode=rndwr) I had huge fluctuations on the results (+- 35% easily). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:46 -08:00
Jeff Mahoney	e453d989e0	btrfs: fix leaks during sysfs teardown Filipe noticed that we were leaking the features attribute group after umount. His fix of just calling sysfs_remove_group() wasn't enough since that removes just the supported features and not the unsupported features. This patch changes the unknown feature handling to add them individually so we can skip the kmalloc and uses the same iteration to tear them down later. We also fix the error handling during mount so that we catch the failing creation of the per-super kobject, and handle proper teardown of a half-setup sysfs context. Tested properly with kmemleak enabled this time. Reported-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Tested-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:45 -08:00
Jeff Mahoney	1b8e5df6d9	btrfs: fix static checker warnings This patch fixes the following warnings: fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static? fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index)); Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:44 -08:00
Filipe David Borba Manana	131e404a2a	Btrfs: fix very slow inode eviction and fs unmount The inode eviction can be very slow, because during eviction we tell the VFS to truncate all of the inode's pages. This results in calls to btrfs_invalidatepage() which in turn does calls to lock_extent_bits() and clear_extent_bit(). These calls result in too many merges and splits of extent_state structures, which consume a lot of time and cpu when the inode has many pages. In some scenarios I have experienced umount times higher than 15 minutes, even when there's no pending IO (after a btrfs fs sync). A quick way to reproduce this issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ cd /mnt/btrfs $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ time btrfs fi sync . FSSync '.' real 0m25.457s user 0m0.000s sys 0m0.092s $ cd .. $ time umount /mnt/btrfs real 1m38.234s user 0m0.000s sys 1m25.760s The same test on ext4 runs much faster: $ mkfs.ext4 /dev/sdb3 $ mount /dev/sdb3 /mnt/ext4 $ cd /mnt/ext4 $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ sync $ cd .. $ time umount /mnt/ext4 real 0m3.626s user 0m0.004s sys 0m3.012s After this patch, the unmount (inode evictions) is much faster: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ cd /mnt/btrfs $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ time btrfs fi sync . FSSync '.' real 0m26.774s user 0m0.000s sys 0m0.084s $ cd .. $ time umount /mnt/btrfs real 0m1.811s user 0m0.000s sys 0m1.564s Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:44 -08:00
Wang Shilong	0647bf564f	Btrfs: improve forever loop when doing balance relocation We hit a forever loop when doing balance relocation,the reason is that we firstly reserve 4M(node size is 16k).and within transaction we will try to add extra reservation for snapshot roots,this will return -EAGAIN if there has been a thread flushing space to reserve space.We will do this again and again with filesystem becoming nearly full. If the above '-EAGAIN' case happens, we try to refill reservation more outsize of transaction, and this will return eariler in enospc case,however, this dosen't really hurt because it makes no sense doing balance relocation with the filesystem nearly full. Miao Xie helped a lot to track this issue, thanks. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:43 -08:00
Filipe David Borba Manana	6126e3caf7	Btrfs: fix ordered extent check in btrfs_punch_hole If the ordered extent's last byte was 1 less than our region's start byte, we would unnecessarily wait for the completion of that ordered extent, because it doesn't intersect our target range. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:42 -08:00
Miao Xie	376cc685cb	Btrfs: fix the reserved space leak caused by the race between nonlock dio and buffered io When we ran sysbench on the fs with compression, the following WARN_ONs were triggered: fs/btrfs/inode.c:7829 WARN_ON(BTRFS_I(inode)->outstanding_extents); fs/btrfs/inode.c:7830 WARN_ON(BTRFS_I(inode)->reserved_extents); fs/btrfs/inode.c:7832 WARN_ON(BTRFS_I(inode)->csum_bytes); Steps to reproduce: # mkfs.btrfs -f <dev> # mount -o compress <dev> <mnt> # cd <mnt> # sysbench --test=fileio --num-threads=8 --file-total-size=8G \ > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \ > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \ > --file-test-mode=sync prepare # cd - # umount <mnt> # mount -o compress <dev> <mnt> # cd <mnt> # sysbench --test=fileio --num-threads=8 --file-total-size=8G \ > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \ > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \ > --file-test-mode=sync run # cd - # umount <mnt> The reason of this problem is: Task0 Task1 btrfs_direct_IO unlock(&inode->i_mutex) lock(&inode->i_mutex) reserve_space() prepare_pages() lock_extent() clear_extent() unlock_extent() lock_extent() test_extent(uptodate) return false copy_data() set_delalloc_extent() extent need compress go back to buffered write clear_extent(DELALLOC \| DIRTY) unlock_extent() Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which was set by task1, it made the dirty pages in that extents couldn't be flushed into the disk, so the reserved space for that extent was not released at the end. This patch fixes the above bug by unlocking the extent after the delalloc. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:42 -08:00
Miao Xie	b37392ea86	Btrfs: cleanup unnecessary parameter and variant of prepare_pages() - the caller has gotten the inode object, needn't pass the file object. And if so, we needn't define a inode pointer variant. - the position should be aligned by the page size not sector size, so we also needn't pass the root object into prepare_pages(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:41 -08:00

... 2 3 4 5 6 ...

35215 Коммитов