Pull vfs pile 2 (of many) from Al Viro:
"Mostly Miklos' series this time"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
constify dcache.c inlined helpers where possible
fuse: drop dentry on failed revalidate
fuse: clean up return in fuse_dentry_revalidate()
fuse: use d_materialise_unique()
sysfs: use check_submounts_and_drop()
nfs: use check_submounts_and_drop()
gfs2: use check_submounts_and_drop()
afs: use check_submounts_and_drop()
vfs: check unlinked ancestors before mount
vfs: check submounts and drop atomically
vfs: add d_walk()
vfs: restructure d_genocide()
Pull trivial tree from Jiri Kosina:
"The usual trivial updates all over the tree -- mostly typo fixes and
documentation updates"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
doc: Documentation/cputopology.txt fix typo
treewide: Convert retrun typos to return
Fix comment typo for init_cma_reserved_pageblock
Documentation/trace: Correcting and extending tracepoint documentation
mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
power: Documentation: Update s2ram link
doc: fix a typo in Documentation/00-INDEX
Documentation/printk-formats.txt: No casts needed for u64/s64
doc: Fix typo "is is" in Documentations
treewide: Fix printks with 0x%#
zram: doc fixes
Documentation/kmemcheck: update kmemcheck documentation
doc: documentation/hwspinlock.txt fix typo
PM / Hibernate: add section for resume options
doc: filesystems : Fix typo in Documentations/filesystems
scsi/megaraid fixed several typos in comments
ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
page_isolation: Fix a comment typo in test_pages_isolated()
doc: fix a typo about irq affinity
...
Do have_submounts(), shrink_dcache_parent() and d_drop() atomically.
check_submounts_and_drop() can deal with negative dentries and
non-directories as well.
Non-directories can also be mounted on. And just like directories we don't
want these to disappear with invalidation.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
it actually increased the file size. If gfs2_fsync was called without
I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
returning, so any metadata or journaled data changes were not getting
fsynced. This meant that writes to the middle of files were not always
getting fsynced properly.
This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
updated during a write. It also make gfs2_sync flush the incore log
if I_DIRTY_PAGES is set, and the file is using data journalling. This
will make sure that all incore logged data gets written to disk before
returning from a fsync.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch checks for the first mounter being a specator. If so, it
makes sure all the journals are clean. If there's a dirty journal,
the mount fails.
Testing results:
# insmod gfs2.ko
# mount -tgfs2 -o spectator /dev/sasdrives/scratch /mnt/gfs2
mount: permission denied
# dmesg | tail -2
[ 3390.655996] GFS2: fsid=MUSKETEER:home: Now mounting FS...
[ 3390.841336] GFS2: fsid=MUSKETEER:home.s: jid=0: Journal is dirty, so the first mounter must not be a spectator.
# mount -tgfs2 /dev/sasdrives/scratch /mnt/gfs2
# umount /mnt/gfs2
# mount -tgfs2 -o spectator /dev/sasdrives/scratch /mnt/gfs2
# ls /mnt/gfs2|wc -l
352
# umount /mnt/gfs2
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Function test_and_clear_bit implies a memory barrier, so subsequent
memory barriers are unnecessary.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The writepages function was recently merged between writeback
and ordered mode. This completes the change by doing the same
with writepage. The remaining differences in writepage were
left over from some earlier time and not actually doing anything
useful.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.
Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
We need to check the glock ref counter in a race free way
in order to ensure that the gfs2_glock_hold() call will
succeed. The easiest way to do that is to simply take the
reference count early in the common code of examine_bucket,
skipping any glocks with zero ref count.
That means that the examiner functions all need to put their
reference on the glock once they've performed their function.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
Tested-by: David Teigland <teigland@redhat.com>
Since gfs2_sync_meta() is only called from a single file, lets move
it to lops.c where it is used, and mark it static. At the same
time, we can clean up the meta_io.h header too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since the introduction of atomic_open, gfs2_getxattr can be
called with the glock already held, so we need to allow for
this.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
Tested-by: David Teigland <teigland@redhat.com>
alloc_workqueue() returns a NULL on error, it doesn't return an ERR_PTR.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When run during fsync, a gfs2_log_flush could happen between the
time when gfs2_ail_flush checked the number of blocks to revoke,
and when it actually started the transaction to do those revokes.
This occassionally caused it to need more revokes than it reserved,
causing gfs2 to crash.
Instead of just reserving enough revokes to handle the blocks that
currently need them, this patch makes gfs2_ail_flush reserve the
maximum number of revokes it can, without increasing the total number
of reserved log blocks. This patch also passes the number of reserved
revokes to __gfs2_ail_flush() so that it doesn't go over its limit
and cause a crash like we're seeing. Non-fsync calls to __gfs2_ail_flush
will still cause a BUG() necessary revokes are skipped.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.
This patch doesn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
PTR_RET should be PTR_ERR
Reported-by: Sachin Kamat <sachin.kamat@linaro.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull second set of VFS changes from Al Viro:
"Assorted f_pos race fixes, making do_splice_direct() safe to call with
i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
->d_hash/->d_compare calling conventions changes from Linus, misc
stuff all over the place."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
Document ->tmpfile()
ext4: ->tmpfile() support
vfs: export lseek_execute() to modules
lseek_execute() doesn't need an inode passed to it
block_dev: switch to fixed_size_llseek()
cpqphp_sysfs: switch to fixed_size_llseek()
tile-srom: switch to fixed_size_llseek()
proc_powerpc: switch to fixed_size_llseek()
ubi/cdev: switch to fixed_size_llseek()
pci/proc: switch to fixed_size_llseek()
isapnp: switch to fixed_size_llseek()
lpfc: switch to fixed_size_llseek()
locks: give the blocked_hash its own spinlock
locks: add a new "lm_owner_key" lock operation
locks: turn the blocked_list into a hashtable
locks: convert fl_link to a hlist_node
locks: avoid taking global lock if possible when waking up blocked waiters
locks: protect most of the file_lock handling with i_lock
locks: encapsulate the fl_link list handling
locks: make "added" in __posix_lock_file a bool
...
Here's the big driver core merge for 3.11-rc1
Lots of little things, and larger firmware subsystem updates, all
described in the shortlog. Nice thing here is that we finally get rid
of CONFIG_HOTPLUG, after 10+ years, thanks to Stephen Rohtwell (it had
been always on for a number of kernel releases, now it's just removed.)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlHRsGMACgkQMUfUDdst+ylIIACfW8lLxOPVK+iYG699TWEBAkp0
LFEAnjlpAMJ1JnoZCuWDZObNCev93zGB
=020+
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here's the big driver core merge for 3.11-rc1
Lots of little things, and larger firmware subsystem updates, all
described in the shortlog. Nice thing here is that we finally get rid
of CONFIG_HOTPLUG, after 10+ years, thanks to Stephen Rohtwell (it had
been always on for a number of kernel releases, now it's just
removed)"
* tag 'driver-core-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (27 commits)
driver core: device.h: fix doc compilation warnings
firmware loader: fix another compile warning with PM_SLEEP unset
build some drivers only when compile-testing
firmware loader: fix compile warning with PM_SLEEP set
kobject: sanitize argument for format string
sysfs_notify is only possible on file attributes
firmware loader: simplify holding module for request_firmware
firmware loader: don't export cache_firmware and uncache_firmware
drivers/base: Use attribute groups to create sysfs memory files
firmware loader: fix compile warning
firmware loader: fix build failure with !CONFIG_FW_LOADER_USER_HELPER
Documentation: Updated broken link in HOWTO
Finally eradicate CONFIG_HOTPLUG
driver core: firmware loader: kill FW_ACTION_NOHOTPLUG requests before suspend
driver core: firmware loader: don't cache FW_ACTION_NOHOTPLUG firmware
Documentation: Tidy up some drivers/base/core.c kerneldoc content.
platform_device: use a macro instead of platform_driver_register
firmware: move EXPORT_SYMBOL annotations
firmware: Avoid deadlock of usermodehelper lock at shutdown
dell_rbu: Select CONFIG_FW_LOADER_USER_HELPER explicitly
...
Pull GFS2 updates from Steven Whitehouse:
"There are a few bug fixes for various, mostly very minor corner cases,
plus some interesting new features.
The new features include atomic_open whose main benefit will be the
reduction in locking overhead in case of combined lookup/create and
open operations, sorting the log buffer lists by block number to
improve the efficiency of AIL writeback, and aggressively issuing
revokes in gfs2_log_flush to reduce overhead when dropping glocks."
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Reserve journal space for quota change in do_grow
GFS2: Fix fstrim boundary conditions
GFS2: fix warning message
GFS2: aggressively issue revokes in gfs2_log_flush
GFS2: fix regression in dir_double_exhash
GFS2: Add atomic_open support
GFS2: Only do one directory search on create
GFS2: fix error propagation in init_threads()
GFS2: Remove no-op wrapper function
GFS2: Cocci spatch "ptr_ret.spatch"
GFS2: Eliminate gfs2_rg_lops
GFS2: Sort buffer lists by inplace block number
category, of note is a fix for on-line resizing file systems where the
block size is smaller than the page size (i.e., file systems 1k blocks
on x86, or more interestingly file systems with 4k blocks on Power or
ia64 systems.)
In the cleanup category, the ext4's punch hole implementation was
significantly improved by Lukas Czerner, and now supports bigalloc
file systems. In addition, Jan Kara significantly cleaned up the
write submission code path. We also improved error checking and added
a few sanity checks.
In the optimizations category, two major optimizations deserve
mention. The first is that ext4_writepages() is now used for
nodelalloc and ext3 compatibility mode. This allows writes to be
submitted much more efficiently as a single bio request, instead of
being sent as individual 4k writes into the block layer (which then
relied on the elevator code to coalesce the requests in the block
queue). Secondly, the extent cache shrink mechanism, which was
introduce in 3.9, no longer has a scalability bottleneck caused by the
i_es_lru spinlock. Other optimizations include some changes to reduce
CPU usage and to avoid issuing empty commits unnecessarily.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
+TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
XrddRLGlk4Hf+2o7/D7B
=SwaI
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 update from Ted Ts'o:
"Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
block size is smaller than the page size (i.e., file systems 1k blocks
on x86, or more interestingly file systems with 4k blocks on Power or
ia64 systems.)
In the cleanup category, the ext4's punch hole implementation was
significantly improved by Lukas Czerner, and now supports bigalloc
file systems. In addition, Jan Kara significantly cleaned up the
write submission code path. We also improved error checking and added
a few sanity checks.
In the optimizations category, two major optimizations deserve
mention. The first is that ext4_writepages() is now used for
nodelalloc and ext3 compatibility mode. This allows writes to be
submitted much more efficiently as a single bio request, instead of
being sent as individual 4k writes into the block layer (which then
relied on the elevator code to coalesce the requests in the block
queue). Secondly, the extent cache shrink mechanism, which was
introduce in 3.9, no longer has a scalability bottleneck caused by the
i_es_lru spinlock. Other optimizations include some changes to reduce
CPU usage and to avoid issuing empty commits unnecessarily."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
ext4: optimize starting extent in ext4_ext_rm_leaf()
jbd2: invalidate handle if jbd2_journal_restart() fails
ext4: translate flag bits to strings in tracepoints
ext4: fix up error handling for mpage_map_and_submit_extent()
jbd2: fix theoretical race in jbd2__journal_restart
ext4: only zero partial blocks in ext4_zero_partial_blocks()
ext4: check error return from ext4_write_inline_data_end()
ext4: delete unnecessary C statements
ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
jbd2: move superblock checksum calculation to jbd2_write_superblock()
ext4: pass inode pointer instead of file pointer to punch hole
ext4: improve free space calculation for inline_data
ext4: reduce object size when !CONFIG_PRINTK
ext4: improve extent cache shrink mechanism to avoid to burn CPU time
ext4: implement error handling of ext4_mb_new_preallocation()
ext4: fix corruption when online resizing a fs with 1K block size
ext4: delete unused variables
ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
jbd2: remove debug dependency on debug_fs and update Kconfig help text
jbd2: use a single printk for jbd_debug()
...
Having a global lock that protects all of this code is a clear
scalability problem. Instead of doing that, move most of the code to be
protected by the i_lock instead. The exceptions are the global lists
that the ->fl_link sits on, and the ->fl_block list.
->fl_link is what connects these structures to the
global lists, so we must ensure that we hold those locks when iterating
over or updating these lists.
Furthermore, sound deadlock detection requires that we hold the
blocked_list state steady while checking for loops. We also must ensure
that the search and update to the list are atomic.
For the checking and insertion side of the blocked_list, push the
acquisition of the global lock into __posix_lock_file and ensure that
checking and update of the blocked_list is done without dropping the
lock in between.
On the removal side, when waking up blocked lock waiters, take the
global lock before walking the blocked list and dequeue the waiters from
the global list prior to removal from the fl_block list.
With this, deadlock detection should be race free while we minimize
excessive file_lock_lock thrashing.
Finally, in order to avoid a lock inversion problem when handling
/proc/locks output we must ensure that manipulations of the fl_block
list are also protected by the file_lock_lock.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instances either don't look at it at all (the majority of cases) or
only want it to find the superblock (which can be had as dentry->d_sb).
A few cases that want more are actually safe with dentry->d_inode -
the only precaution needed is the check that it hadn't been replaced with
NULL by rmdir() or by overwriting rename(), which case should be simply
treated as cache miss.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If a GFS2 file system is mounted with quotas and a file is grown
in such a way that its free blocks for the allocation are represented
in a secondary bitmap, GFS2 ran out of blocks in the transaction.
That resulted in "fatal: assertion "tr->tr_num_buf <= tr->tr_blocks".
This patch reserves extra blocks for the quota change so the
transaction has enough space.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch correctly distinguishes two boundary conditions:
1. When the given range is entire within the unaccounted space between
two rgrps, and
2. The range begins beyond the end of the filesystem
Also fix the unit of the returned value r.len (total trimming) to be in bytes
instead of the (incorrect) 512 byte blocks
With this patch, GFS2 passes multiple iterations of all the relevant xfstests
(251, 260, 288) with different fs block sizes.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes a warning message introduced in the recent
"GFS2: aggressively issue revokes in gfs2_log_flush" patch.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch looks at all the outstanding blocks in all the transactions
on the log, and moves the completed ones to the ail2 list. Then it
issues revokes for these blocks. This will hopefully speed things up
in situations where there is a lot of contention for glocks, especially
if they are acquired serially.
revoke_lo_before_commit will issue at most one log block's full of these
preemptive revokes. The amount of reserved log space that
gfs2_log_reserve() ignores has been incremented to allow for this extra
block.
This patch also consolidates the common revoke instructions into one
function, gfs2_add_revoke().
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Recent commit e8830d8 introduced a bug in function dir_double_exhash;
it was failing to set h in the fall-back case. This patch corrects it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I've restricted atomic_open to only operate on regular files, although
I still don't understand why atomic_open should not be possible also for
directories on GFS2. That can always be added in later though, if it
makes sense.
The ->atomic_open function can be passed negative dentries, which
in most cases means either ENOENT (->lookup) or a call to d_instantiate
(->create). In the GFS2 case though, we need to actually perform the
look up, since we do not know whether there has been a new inode created
on another node. The look up calls d_splice_alias which then tries to
rehash the dentry - so the solution here is to simply check for that
in d_splice_alias. The same issue is likely to affect any other cluster
filesystem implementing ->atomic_open
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields fieldses org>
Cc: Jeff Layton <jlayton@redhat.com>
Creation of a new inode requires a directory search in order to ensure
that we are not trying to create an inode with the same name as an
existing one. This was hidden away inside the create_ok() function.
In the case that there was an existing inode, and a lookup can be
substituted for a create (which is the case with regular files
when the O_EXCL flag is not in use) then we were doing a second
lookup in order to return the inode.
This patch merges these two lookups into one. This can be done by
passing a flag to gfs2_dir_search() to tell it to just return -EEXIST
in the cases where we don't actually want to look up the inode.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If kthread_run() fails, init_threads() returns
IS_ERR(p) instead of PTR_ERR(p).
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Use PTR_RET in place of open coding this function.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With recent changes to the transactions, it appears that we
are no longer using the "log ops" for resource groups. Since the
log commit code processes the array of log ops, eliminating this
should be marginally better for performance. Therefore this patch
eliminates it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch simply sort the data and metadata buffer lists by their
inplace block number. This makes gfs2_log_flush issue the inplace IO
in sequential order, which will hopefully speed up writing the IO
out to disk.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Ever since commit 45f035ab9b ("CONFIG_HOTPLUG should be always on"),
it has been basically impossible to build a kernel with CONFIG_HOTPLUG
turned off. Remove all the remaining references to it.
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch makes GFS2 immediately reclaim/delete all iopen glocks
as soon as they're dequeued. This allows deleters to get an
EXclusive lock on iopen so files are deleted properly instead of
being set as unlinked.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This version has one more correction: the vmalloc calls are replaced
by __vmalloc calls to preserve the GFP_NOFS flag.
When GFS2's directory management code allocates buffers for a
directory hash table, if it can't get the memory it needs, it
currently gives a bad return code. Rather than giving an error,
this patch allows it to use virtual memory rather than kernel
memory for the hash table. This should make it possible for
directories to function properly, even when kernel memory becomes
very fragmented.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch calls get_write_access in a few functions. This
merely increases inode->i_writecount for the duration of the function.
That will ensure that any file closes won't delete the inode's
multi-block reservation while the function is running.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch sets the log descriptor type according to whether the
journal commit is for (journaled) data or metadata. This was
recently broken when the functions to process data and metadata
log ops were combined.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Fix build errors by correcting DLM dependencies in GFS2.
Build errors happen when CONFIG_GFS2_FS_LOCKING_DLM=y and CONFIG_DLM=m:
fs/built-in.o: In function `gfs2_lock':
file.c:(.text+0xc7abd): undefined reference to `dlm_posix_get'
file.c:(.text+0xc7ad0): undefined reference to `dlm_posix_unlock'
file.c:(.text+0xc7ad9): undefined reference to `dlm_posix_lock'
fs/built-in.o: In function `gdlm_unmount':
lock_dlm.c:(.text+0xd6e5b): undefined reference to `dlm_release_lockspace'
fs/built-in.o: In function `sync_unlock':
lock_dlm.c:(.text+0xd6e9e): undefined reference to `dlm_unlock'
fs/built-in.o: In function `sync_lock':
lock_dlm.c:(.text+0xd6fb6): undefined reference to `dlm_lock'
fs/built-in.o: In function `gdlm_put_lock':
lock_dlm.c:(.text+0xd7238): undefined reference to `dlm_unlock'
fs/built-in.o: In function `gdlm_mount':
lock_dlm.c:(.text+0xd753e): undefined reference to `dlm_new_lockspace'
lock_dlm.c:(.text+0xd79d3): undefined reference to `dlm_release_lockspace'
fs/built-in.o: In function `gdlm_lock':
lock_dlm.c:(.text+0xd8179): undefined reference to `dlm_lock'
fs/built-in.o: In function `gdlm_cancel':
lock_dlm.c:(.text+0xd6b22): undefined reference to `dlm_unlock'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes the multi-block allocation code, such that
directory inodes only get a single block reserved in the bitmap.
That way, the bitmaps are more tightly packed together, and there
are fewer spans of free blocks for in-use block reservations.
This means it takes less time to find a free span of blocks in the
bitmap, which speeds things up. This increases the performance of
some workloads by almost 2X. In Nate's mockup.py script (which does
(1) create dir, (2) create dir in dir, (3) create file in that dir)
the test executes in 23 steps rather than 43 steps, a 47%
performance improvement.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes two regression problems that Abhi found in the
GFS2 quota code.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
->invalidatepage() aop now accepts range to invalidate so we can make
use of it in gfs2_invalidatepage().
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.
Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).
This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.
We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.
Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
Pull GFS2 updates from Steven Whitehouse:
"There is not a whole lot of change this time - there are some further
changes which are in the works, but those will be held over until next
time.
Here there are some clean ups to inode creation, the addition of an
origin (local or remote) indicator to glock demote requests, removal
of one of the remaining GFP_NOFAIL allocations during log flushes, one
minor clean up, and a one liner bug fix."
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Flush work queue before clearing glock hash tables
GFS2: Add origin indicator to glock demote tracing
GFS2: Add origin indicator to glock callbacks
GFS2: replace gfs2_ail structure with gfs2_trans
GFS2: Remove vestigial parameter ip from function rs_deltree
GFS2: Use gfs2_dinode_out() in the inode create path
GFS2: Remove gfs2_refresh_inode from inode creation path
GFS2: Clean up inode creation path
Use the new vsprintf extension to avoid any possible
message interleaving.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
There was a timing window when a GFS2 file system was unmounted
that caused GFS2 to call BUG() and panic the kernel. The call
to BUG() is meant to ensure that the glock reference count,
gl_ref, never gets down to zero and bounce back up again. What was
happening during umount is that function gfs2_put_super was dequeing
its glocks for well-known files. In particular, we saw it on the
journal glock, sd_jinode_gh. The dequeue caused delayed work to be
queued for the glock state machine, to transition the lock to an
"unlocked" state. While the work was still queued, gfs2_put_super
called gfs2_gl_hash_clear to clear out the glock hash tables.
If the timing was just so, the glock work function would drop the
reference count at the time when it was being checked for zero,
and that caused BUG() to be called. This patch calls
flush_workqueue before clearing the glock hash tables, thereby
ensuring that the delayed work is executed before the hash tables
are cleared, and therefore the reference count never goes to zero
until the glock is cleared.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds the origin indicator to the trace point for glock
demotion, so that it is possible to see where demote requests
have come from.
Note that requests generated from the demote_rq sysfs interface
will show as remote, since they are intended to replicate
exactly the effect of a demote reuqest from a remote node. It
is still possible to tell these apart by looking at the process
which initiated the demote request.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds a bool indicating whether the demote
request was originated locally or remotely. This is then
used by the iopen ->go_callback() to make 100% sure that
it will only respond to remote callbacks.
Since ->evict_inode() uses GL_NOCACHE when it attempts to
get an exclusive lock on the iopen lock, this may result
in extra scheduling of the workqueue in case that the
exclusive promotion request failed. This patch prevents
that from happening.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In order to allow transactions and log flushes to happen at the same
time, gfs2 needs to move the transaction accounting and active items
list code into the gfs2_trans structure. As a first step toward this,
this patch removes the gfs2_ail structure, and handles the active items
list in the gfs_trans structure. This keeps gfs2 from allocating an ail
structure on log flushes, and gives us a struture that can later be used
to store the transaction accounting outside of the gfs2 superblock
structure.
With this patch, at the end of a transaction, gfs2 will add the
gfs2_trans structure to the superblock if there is not one already.
This structure now has the active items fields that were previously in
gfs2_ail. This is not necessary in the case where the transaction was
simply used to add revokes, since these are never written outside of the
journal, and thus, don't need an active items list.
Also, in order to make sure that the transaction structure is not
removed while it's still in use by gfs2_trans_end, unlocking the
sd_log_flush_lock has to happen slightly later in ending the
transaction.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The functions that delete block reservations from the rgrp block
reservations rbtree no longer use the ip parameter. This patch
eliminates the parameter.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Over the previous two patches relating to inode creation, the
content of init_dinode() has been looking more and more like
gfs2_dinode_out(). This is not an accident! This patch replaces
the parts of init_dinode() which are duplicated in gfs2_dinode_out()
with a call to that function.
Mostly that is straightforward, but there is one issue which needed
to be resolved relating to the link count. The link count has to be
set to zero in a certain error handling code path, which lands up
calling iput(). This is now done specifically in that code path
allowing the link count to be set earlier and written into the
on disk inode by gfs2_dinode_put() in the normal way.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The original method for creating inodes used in GFS2 was to fill
out a buffer, with all the information, and then to read that
buffer into the in-core inode, using gfs2_refresh_inode()
The problem with this approach is that all the inode's fields
need to be calculated ahead of time, and were stored in various
variables making the code rather complicated.
The new approach is simply to allocate the in-core inode earlier
and fill in as many fields as possible ahead of time. These can
then be used to initilise the on disk representation. The
code has been working towards the point where it is possible
to remove gfs2_refresh_inode() because all the fields are
correctly initialised ahead of time. We've now reached that
milestone, and have reversed the order of setting up the in
core and on disk inodes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch cleans up the inode creation code path in GFS2. After the
Orlov allocator was merged, a number of potential improvements are
now possible, and this is a first set of these.
The quota handling is now updated so that it matches the point in
the code where the allocation takes place. This means that the one
exception in gfs2_alloc_blocks relating to quota is now no longer
required, and we can use the generic code everywhere.
In addition the call to figure out whether we need to allocate any
extra blocks in order to add a directory entry is moved higher up
gfs2_create_inode. This means that if it returns an error, we
can deal with that at a stage where it is easier to handle that case.
The returned status cannot change during the function since we hold
an exclusive lock on the directory.
Two calls to gfs2_rindex_update have been changed to one, again at
the top of gfs2_create_inode to simplify error handling.
The time stamps are also now initialised earlier in the creation
process, this is gradually moving towards being able to remove the
call to gfs2_refresh_inode in gfs2_inode_create once we have all the
fields covered.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes GFS2's discard issuing code so that it calls
function sb_issue_discard rather than blkdev_issue_discard. The
code was calling blkdev_issue_discard and specifying the correct
sector offset and sector size, but blkdev_issue_discard expects
these values to be in terms of 512 byte sectors, even if the native
sector size for the device is different. Calling sb_issue_discard
with the BLOCK size instead ensures the correct block-to-512b-sector
translation. I verified that "minlen" is specified in blocks, so
comparing it to a number of blocks is correct.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When withdraw occurs, we need to continue to allow unlocks of fcntl
locks to occur, however these will only be local, since the node has
withdrawn from the cluster. This prevents triggering a VFS level
bug trap due to locks remaining when a file is closed.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The error code in gfs2_rs_alloc() is set to ENOMEM when error
but never be used, instead, gfs2_rs_alloc() always return 0.
Fix to return 'error'.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Use memchr_inv to verify that the specified memory range is cleared.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
The temp lvb bitmap was on the stack, which could
be an alignment problem for __set_bit_le. Use
kmalloc for it instead.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.
A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.
Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.
Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives. Allowing simple, safe,
well understood work-arounds to known problematic software.
This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work. While writing this patch I saw a handful of such
cases. The most significant being autofs that lives in the module
autofs4.
This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.
After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module. The common pattern in the kernel is to call request_module()
without regards to the users permissions. In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted. In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
According to SUSv3:
[EACCES] Permission denied. An attempt was made to access a file in a way
forbidden by its file access permissions.
[EPERM] Operation not permitted. An attempt was made to perform an operation
limited to processes with appropriate privileges or to the owner of a file
or other resource.
So -EPERM should be returned if capability checks fails.
Strictly speaking this is an API change since the error code user sees is
altered.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch is a follow up on below patch:
[PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
commit: 216b6cbdcb
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhnacements to the user
namespace. reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.
I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.
There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.
The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.
XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before it they are ready."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...
Create a helper function to check if a backing device requires stable
page writes and, if so, performs the necessary wait. Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function. This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.
Before this patchset, all filesystems would block, regardless of whether
or not it was necessary. ext3 would wait, but still generate occasional
checksum errors. The network filesystems were left to do their own
thing, so they'd wait too.
After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it. ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait. Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.
Here's the result of using dbench to test latency on ext2:
3.8.0-rc3:
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 109347 0.028 59.817
ReadX 347180 0.004 3.391
Flush 15514 29.828 287.283
Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms
3.8.0-rc3 + patches:
WriteX 105556 0.029 4.273
ReadX 335004 0.005 4.112
Flush 14982 30.540 298.634
Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
As you can see, the maximum write latency drops considerably with this
patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reading dinodes from the disk convert uids and gids
into kuids and kgids to store in vfs data structures.
When writing to dinodes to the disk convert kuids and kgids
in the in memory structures into plain uids and gids.
For now all on disk data structures are assumed to be
stored in the initial user namespace.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Where kuid_t values are compared use uid_eq and where kgid_t values
are compared use gid_eq. This is unfortunately necessary because
of the type safety that keeps someone from accidentally mixing
kuids and kgids with other types.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Remove the QUOTA_USER and QUOTA_GRUP defines. Remove
the last vestigal users of QUOTA_USER and QUOTA_GROUP.
Now that struct kqid is used throughout the gfs2 quota
code the need there is to use QUOTA_USER and QUOTA_GROUP
and the defines are just extraneous and confusing.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- Change qd_id in struct gfs2_qutoa_data to struct kqid.
- Remove the now unnecessary QDF_USER bit field in qd_flags.
- Propopoage this change through the code generally making
things simpler along the way.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- In quota_refresh_user_store convert the user supplied uid
into a kqid and pass it to gfs2_quota_refresh.
- In quota_refresh_group_store convert the user supplied gid
into a kqid and pass it to gfs2_quota_refresh.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Both qd_alloc and qd2offset perform the exact same computation
to get an index from a gfs2_quota_data. Make life a little
simpler and factor out this index computation.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When a quota is queried return the uid or the gid in the mapped into
the caller's user namespace. In addition perform the munged version
of the mapping so that instead of -1 a value that does not map is
reported as the overflowuid or the overflowgid.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Split NO_QUOTA_CHANGE into NO_UID_QUTOA_CHANGE and NO_GID_QUTOA_CHANGE
so the constants may be well typed.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In set_dqblk it is an error to look at fdq->d_id or fdq->d_flags.
Userspace quota applications do not set these fields when calling
quotactl(Q_XSETQLIM,...), and the kernel does not set those fields
when quota_setquota calls set_dqblk.
gfs2 never looks at fdq->d_id or fdq->d_flags after checking
to see if they match the id and type supplied to set_dqblk.
No other linux filesystem in set_dqblk looks at either fdq->d_id
or fdq->d_flags.
Therefore remove these bogus checks from gfs2 and allow normal
quota setting applications to work.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patch reinstates the ack system which withdraw should be using. It
appears to have been accidentally forgotten when the lock module was
merged into GFS2, due to two different sysfs files having the same name.
Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch allocates a block reservation structure before growing
or shrinking a file. Without this structure, the grow or shink code
can reference the bad pointer.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The intent here is to split the processing of the glock lru
list into two parts, so that the selection of glocks and the
disposal are separate functions. The plan is then, that further
updates can then be made to these functions in the future
to improve the selection of glocks and also the efficiency of
glock disposal.
The new feature which this patch brings is sorting the
glocks to be disposed of into glock number (and thus also
disk block number) order. Not all glocks will need i/o in
order to dispose of them, but some will, and at least we'll
generate mostly disk block order i/o now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Instead of using a list of buffers to write ahead of the journal
flush, this now uses a list of inodes and calls ->writepages
via filemap_fdatawrite() in order to achieve the same thing. For
most use cases this results in a shorter ordered write list,
as well as much larger i/os being issued.
The ordered write list is sorted by inode number before writing
in order to retain the disk block ordering between inodes as
per the previous code.
The previous ordered write code used to conflict in its assumptions
about how to write out the disk blocks with mpage_writepages()
so that with this updated version we can also use mpage_writepages()
for GFS2's ordered write, writepages implementation. So we will
also send larger i/os from writeback too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The freeze code has not been looked at a lot recently. Upstream has
moved on, and this is an attempt to catch us back up again. There
is a vfs level interface for the freeze code which can be called
from our (obsolete, but kept for backward compatibility purposes)
sysfs freeze interface. This means freezing this way vs. doing it
from the ioctl should now work in identical fashion.
As a result of this, the freeze function is only called once
and we can drop our own special purpose code for counting the
number of freezes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The locking in gfs2_attach_bufdata() was type specific (data/meta)
which made the function rather confusing. This patch moves the core
of gfs2_attach_bufdata() into trans.c renaming it gfs2_alloc_bufdata()
and moving the locking into gfs2_trans_add_data()/gfs2_trans_add_meta()
As a result all of the locking related to adding data and metadata to
the journal is now in these two functions. This should help to clarify
what is going on, and give us some opportunities to simplify in
some cases.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch copies the body of gfs2_trans_add_bh into the two newly
added gfs2_trans_add_data and gfs2_trans_add_meta functions. We can
then move the .lo_add functions from lops.c into trans.c and call
them directly.
As a result of this, we no longer need to use the .lo_add functions
at all, so that is removed from the log operations structure.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There is little common content in gfs2_trans_add_bh() between the data
and meta classes by the time that the functions which it calls are
taken into account. The intent here is to split this into two
separate functions. Stage one is to introduce gfs2_trans_add_data()
and gfs2_trans_add_meta() and update the callers accordingly.
Later patches will then pull in the content of gfs2_trans_add_bh()
and its dependent functions in order to clean up the code in this
area.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This moves the lo_add function for revokes into trans.c, removing
a function call and making the code easier to read.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This breaks out the LRU scanning function from the shrinker in
preparation for adding other callers to the LRU scanner.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The recent commit fb6791d100
included the wrong logic. The lvbptr check was incorrectly
added after the patch was tested.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In function rg_mblk_search, it's searching for multiple blocks in
a given state (e.g. "free"). If there's an active block reservation
its goal is the next free block of that. If the resource group
contains the dinode's goal block, that's used for the search. But
if neither is the case, it uses the rgrp's last allocated block.
That way, consecutive allocations appear after one another on media.
The problem comes in when you hit the end of the rgrp; it would never
start over and search from the beginning. This became a problem,
since if you deleted all the files and data from the rgrp, it would
never start over and find free blocks. So it had to keep searching
further out on the media to allocate blocks. This patch resets the
rd_last_alloc after it does an unsuccessful search at the end of
the rgrp.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds a return code check after calling function
gfs2_rbm_from_block while determining the free extent size.
That way, when the end of an rgrp is reached, it won't try
to process unaligned blocks after the end.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
QE aio tests uncovered a race condition in gfs2_rs_alloc where it's possible
to come out of the function with a valid ip->i_res allocation but it gets
freed before use resulting in a NULL ptr dereference.
This patch envelopes the initial short-circuit check for non-NULL ip->i_res
into the mutex lock. With this patch, I was able to successfully run the
reproducer test multiple times.
Resolves: rhbz#878476
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When generating the DLM lock name, a value of 0 would skip
the loop and leave the string unchanged. This left locks with
a value of 0 unlabeled. Initializing the string to '0' fixes this.
Signed-off-by: Nathan Straz <nstraz@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
But the kernel decided to call it "origin" instead. Fix most of the
sites.
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull GFS2 updates from Steven Whitehouse:
"The main feature this time is the new Orlov allocator and the patches
leading up to it which allow us to allocate new inodes from their own
allocation context, rather than borrowing that of their parent
directory. It is this change which then allows us to choose a
different location for subdirectories when required. This works
exactly as per the ext3 implementation from the users point of view.
In addition to that, we've got a speed up in gfs2_rbm_from_block()
from Bob Peterson, three locking related improvements from Dave
Teigland plus a selection of smaller bug fixes and clean ups."
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Set gl_object during inode create
GFS2: add error check while allocating new inodes
GFS2: don't reference inode's glock during block allocation trace
GFS2: remove redundant lvb pointer
GFS2: only use lvb on glocks that need it
GFS2: skip dlm_unlock calls in unmount
GFS2: Fix one RG corner case
GFS2: Eliminate redundant buffer_head manipulation in gfs2_unlink_inode
GFS2: Use dirty_inode in gfs2_dir_add
GFS2: Fix truncation of journaled data files
GFS2: Add Orlov allocator
GFS2: Use proper allocation context for new inodes
GFS2: Add test for resource group congestion status
GFS2: Rename glops go_xmote_th to go_sync
GFS2: Speed up gfs2_rbm_from_block
GFS2: Review bug traps in glops.c
Overhaul struct address_space.assoc_mapping renaming it to
address_space.private_data and its type is redefined to void*. By this
approach we consistently name the .private_* elements from struct
address_space as well as allow extended usage for address_space
association with other data structures through ->private_data.
Also, all users of old ->assoc_mapping element are converted to reflect
its new name and type change (->private_data).
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes a cluster coherency problem that occurs when one
node creates a file, does several writes, then a different node
tries to write to the same file. When the inode's glock is demoted,
the inode wasn't synced to the media properly because the gl_object
wasn't set. Later, the flush daemon noticed the uncommitted data
and tried to flush it, only to discover the glock was no longer locked
properly in exclusive mode. That caused an assert withdraw.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds a return code check after attempting to allocate
a new inode during dinode creation.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes the block allocation trace so that it references
the rgd's glock rather than the inode's glock. Now that the order
of inode creation is switched, this prevents a reference to the
glock which may not be set yet.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The lksb struct already contains a pointer to the lvb,
so another directly from the glock struct is not needed.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Save the effort of allocating, reading and writing
the lvb for most glocks that do not use it.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When unmounting, gfs2 does a full dlm_unlock operation on every
cached lock. This can create a very large amount of work and can
take a long time to complete. However, the vast majority of these
dlm unlock operations are unnecessary because after all the unlocks
are done, gfs2 leaves the dlm lockspace, which automatically clears
the locks of the leaving node, without unlocking each one individually.
So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
remove the locks implicitly. The one exception is when the lock's lvb is
being used. In this case, dlm_unlock is called because it may update the
lvb of the resource.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
For filesystems with only a single resource group, we need to be careful
that the allocation loop will not land up with a NULL resource group. This
fixes a bug in a previous patch where the gfs2_rgrpd_get_next() function
was being used instead of gfs2_rgrpd_get_first()
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since we now have a dirty_inode that takes care of manipulating the
inode buffer and writing from the inode to the buffer, we can
eliminate some unnecessary buffer manipulations in gfs2_unlink_inode
that are now redundant.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes the gfs2_dir_add function so that it uses
the dirty_inode function (via mark_inode_dirty) rather than manually
updating the dinode.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes an issue relating to not having enough revokes
available when truncating journaled data files. In order to ensure
that we do no run out, the truncation is broken into separate pieces
if it is large enough.
Tested using fsx on a journaled data file.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Just like ext3, this works on the root directory and any directory
with the +T flag set. Also, just like ext3, any subdirectory created
in one of the just mentioned cases will be allocated to a random
resource group (GFS2 equivalent of a block group).
If you are creating a set of directories, each of which will contain a
job running on a different node, then by setting +T on the parent
directory before creating the subdirectories, each will land up in a
different resource group, and thus resource group contention between
nodes will be kept to a minimum.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Rather than using the parent directory's allocation context, this
patch allocated the new inode earlier in the process and then uses
it to contain all the information required. As a result, we can now
use the new inode's own allocation context to allocate it rather
than having to use the parent directory's context. This give us a
lot more flexibility in where the inode is placed on disk.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch uses information gathered by the recent glock statistics
patch in order to derrive a boolean verdict on the congestion
status of a resource group. This is then used when making decisions
on which resource group to choose during block allocation.
The aim is to avoid resource groups which are heavily contended
by other nodes, while still ensuring locality of access wherever
possible.
Once a reservation has been made in a particular resource group
we continue to use that resource group until a new reservation is
required. This should help to ensure that we do not change resource
groups too often.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
[Editorial: This is a nit, but has been a minor irritation for a long time:]
This patch renames glops structure item for go_xmote_th to go_sync.
The functionality is unchanged; it's just for readability.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch is a rewrite of function gfs2_rbm_from_block. Rather than
looping to find the right bitmap, the code now does a few simple
math calculations.
I compared the performance of both algorithms side by side and the new
algorithm is noticeably faster. Sample instrumentation output from a
"fast" machine:
5 million calls: millisec spent: Orig: 166 New: 113
5 million calls: millisec spent: Orig: 189 New: 114
In addition, I ran postmark (on a somewhat slowr CPU) before the after
the new algorithm was put in place and postmark showed a decent
improvement:
Before the new algorithm:
-------------------------
Time:
645 seconds total
584 seconds of transactions (171 per second)
Files:
150087 created (232 per second)
Creation alone: 100000 files (2083 per second)
Mixed with transactions: 50087 files (85 per second)
49995 read (85 per second)
49991 appended (85 per second)
150087 deleted (232 per second)
Deletion alone: 100174 files (7705 per second)
Mixed with transactions: 49913 files (85 per second)
Data:
273.42 megabytes read (434.08 kilobytes per second)
852.13 megabytes written (1.32 megabytes per second)
With the new algorithm:
-----------------------
Time:
599 seconds total
530 seconds of transactions (188 per second)
Files:
150087 created (250 per second)
Creation alone: 100000 files (1886 per second)
Mixed with transactions: 50087 files (94 per second)
49995 read (94 per second)
49991 appended (94 per second)
150087 deleted (250 per second)
Deletion alone: 100174 files (6260 per second)
Mixed with transactions: 49913 files (94 per second)
Data:
273.42 megabytes read (467.42 kilobytes per second)
852.13 megabytes written (1.42 megabytes per second)
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Two of the bug traps here could really be warnings. The others are
converted from BUG() to GLOCK_BUG_ON() since we'll most likely
need to know the glock state in order to debug any issues which
arise. As a result of this, __dump_glock has to be renamed and
is no longer static.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In gfs2_trans_add_bh(), gfs2 was testing if a there was a bd attached to the
buffer without having the gfs2_log_lock held. It was then assuming it would
stay attached for the rest of the function. However, without either the log
lock being held of the buffer locked, __gfs2_ail_flush() could detach bd at any
time. This patch moves the locking before the test. If there isn't a bd
already attached, gfs2 can safely allocate one and attach it before locking.
There is no way that the newly allocated bd could be on the ail list,
and thus no way for __gfs2_ail_flush() to detach it.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
file_accessed() was being called by gfs2_mmap() with a shared glock. If it
needed to update the atime, it was crashing because it dirtied the inode in
gfs2_dirty_inode() without holding an exclusive lock. gfs2_dirty_inode()
checked if the caller was already holding a glock, but it didn't make sure that
the glock was in the exclusive state. Now, instead of calling file_accessed()
while holding the shared lock in gfs2_mmap(), file_accessed() is called after
grabbing and releasing the glock to update the inode. If file_accessed() needs
to update the atime, it will grab an exclusive lock in gfs2_dirty_inode().
gfs2_dirty_inode() now also checks to make sure that if the calling process has
already locked the glock, it has an exclusive lock.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Currently implementation in gfs2 uses FITRIM arguments as it were in
file system blocks units which is wrong. The FITRIM arguments
(fstrim_range.start, fstrim_range.len and fstrim_range.minlen) are
actually in bytes.
Moreover, check for start argument beyond the end of file system, len
argument being smaller than file system block and minlen argument being
bigger than biggest resource group were missing.
This commit converts the code to convert FITRIM argument to file system
blocks and also adds appropriate checks mentioned above.
All the problems were recognised by xfstests 251 and 260.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When the fstrim_range argument is not provided by user in FITRIM ioctl
we should just return EFAULT and not promoting bad behaviour by filling
the structure in kernel. Let the user deal with it.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cleans up two cases where variables were assigned values but then never
used again.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Despite the return value from kmem_cache_zalloc() being checked, the
error wasn't being returned until after a possible null pointer
dereference. This patch returns the error immediately, allowing the
removal of the error variable.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Check the return value of gfs2_rs_alloc(ip) and avoid a possible null
pointer dereference.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
u64 inum = fid->raw[2];
which is unhelpfully reported as at the end of shmem_alloc_inode():
BUG: unable to handle kernel paging request at ffff880061cd3000
IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Call Trace:
[<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
[<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
[<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
[<ffffffff83a5f3f8>] tracesys+0xe1/0xe6
Right, tmpfs is being stupid to access fid->raw[2] before validating that
fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
fall at the end of a page, and the next page not be present.
But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
could oops in the same way: add the missing fh_len checks to those.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Sage Weil <sage@inktank.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().
Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.
Now device drivers can implement this method and obtain nonlinear vma support.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.
The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.
The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.
Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.
Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.
While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.
Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.
After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."
Fix up some fairly trivial conflicts in netfilter uid/git logging code.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...
Pull workqueue changes from Tejun Heo:
"This is workqueue updates for v3.7-rc1. A lot of activities this
round including considerable API and behavior cleanups.
* delayed_work combines a timer and a work item. The handling of the
timer part has always been a bit clunky leading to confusing
cancelation API with weird corner-case behaviors. delayed_work is
updated to use new IRQ safe timer and cancelation now works as
expected.
* Another deficiency of delayed_work was lack of the counterpart of
mod_timer() which led to cancel+queue combinations or open-coded
timer+work usages. mod_delayed_work[_on]() are added.
These two delayed_work changes make delayed_work provide interface
and behave like timer which is executed with process context.
* A work item could be executed concurrently on multiple CPUs, which
is rather unintuitive and made flush_work() behavior confusing and
half-broken under certain circumstances. This problem doesn't
exist for non-reentrant workqueues. While non-reentrancy check
isn't free, the overhead is incurred only when a work item bounces
across different CPUs and even in simulated pathological scenario
the overhead isn't too high.
All workqueues are made non-reentrant. This removes the
distinction between flush_[delayed_]work() and
flush_[delayed_]_work_sync(). The former is now as strong as the
latter and the specified work item is guaranteed to have finished
execution of any previous queueing on return.
* In addition to the various bug fixes, Lai redid and simplified CPU
hotplug handling significantly.
* Joonsoo introduced system_highpri_wq and used it during CPU
hotplug.
There are two merge commits - one to pull in IRQ safe timer from
tip/timers/core and the other to pull in CPU hotplug fixes from
wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."
Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.
Tejun pointed out a few of them, I fixed a couple more.
* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
workqueue: remove @delayed from cwq_dec_nr_in_flight()
workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
workqueue: use __cpuinit instead of __devinit for cpu callbacks
workqueue: rename manager_mutex to assoc_mutex
workqueue: WORKER_REBIND is no longer necessary for idle rebinding
workqueue: WORKER_REBIND is no longer necessary for busy rebinding
workqueue: reimplement idle worker rebinding
workqueue: deprecate __cancel_delayed_work()
workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
workqueue: use mod_delayed_work() instead of __cancel + queue
workqueue: use irqsafe timer for delayed_work
workqueue: clean up delayed_work initializers and add missing one
workqueue: make deferrable delayed_work initializer names consistent
workqueue: cosmetic whitespace updates for macro definitions
workqueue: deprecate system_nrt[_freezable]_wq
workqueue: deprecate flush[_delayed]_work_sync()
...
If a dirty GFS2 inode was being deleted but was in use by another node, its
metadata was not getting written out before GFS2 checked for dirty buffers in
gfs2_ail_flush(). GFS2 was relying on inode_go_sync() to write out the
metadata when the other node tried to free the file, but it failed the error
check before it got that far. This patch writes out the metadata before calling
gfs2_ail_flush()
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
gfs2_ail_empty_gl() contains an "inline version" of gfs2_trans_begin(),
so it needs an explicit sb_start_intwrite() as well, to balance the
sb_end_intwrite() which will be called by gfs2_trans_end().
With this, xfstest 068 passes on lock_nolock local gfs2.
Without it, we reach a writer count of -1 and get stuck.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes an infinite loop in gfs2_rbm_find that was introduced
by the previous patch. The problem occurred when the length was less
than 3 but the rbm block was byte-aligned, causing it to improperly
return a extent length of zero, which caused it to spin.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Tested-by: Barry Marson <bmarson@redhat.com>
With the recently added block reservation code, an additional function
was added to search for free blocks. This had a restriction of only being
able to search for aligned extents of free blocks. As a result the
allocation patterns when reserving blocks were suboptimal when the
existing allocation of blocks for an inode was not aligned to the same
boundary.
This patch resolves that problem by adding the ability for gfs2_rbm_find
to search for extents of a particular minimum size. We can then use
gfs2_rbm_find for both looking for reservations, and also looking for
free blocks on an individual basis when we actually come to do the
allocation later on. As a result we only need a single set of code
to deal with both situations.
The function gfs2_rbm_from_block() is moved up rgrp.c so that it
occurs before all of its callers.
Many thanks are due to Bob for helping track down the final issue in
this patch. That fix to the rb_tree traversal and to not share
block reservations from a dirctory to its children is included here.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
GFS2 uses i_mutex on its system quota inode to synchronize writes to
quota file. Since this is an internal inode to GFS2 (not part of directory
hiearchy or visible by user) we are safe to define locking rules for it. So
let's just get it its own locking class to make it clear.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch stops multiple block allocations if a nonzero
return code is received from gfs2_rbm_from_block. Without
this patch, if enough pressure is put on the file system,
you get a kernel warning quickly followed by:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa04f47e8>] gfs2_alloc_blocks+0x2c8/0x880 [gfs2]
With this patch, things run normally.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When rgd->rd_free_clone is less than rgd->rd_reserved, the
unclaimed_blocks() calculation would wrap and produce
incorrect results. This patch checks for this condition
when this function is called from gfs2_mblk_search()
In addition, the use of this particular function in other
places in the code has been dropped by means of a general
clean up of gfs2_inplace_reserve(). This function is now
much easier to follow.
Also the setting of the rgd->rd_last_alloc field is corrected.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch improves the tracing of block reservations by
removing some corner cases and also providing more useful
detail in the traces.
A new field is added to the reservation structure to contain
the inode number. This is used since in certain contexts it is
not possible to access the inode itself to obtain this information.
As a result we can then display the inode number for all tracepoints
and also in case we dump the resource group.
The "del" tracepoint operation has been removed. This could be called
with the reservation rgrp set to NULL. That resulted in not printing
the device number, and thus making the information largely useless
anyway. Also, the conditional on the rgrp being NULL can then be
removed from the tracepoint. After this change, all the block
reservation tracepoint calls will be called with the rgrp information.
The existing ins,clm and tdel calls to the block reservation tracepoint
are sufficient to track the entire life of the block reservation.
In gfs2_block_alloc() the error detection is updated to print out
the inode number of the problematic inode. This can then be compared
against the information in the glock dump,tracepoints, etc.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When we get to the stage of allocating blocks, we know that the
resource group in question must contain enough free blocks, otherwise
gfs2_inplace_reserve() would have failed. So if we are left with only
free blocks which are reserved, then we must use those. This can happen
if another node has sneeked in and use some blocks reserved on this
node, for example. Generally this will happen very rarely and only
when the resouce group is nearly full.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The ->show_options() function for GFS2 was not correctly displaying
the value when statfs slow in in use.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Milos Jakubicek <xjakub@fi.muni.cz>
Use the rbm structure for gfs2_setbit() in order to simplify the
arguments to the function. We have to add a bool to control whether
the clone bitmap should be updated (if it exists) but otherwise it
is a more or less direct substitution.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Change the arguments to gfs2_testbit() so that it now just takes an
rbm specifying the position of the two bit entry to return.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Function gfs2_bitfit was checking for state > 3, but that's
impossible since it is only called from rgblk_search, which receives
only GFS2_BLKST_ constants.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Function add_to_queue was checking may_grant for the passed-in
holder for every iteration of its gh2 loop. Now it only checks it
once at the beginning to see if a try lock is futile.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Function gfs2_glock_dq_wait called two-line function wait_on_demote,
so they were combined.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Function gfs2_glock_wait only called function wait_on_holder and
returned its return code, so they were combined for readability.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since function gfs2_glock_schedule_for_reclaim is only two
significant lines, we can eliminate it, simplifying the code
and making it more readable.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes function gfs2_direct_IO so that it uses a normal
call to gfs2_glock_dq rather than a call to a multiple-dq of one item.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes a few small rbm related things. First, it fixes
a corner case where the rbm needs to switch bitmaps and wasn't
adjusting its buffer pointer. Second, there's a white space issue
fixed. Third, the logic in function gfs2_rbm_from_block was optimized
a bit. Lastly, a check for goal block overflows was added to function
gfs2_alloc_blocks.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
One corner case which the original patch failed to take into
account was when there is a reservation which ended such that
the following block was one beyond the end of the rgrp in
question. This extra test fixes that case.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Bob Peterson <rpeterso@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>