The ticket ID is needed to uniquely identify transactions when doing busy
extent matching. Delayed logging changes the lifecycle of busy extents with
respect to the transaction structure lifecycle. Hence we can no longer use
the transaction structure as a means of determining the owner of the busy
extent as it may be freed and reused while the busy extent is still active.
This commit provides the infrastructure to access the xlog_tid_t held in the
ticket from a transaction handle. This avoids the need for callers to peek
into the transaction and log structures to find this out.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Push the error message output when a ticket overrun is detected
into the ticket printing functions. Also remove the debug version
of the code as the production version will still panic just as
effectively on a debug kernel via the panic mask being set.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Clean up the buffer log format (XFS_BLI_*) flags because they have a
polluted namespace. The XFS_BLI_ prefix is used for both in-memory
and on-disk flag fields, which have overlapping values for different
flags. Rename the buffer log format flags to use the XFS_BLF_*
prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed
flags.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The buffer log item reference counts used to take references for every
transaction, similar to the pin counting. This is symmetric (like the
pin/unpin) with respect to transaction completion, but with delayed logging
becomes asymmetric as the pinning becomes asymmetric w.r.t. transaction
completion.
To make both cases the same, allow the buffer pinning to take a reference to
the buffer log item and always drop the reference the transaction has on it
when being unlocked. This is balanced correctly because the unpin operation
always drops a reference to the log item. Hence reference counting becomes
symmetric w.r.t. item pinning as well as w.r.t. active transactions and as a
result the reference counting model remains consistent between normal and
delayed logging.
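
The reference counting rule described above can be modelled in a few
lines. This is only a user-space Python sketch, not the XFS code: the
class BufLogItem and its method names are hypothetical and were chosen to
mirror the operations named in the text (join a transaction, pin, unlock,
unpin).

```python
class BufLogItem:
    """Toy model of a buffer log item's reference count (hypothetical names)."""

    def __init__(self):
        self.refcount = 0
        self.pin_count = 0

    def join_transaction(self):
        # The transaction holds one reference while the item is locked.
        self.refcount += 1

    def pin(self):
        # Pinning takes its own reference on the log item.
        self.pin_count += 1
        self.refcount += 1

    def unlock(self):
        # The transaction's reference is always dropped on unlock.
        self.refcount -= 1

    def unpin(self):
        # Unpin always drops the reference that pin() took.
        self.pin_count -= 1
        self.refcount -= 1


def lifecycle():
    """Return (refcount while pinned after unlock, refcount after unpin)."""
    item = BufLogItem()
    item.join_transaction()
    item.pin()
    item.unlock()
    held_while_pinned = item.refcount
    item.unpin()
    return held_while_pinned, item.refcount
```

In this model the pin keeps the item alive after the transaction lets go,
no matter how much later the unpin happens.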
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Delayed logging currently requires ticket allocation to succeed, so
we need to be able to sleep on allocation. It also should not allow
memory allocation to recurse into the filesystem. Hence we need to
pass allocation flags directing the type of allocation the caller
requires.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The transaction ID is written into the log as the unique identifier
for transactions during recovery. When duplicating a transaction, we
reuse the log ticket, which means it has the same transaction ID as
the previous transaction.
Rather than regenerating a random transaction ID for the duplicated
transaction, just add one to the current ID so that duplicated
transactions can be easily spotted in the log and during recovery
when diagnosing problems.
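
As an illustration of the ID scheme, here is a small Python sketch. The
function names are hypothetical, and treating the ID as a 32-bit value
that wraps is an assumption made for the example, not something stated
above.

```python
import random

TID_MASK = 0xFFFFFFFF  # assumption: transaction IDs are 32-bit values


def new_tid(rng=random):
    """A fresh transaction gets a random ID."""
    return rng.getrandbits(32)


def dup_tid(tid):
    """A duplicated transaction reuses the ticket and gets the old ID plus one."""
    return (tid + 1) & TID_MASK


def looks_duplicated(prev_tid, tid):
    """Heuristic for log inspection: consecutive IDs suggest a duplicate."""
    return tid == dup_tid(prev_tid)
```

With this scheme a chain of duplicated transactions shows up in the log
as a run of consecutive IDs.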
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
* 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
uml: Pushdown the bkl from harddog_kern ioctl
sunrpc: Pushdown the bkl from sunrpc cache ioctl
sunrpc: Pushdown the bkl from ioctl
autofs4: Pushdown the bkl from ioctl
uml: Convert to unlocked_ioctls to remove implicit BKL
ncpfs: BKL ioctl pushdown
coda: Clean-up whitespace problems in pioctl.c
coda: BKL ioctl pushdown
drivers: Push down BKL into various drivers
isdn: Push down BKL into ioctl functions
scsi: Push down BKL into ioctl functions
dvb: Push down BKL into ioctl functions
smbfs: Push down BKL into ioctl function
coda/psdev: Remove BKL from ioctl function
um/mmapper: Remove BKL usage
sn_hwperf: Kill BKL usage
hfsplus: Push down BKL into ioctl function
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
exofs: confusion between kmap() and kmap_atomic() api
exofs: Add default address_space_operations
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
fat: convert to unlocked_ioctl
fat: Cleanup nls_unload() usage
fat: use pack_hex_byte() instead of custom one
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (59 commits)
ceph: reuse mon subscribe message instead of allocated anew
ceph: avoid resending queued message to monitor
ceph: Storage class should be before const qualifier
ceph: all allocation functions should get gfp_mask
ceph: specify max_bytes on readdir replies
ceph: cleanup pool op strings
ceph: Use kzalloc
ceph: use common helper for aborted dir request invalidation
ceph: cope with out of order (unsafe after safe) mds reply
ceph: save peer feature bits in connection structure
ceph: resync headers with userland
ceph: use ceph. prefix for virtual xattrs
ceph: throw out dirty caps metadata, data on session teardown
ceph: attempt mds reconnect if mds closes our session
ceph: clean up send_mds_reconnect interface
ceph: wait for mds OPEN reply to indicate reconnect success
ceph: only send cap releases when mds is OPEN|HUNG
ceph: dicard cap releases on mds restart
ceph: make mon client statfs handling more generic
ceph: drop src address(es) from message header [new protocol feature]
...
We should be checking for the ownership of the file for which
flags are being set, rather than just for write access.
Reported-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
UFS quota is non-functional at least since 2.6.12 because dq_op was set
to NULL. Since the filesystem exists mainly to allow cooperation with Solaris
and quota format isn't standard, just remove the dead code.
CC: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Jan Kara <jack@suse.cz>
Quota on UDF is non-functional at least since 2.6.16 (I'm too lazy to
do more archeology) because it does not provide .quota_write and .quota_read
functions and thus quotaon(8) just returns EINVAL. Since nobody complained
for all those years and quota support is not even in UDF standard just nuke
it.
Signed-off-by: Jan Kara <jack@suse.cz>
Follow the dquot_* style used elsewhere in dquot.c.
[Jan Kara: Fixed up missing conversion of ext2]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Only set the quota operation vectors if the filesystem actually supports
quota instead of doing it for all filesystems in alloc_super().
[Jan Kara: Export dquot_operations and vfs_quotactl_ops]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Remount handling has fully moved into the filesystem, so all this is
superfluous now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently the VFS calls into the quotactl interface for unmounting
filesystems. This means filesystems with their own quota handling
can't easily distinguish between user-space originating quotaoff
and an unmount. Instead move the responsibility of the unmount handling
into the filesystem to be consistent with all other dquot handling.
Note that we do call dquot_disable a lot later now, e.g. after
a sync_filesystem. But this is fine as the quota code does all its
writes via blockdev's mapping and that is synced even later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Instead of having wrappers in the VFS namespace export the dquot_suspend
and dquot_resume helpers directly. Also rename vfs_quota_disable to
dquot_disable while we're at it.
[Jan Kara: Moved dquot_suspend to quotaops.h and made it inline]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently do_remount_sb calls into the dquot code to tell it about going
from rw to ro and ro to rw. Move this code into the filesystem to
not depend on the dquot code in the VFS - note ocfs2 already ignores
these calls and handles remount by itself. This gets rid of overloading
the quotactl calls and allows unifying the VFS and XFS codepaths in
that area later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
We also have to cancel the quota syncing thread on remount read-only because
at that moment quota is being turned off. Otherwise the quota syncing thread
will try to access already freed quota structures.
Signed-off-by: Jan Kara <jack@suse.cz>
Only read potentially matching names into the target buffer; obviously
non-matching names don't need to be read into it.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
This patch removes a redundant fid clone on the directory fid and hence
saves a server transaction while creating a new filesystem object.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Without this patch, an attempt to mksock will get an EINVAL.
Before this patch:
[root@localhost 1dir]# mksock mysock
mksock: error making mysock: Invalid argument
With this patch:
[root@localhost 1dir]# mksock mysock
[root@localhost 1dir]# ls -l mysock
s--------- 1 root root 0 2010-03-31 17:44 mysock
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
For lookup if we get ENOENT error from the server we still
instantiate the dentry. We need to make sure we have dentry
operations set in that case so that a later dput on the dentry
does the expected thing. Without the patch we get the error below:
#ln -sf abc abclink
ln: creating symbolic link `abclink': No such file or directory
Now on the host do
$ touch abclink
Guest now gives ENOENT error.
# ls
ls: cannot access abclink: No such file or directory
Debugged-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Pushdown the bkl to autofs4_root_ioctl.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Autofs <autofs@linux.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
get rid of home-grown mutex in cris eeprom.c
switch ecryptfs_write() to struct inode *, kill on-stack fake files
switch ecryptfs_get_locked_page() to struct inode *
simplify access to ecryptfs inodes in ->readpage() and friends
AFS: Don't put struct file on the stack
Ban ecryptfs over ecryptfs
logfs: replace inode uid,gid,mode initialization with helper function
ufs: replace inode uid,gid,mode initialization with helper function
udf: replace inode uid,gid,mode init with helper
ubifs: replace inode uid,gid,mode initialization with helper function
sysv: replace inode uid,gid,mode initialization with helper function
reiserfs: replace inode uid,gid,mode initialization with helper function
ramfs: replace inode uid,gid,mode initialization with helper function
omfs: replace inode uid,gid,mode initialization with helper function
bfs: replace inode uid,gid,mode initialization with helper function
ocfs2: replace inode uid,gid,mode initialization with helper function
nilfs2: replace inode uid,gid,mode initialization with helper function
minix: replace inode uid,gid,mode init with helper
ext4: replace inode uid,gid,mode init with helper
...
Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
Use the same message, allocated during startup. No need to reallocate a
new one each time around (and potentially ENOMEM).
Signed-off-by: Sage Weil <sage@newdream.net>
Don't put struct file on the stack as it takes up quite a lot of space
and violates lifetime rules for struct file.
Rather than calling afs_readpage() indirectly from the directory routines by
way of read_mapping_page(), split afs_readpage() to have afs_page_filler()
that's given a key instead of a file and call read_cache_page(), specifying the
new function directly. Use it in afs_readpages() as well.
Also make use of this in afs_mntpt_check_symlink() too for the same reason.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
This is a seriously simplified patch from Eric Sandeen; copy of
rationale follows:
===
mounting stacked ecryptfs on ecryptfs has been shown to lead to bugs
in testing. For crypto info in xattr, there is no mechanism for handling
this at all, and for normal file headers, we run into other trouble:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffffa015b0b3>] ecryptfs_d_revalidate+0x43/0xa0 [ecryptfs]
...
There doesn't seem to be any good usecase for this, so I'd suggest just
disallowing the configuration.
Based on a patch originally, I believe, from Mike Halcrow.
===
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- it seems that ramfs_get_inode is only used locally, so make it static.
[AV: the hell it is; it's used by shmem, so shmem needed conversion too
and no, that function can't be made static]
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This fixes sparse noise:
error: dubious one-bit signed bitfield
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
update the mnt of the path when it is not equal to the new one.
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> =============================================
> [ INFO: possible recursive locking detected ]
> 2.6.31-2-generic #14~rbd3
> ---------------------------------------------
> firefox-3.5/4162 is trying to acquire lock:
> (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
>
> but task is already holding lock:
> (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
>
> other info that might help us debug this:
> 3 locks held by firefox-3.5/4162:
> #0: (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
> #1: (&sb->s_type->i_mutex_key#11/1){+.+.+.}, at: [<ffffffff81139d5a>] lock_rename+0x6a/0xf0
> #2: (&sb->s_type->i_mutex_key#11/2){+.+.+.}, at: [<ffffffff81139d6f>] lock_rename+0x7f/0xf0
>
> stack backtrace:
> Pid: 4162, comm: firefox-3.5 Tainted: G C 2.6.31-2-generic #14~rbd3
> Call Trace:
> [<ffffffff8108ae74>] print_deadlock_bug+0xf4/0x100
> [<ffffffff8108ce26>] validate_chain+0x4c6/0x750
> [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
> [<ffffffff8108d585>] lock_acquire+0xa5/0x150
> [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
> [<ffffffff815526ad>] __mutex_lock_common+0x4d/0x3d0
> [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
> [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
> [<ffffffff8120eaf9>] ? ecryptfs_rename+0x99/0x170
> [<ffffffff81552b36>] mutex_lock_nested+0x46/0x60
> [<ffffffff81139d31>] lock_rename+0x41/0xf0
> [<ffffffff8120eb2a>] ecryptfs_rename+0xca/0x170
> [<ffffffff81139a9e>] vfs_rename_dir+0x13e/0x160
> [<ffffffff8113ac7e>] vfs_rename+0xee/0x290
> [<ffffffff8113c212>] ? __lookup_hash+0x102/0x160
> [<ffffffff8113d512>] sys_renameat+0x252/0x280
> [<ffffffff81133eb4>] ? cp_new_stat+0xe4/0x100
> [<ffffffff8101316a>] ? sysret_check+0x2e/0x69
> [<ffffffff8108c34d>] ? trace_hardirqs_on_caller+0x14d/0x190
> [<ffffffff8113d55b>] sys_rename+0x1b/0x20
> [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
The trace above is totally reproducible by doing a cross-directory
rename on an ecryptfs directory.
The issue seems to be that sys_renameat() does lock_rename() then calls
into the filesystem; if the filesystem is ecryptfs, then
ecryptfs_rename() again does lock_rename() on the lower filesystem, and
lockdep can't tell that the two s_vfs_rename_mutexes are different. It
seems an annotation like the following is sufficient to fix this (it
does get rid of the lockdep trace in my simple tests); however I would
like to make sure I'm not misunderstanding the locking, hence the CC
list...
Signed-off-by: Roland Dreier <rdreier@cisco.com>
Cc: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ADDPART_FLAG_RAID was introduced in commit d18d768, and most places were
converted to use it instead of a hardcoded value. However, some places seem
to have been missed.
Change all of them to the symbolic names via the following semantic patch:
@@
struct parsed_partitions *state;
expression E;
@@
(
- state->parts[E].flags = 1
+ state->parts[E].flags = ADDPART_FLAG_RAID
|
- state->parts[E].flags |= 1
+ state->parts[E].flags |= ADDPART_FLAG_RAID
|
- state->parts[E].flags = 2
+ state->parts[E].flags = ADDPART_FLAG_WHOLEDISK
|
- state->parts[E].flags |= 2
+ state->parts[E].flags |= ADDPART_FLAG_WHOLEDISK
)
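
For readers unfamiliar with the two flags, the mapping used by the
semantic patch (1 for ADDPART_FLAG_RAID, 2 for ADDPART_FLAG_WHOLEDISK)
can be shown as a short Python sketch. The enum and helper names are
hypothetical; only the two numeric values come from the patch above.

```python
from enum import IntFlag


class AddpartFlag(IntFlag):
    """Symbolic names for the values the semantic patch replaces."""
    RAID = 1
    WHOLEDISK = 2


def is_raid(flags):
    """True if the RAID bit is set in a partition's flags word."""
    return bool(flags & AddpartFlag.RAID)


def is_wholedisk(flags):
    """True if the whole-disk bit is set in a partition's flags word."""
    return bool(flags & AddpartFlag.WHOLEDISK)
```

A test such as is_raid(flags) says what it means, while "flags & 1" makes
the reader look up which bit is which.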
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that the last user passing a NULL file pointer is gone we can remove
the redundant dentry argument and associated hacks inside vfs_fsync_range.
The next step will be removing the dentry argument from ->fsync, but given
the luck with the last round of method prototype changes I'd rather
defer this until after the main merge window.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of just looking up a path use do_filp_open to get us a file
structure for the nfs4 recovery directory. This allows us to get
rid of the last non-standard vfs_fsync caller with a NULL file
pointer.
[AV: should be using fput(), not filp_close()]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Using atomic_inc_return in __iget(struct inode *inode) makes the intent
of this code clearer and generates less code on processors that have
this operation.
On x86_64 this patch reduces the text size of inode.o by 12 bytes.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
----
patch against 2.6.34-rc7
compiled & tested on x86_64 AMD X2
I've been running with this patch applied for several weeks with no
obvious problems.
regards
Richard
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
anon_inode_mkinode() sets inode->i_mode = S_IRUSR | S_IWUSR; This means
that (inode->i_mode & S_IFMT) == 0. This trips up some SELinux code that
needs to determine if a given inode is a regular file, a directory, etc.
The easiest solution is to just make sure that the anon_inode also sets
S_IFREG.
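
The effect on the mode bits can be checked from user space with Python's
stat module, which exposes the same S_* constants. This is only a sketch
of the bit arithmetic, not of the kernel code; the variable names are
made up for the example.

```python
import stat

# Mode as set before the fix: permission bits only, no file type.
old_mode = stat.S_IRUSR | stat.S_IWUSR

# Mode with the fix applied: the regular-file type bit is included.
new_mode = stat.S_IFREG | stat.S_IRUSR | stat.S_IWUSR


def file_type_bits(mode):
    """Return the file type portion of a mode, i.e. mode & S_IFMT."""
    return stat.S_IFMT(mode)
```

Here file_type_bits(old_mode) is zero, so a check like S_ISREG() or
S_ISDIR() cannot classify the inode, while new_mode is recognised as a
regular file.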
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The entries in xattr handler table should be immutable (ie const)
like other operation tables.
Later patches convert common filesystems. Unconverted filesystems
will still work, but will generate a compiler warning.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently the way we do freezing is by passing sb->s_bdev to freeze_bdev and then
letting it do all the work. But freezing is more of an fs thing, and doesn't
really have much to do with the bdev at all, all the work gets done with the
super. In btrfs we do not populate s_bdev, since we can have multiple bdev's
for one fs and setting s_bdev makes removing devices from a pool kind of tricky.
This means that freezing a btrfs filesystem fails, which causes corruption
with things like tux-on-ice, which use the fsfreeze mechanism. So instead of
populating sb->s_bdev with a random bdev in our pool, I've broken the actual fs
freezing stuff into freeze_super and thaw_super. These just take the
super_block that we're freezing and do the appropriate work. It's basically
just copy and pasted from freeze_bdev. I've then converted freeze_bdev over to
use the new super helpers. I've tested this with ext4 and btrfs and verified
everything continues to work the same as before.
The only new gotcha is multiple calls to the fsfreeze ioctl will return EBUSY if
the fs is already frozen. I thought this was a better solution than adding a
freeze counter to the super_block, but if everybody hates this idea I'm open to
suggestions. Thanks,
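
The EBUSY behaviour can be pictured with a toy Python model. It is not
the kernel implementation: the SuperBlock class is hypothetical, and the
error returned by thaw_super() on a filesystem that is not frozen is an
assumption made for the example.

```python
import errno


class SuperBlock:
    """Stand-in for a super_block; only tracks the frozen state."""

    def __init__(self):
        self.frozen = False


def freeze_super(sb):
    """Freeze once; a second freeze reports -EBUSY instead of nesting."""
    if sb.frozen:
        return -errno.EBUSY
    sb.frozen = True
    return 0


def thaw_super(sb):
    """Thaw a frozen filesystem (assumption: -EINVAL if it is not frozen)."""
    if not sb.frozen:
        return -errno.EINVAL
    sb.frozen = False
    return 0
```

Because there is no freeze counter, one thaw undoes the freeze no matter
how many callers tried to freeze in the meantime.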
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
need list_for_each_entry_safe() here. Original didn't even
have restart logics, so if you race with umount() it blew up.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
At the same time we can kill s_need_restart and local mutex in there.
__put_super() made public for a while; will be gone later.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We used to remove from s_list and s_instances at the same
time. So let's *not* do the former and skip superblocks
that have empty s_instances in the loops over s_list.
The next step, of course, will be to get rid of rescan logics
in those loops.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure that s_umount is acquired *before* we drop the final
active reference; we still have the fast path (atomic_dec_unless)
and we have gotten rid of the window between the moment when
s_active hits zero and s_umount is acquired. Which simplifies
the living hell out of grab_super() and inotify pin_to_kill()
stuff.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
First of all, get_sb_nodev() grabs anon dev minor and we
never free it in ecryptfs ->kill_sb(). Moreover, on one
of the failure exits in ecryptfs_get_sb() we leak things -
it happens before we set ->s_root and ->put_super() won't
be called in that case. Solution: kill ->put_super(), do
all that stuff in ->kill_sb(). And use kill_anon_sb() instead
of generic_shutdown_super() to deal with anon dev leak.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We set the "it's dead, don't mount on it" flag _and_ do not remove it if
we turn the damn thing negative and leave it around. And if it goes
positive afterwards, well...
Fortunately, there's only one place where that needs to be caught:
only d_delete() can turn the sucker negative without immediately freeing
it; all other places that can lead to ->d_iput() call are followed by
unconditionally freeing struct dentry in question. So the fix is obvious:
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16014
Reported-by: Adam Tkac <vonsch@gmail.com>
Tested-by: Adam Tkac <vonsch@gmail.com>
Cc: <stable@kernel.org> [2.6.34.x]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The auth_reply handler will (re)send any pending requests. For the
initial mon authenticate phase, that's correct, but when an auth ticket
renewal races with an in-flight request, we may resend a request message
that is already in flight. Avoid this by revoking the message before
sending it.
We should also avoid resending requests at all during ticket renewal; that
will come soon.
Signed-off-by: Sage Weil <sage@newdream.net>
The C99 specification states in section 6.11.5:
The placement of a storage-class specifier other than at the beginning
of the declaration specifiers in a declaration is an obsolescent
feature.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Sage Weil <sage@newdream.net>
I made a V2 of this patch on top of my patches for VFS switches.
All the changes were due to change in some offsets.
rename - change name of file or directory
size[4] Trename tag[2] fid[4] newdirfid[4] name[s]
size[4] Rrename tag[2]
The rename message is used to change the name of a file, possibly moving it
to a new directory. The 9P wstat message can only rename a file within the
same directory.
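
To make the layout concrete, here is a Python sketch that builds a
Trename message from the field list above. It assumes the usual 9P
conventions: little-endian integers, a one-byte message type after the
size field, and strings encoded as a two-byte length followed by the
bytes. The numeric type code is not given in the text, so the value used
for P9_TRENAME is an assumption, and the helper names are hypothetical.

```python
import struct

P9_TRENAME = 20  # assumption: the type code is not stated in the text


def pack_string(s):
    """Encode name[s]: two-byte length followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data


def pack_trename(tag, fid, newdirfid, name, msg_type=P9_TRENAME):
    """Build size[4] Trename tag[2] fid[4] newdirfid[4] name[s]."""
    body = struct.pack("<BHII", msg_type, tag, fid, newdirfid)
    body += pack_string(name)
    # size[4] counts the whole message, including the size field itself.
    return struct.pack("<I", len(body) + 4) + body
```

For a three-character name the message is 20 bytes long: 4 for the size,
1 for the type, 2 for the tag, 4 each for the two fids, and 2 + 3 for the
name.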
Signed-off-by: Jim Garlick <garlick@llnl.gov>
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
I made a V2 of this patch on top of my patches for VFS switches. The
change was adding v9fs_statfs pointer to v9fs_super_ops_dotl
instead of v9fs_super_ops.
statfs - get file system statistics
size[4] Tstatfs tag[2] fid[4]
size[4] Rstatfs tag[2] type[4] bsize[4] blocks[8] bfree[8] bavail[8]
files[8] ffree[8] fsid[8] namelen[4]
The statfs message is used to request file system information returned
by the statfs(2) system call, which is used by df(1) to report file
system and disk space usage.
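
The Rstatfs reply has a fixed size, so it can be decoded with a single
format string. The Python sketch below assumes the usual 9P conventions
(little-endian integers and a one-byte message type after the size
field). The names RSTATFS_FMT, Rstatfs and unpack_rstatfs are
hypothetical.

```python
import struct
from collections import namedtuple

# size[4] Rstatfs(1) tag[2] type[4] bsize[4] blocks[8] bfree[8] bavail[8]
# files[8] ffree[8] fsid[8] namelen[4]
RSTATFS_FMT = "<IBHIIQQQQQQI"

Rstatfs = namedtuple(
    "Rstatfs",
    "size msg_type tag type bsize blocks bfree bavail files ffree fsid namelen",
)


def unpack_rstatfs(buf):
    """Decode a complete Rstatfs message into named fields."""
    return Rstatfs._make(struct.unpack(RSTATFS_FMT, buf))


def available_bytes(reply):
    """Space available to unprivileged users, as df(1) would report it."""
    return reply.bavail * reply.bsize
```

The whole message is 67 bytes, which is also the value a well-formed
reply carries in its size field.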
Signed-off-by: Jim Garlick <garlick@llnl.gov>
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Implements VFS switches for 9p2000.L protocol.
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
We need at least two to guarantee proper POSIX behaviour, so
never allow a smaller limit than that.
Also expose a /proc/sys/fs/pipe-max-pages sysctl file that allows
root to define a sane upper limit. Make it default to 16 times the
default size, which is 16 pages.
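
The limits work out as follows. This Python sketch only illustrates the
arithmetic: the constant and function names are hypothetical, and
rejecting an out-of-range request (rather than clamping it) is an
assumption made for the example.

```python
PIPE_DEF_PAGES = 16                           # default pipe size from the text
PIPE_MIN_PAGES = 2                            # minimum needed for POSIX behaviour
PIPE_MAX_PAGES_DEFAULT = 16 * PIPE_DEF_PAGES  # default pipe-max-pages: 256


def validate_pipe_pages(requested, max_pages=PIPE_MAX_PAGES_DEFAULT):
    """Return the requested page count if it is within the limits."""
    if requested < PIPE_MIN_PAGES:
        raise ValueError("a pipe needs at least %d pages" % PIPE_MIN_PAGES)
    if requested > max_pages:
        raise ValueError("request exceeds the limit of %d pages" % max_pages)
    return requested
```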
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for
growing and shrinking the size of a pipe and adjusts pipe.c and splice.c
(and relay and network splice) usage to work with these larger (or smaller)
pipes.
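
From user space the two actions can be exercised as sketched below in
Python. The helper name resize_pipe is hypothetical. On current Linux
kernels, per fcntl(2), the F_SETPIPE_SZ argument is a byte count and the
kernel may round it up, which is why the helper reads the size back. The
numeric fallbacks are the Linux values and are used only when the Python
build does not export the names.

```python
import fcntl
import os

# Linux command numbers, used only if this Python build lacks the names.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)


def resize_pipe(fd, nbytes):
    """Ask the kernel to resize the pipe and return the size it reports."""
    fcntl.fcntl(fd, F_SETPIPE_SZ, nbytes)
    return fcntl.fcntl(fd, F_GETPIPE_SZ)


def demo():
    """Create a pipe, resize it, and return the resulting capacity."""
    r, w = os.pipe()
    try:
        return resize_pipe(w, 8192)
    finally:
        os.close(r)
        os.close(w)
```

The value returned can be larger than the value requested, so callers
should use the reported size rather than assume their request was applied
exactly.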
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When CONFIG_BLOCK isn't enabled:
mm/page-writeback.c: In function 'laptop_mode_timer_fn':
mm/page-writeback.c:708: error: dereferencing pointer to incomplete type
mm/page-writeback.c:709: error: dereferencing pointer to incomplete type
Fix this by essentially eliminating the laptop sync handlers when
CONFIG_BLOCK isn't set, as most are only used from the block layer code.
The exception is laptop_sync_completion() which is used from sys_sync(),
make that an empty declaration in that case.
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently, native capacity unlocking is initiated only when a
recognized partition extends beyond the end of the disk. However,
there are several other unhandled cases where truncated capacity can
lead to misdetection of partitions.
* Partition table is fully beyond EOD.
* Partition table is partially beyond EOD (daisy chained ones).
* Recognized partition starts beyond EOD.
This patch updates generic partition check code such that all the
above three cases are handled too. For the first two, @state tracks
whether low level partition check code tried to read beyond EOD during
partition scan and triggers native capacity unlocking accordingly.
The third is now handled similarly to the original unlocking case.
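
The resulting decision can be summarised in a small Python sketch. It is
a model of the rule described above, not the partition code itself: the
function name, the (start, length) tuples in sectors, and the
read_beyond_eod flag standing in for what @state tracks are all
hypothetical.

```python
def should_unlock_native_capacity(capacity, partitions, read_beyond_eod):
    """Decide whether native capacity unlocking should be attempted.

    capacity        -- current device size in sectors
    partitions      -- iterable of (start, length) for recognized partitions
    read_beyond_eod -- True if the scan tried to read past the end of disk
    """
    # Partition table fully or partially beyond EOD: the scan itself
    # tried to read past the end of the device.
    if read_beyond_eod:
        return True
    for start, length in partitions:
        # Recognized partition starts beyond EOD.
        if start >= capacity:
            return True
        # Original case: recognized partition extends beyond EOD.
        if start + length > capacity:
            return True
    return False
```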
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Make the following changes to partition check code.
* Add ->bdev to struct parsed_partitions.
* Introduce read_part_sector() which is a simple wrapper around
read_dev_sector() which takes struct parsed_partitions *state
instead of @bdev.
* For functions which used to take @state and @bdev, drop @bdev. For
functions which used to take @bdev, replace it with @state.
* While updating, drop superfluous checks on NULL state/bdev in ldm.c.
This cleans up the API a bit and enables better handling of IO errors
during partition check as the generic partition check code now has
much better visibility into what went wrong in the low level code
paths.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
bdops->set_capacity() is unnecessarily generic. All that's required
is a simple one way notification to lower level driver telling it to
try to unlock native capacity. There's no reason to pass in target
capacity or return the new capacity. The former is always the
inherent native capacity and the latter can be handled via the usual
device resize / revalidation path. In fact, the current API is always
used that way.
Replace ->set_capacity() with ->unlock_native_capacity() which takes
only @disk and doesn't return anything. IDE, which is the only current
user of the API, is converted accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Device resize via ->set_capacity() can reveal new partitions (e.g. in
chained partition table formats such as dos extended parts). Restart
partition scan from the beginning after resizing a device. This
change also makes libata always revalidate the disk after resize which
makes lower layer native capacity unlocking implementation simpler and
more robust as resize can be handled in the usual path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
invalidate_bdev() should release all page cache pages which are clean
and not being used; however, if some pages are still in the percpu LRU
add caches on other cpus, those pages are considered in use and don't
get released. Fix it by calling lru_add_drain_all() before trying to
invalidate pages.
This problem was discovered while testing block automatic native
capacity unlocking. Null pages which were read before automatic
unlocking didn't get released by invalidate_bdev() and ended up
interfering with partition scan after unlocking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Calling schedule without setting the task state to non-running will
return immediately, so ensure that we set it properly and check our
sleep conditions after doing so.
This is a fixup for commit 69b62d01.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Even if the writeout itself isn't a data integrity operation, we need
to ensure that the caller doesn't drop the sb umount sem before we
have actually done the writeback.
This is a fixup for commit e913fc82.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (31 commits)
dquot: Detect partial write error to quota file in write_blk() and add printk_ratelimit for quota error messages
ocfs2: Fix lock inversion in quotas during umount
ocfs2: Use __dquot_transfer to avoid lock inversion
ocfs2: Fix NULL pointer deref when writing local dquot
ocfs2: Fix estimate of credits needed for quota allocation
ocfs2: Fix quota locking
ocfs2: Avoid unnecessary block mapping when refreshing quota info
ocfs2: Do not map blocks from local quota file on each write
quota: Refactor dquot_transfer code so that OCFS2 can pass in its references
quota: unify quota init condition in setattr
quota: remove sb_has_quota_active in get/set_info
quota: unify ->set_dqblk
quota: unify ->get_dqblk
ext3: make barrier options consistent with ext4
quota: Make quota stat accounting lockless.
suppress warning: "quotatypes" defined but not used
ext3: Fix waiting on transaction during fsync
jbd: Provide function to check whether transaction will issue data barrier
ufs: add ufs speciffic ->setattr call
BKL: Remove BKL from ext2 filesystem
...
This patch changes quota_tree.c:write_blk() to detect errors caused by a
partial write to the quota file and adds a macro to rate-limit printed quota
error messages so we won't fill up dmesg when a quota file is corrupted.
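
Both ideas can be sketched in Python. This is not the quota code:
write_blk here takes a caller-supplied pwrite callable, mapping a short
write to -EIO is an assumption, and the RateLimit class with its interval
and burst parameters is a hypothetical stand-in for the rate-limiting
macro.

```python
import errno
import time


def write_blk(pwrite, blk, buf, blocksize=1024):
    """Write one block; treat a short write as an error."""
    ret = pwrite(buf, blk * blocksize)
    if ret != blocksize:
        # Pass through a negative error, otherwise report the short write.
        return ret if ret < 0 else -errno.EIO
    return 0


class RateLimit:
    """Allow at most `burst` messages per `interval` seconds."""

    def __init__(self, interval=5.0, burst=10, clock=time.monotonic):
        self.interval = interval
        self.burst = burst
        self.clock = clock
        self.window_start = clock()
        self.printed = 0

    def allow(self):
        now = self.clock()
        if now - self.window_start >= self.interval:
            # A new window begins; reset the message budget.
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        return False
```

A caller would check allow() before printing each quota error, so a badly
corrupted file produces a bounded number of messages per interval.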
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
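The partial-write check described in the write_blk() change above might look roughly like the following user-space sketch. All names here (`toy_write_fn`, `toy_write_blk()` and the three stub writers) are hypothetical, and mapping a short write to -EIO is an assumption made for illustration rather than a statement about the actual patch.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical lower-level writer: returns bytes written or a negative errno. */
typedef ssize_t (*toy_write_fn)(const char *buf, size_t len);

/* Write one block and treat anything but a full-size write as an error. */
static int toy_write_blk(toy_write_fn write_fn, const char *buf, size_t blksize)
{
    ssize_t ret = write_fn(buf, blksize);

    if (ret < 0)
        return (int)ret;        /* hard error from the lower layer */
    if ((size_t)ret != blksize)
        return -EIO;            /* short write: assumed to map to -EIO */
    return 0;
}

/* Stub writers used only to exercise the three outcomes. */
static ssize_t toy_full_write(const char *buf, size_t len)
{
    (void)buf;
    return (ssize_t)len;
}

static ssize_t toy_short_write(const char *buf, size_t len)
{
    (void)buf;
    return (ssize_t)(len / 2);
}

static ssize_t toy_failing_write(const char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return -ENOSPC;
}
```

The point of the sketch is that a non-negative return value alone is not proof of success; the caller has to compare it against the block size.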
We cannot cancel delayed work from ocfs2_local_free_info because that is called
with dqonoff_mutex held and the work it cancels requires dqonoff_mutex to
finish. Cancel the work before acquiring dqonoff_mutex.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
dquot_transfer() acquires its own references to dquots via dqget(). Thus it
waits for dq_lock, which creates a lock inversion because dq_lock ranks above
transaction start but the transaction is already started in ocfs2_setattr().
Fix the problem by passing our own references directly to __dquot_transfer.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
commit_dqblk() can write quota info to global file. That is actually a bad
thing to do because if we are just modifying local quota file, we are not
prepared (do not hold proper locks, do not have transaction credits) to do
a modification of the global quota file. So do not use commit_dqblk() and
instead call our writing function directly.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We were missing reservation of a journal credit for modification of quota
file inode when creating new dquot structure in the global quota file.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
OCFS2 had three issues with quota locking:
a) When reading dquot from global quota file, we started a transaction while
holding dqio_mutex which is prone to deadlocks because other paths do it
the other way around
b) During ocfs2_sync_dquot we were not protected against concurrent writers
on the same node. Because we first copy data to local buffer, a race
could happen resulting in old data being written to global quota file and
thus causing quota inconsistency after a crash.
c) ip_alloc_sem of quota files was acquired while a transaction is started
in ocfs2_quota_write which can deadlock because we first get ip_alloc_sem
and then start a transaction when extending quota files.
We fix the problem a) by pulling all necessary code to ocfs2_acquire_dquot
and ocfs2_release_dquot. Thus we no longer depend on generic dquot_acquire
to do the locking and can force proper lock ordering.
Problems b) and c) are fixed by locking i_mutex and ip_alloc_sem of
global quota file in ocfs2_lock_global_qf and removing ip_alloc_sem from
ocfs2_quota_read and ocfs2_quota_write.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The position of global quota file info does not change. So we do not have
to do logical -> physical block translation every time we reread it from
disk. Thus we can also avoid taking ip_alloc_sem.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
There is no need to map offset of local dquot structure to on disk block
in each quota write. It is enough to map it just once and store the physical
block number in quota structure in memory. Moreover this simplifies locking
as we do not have to take ip_alloc_sem from quota write path.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, __dquot_transfer() acquires its own references of dquot structures
that will be put into inode. But for OCFS2, this creates a lock inversion
between dq_lock (waited on in dqget) and transaction start (started in
ocfs2_setattr). Currently, deadlock is impossible because dq_lock is acquired
only during dquot_acquire and dquot_release and we already hold a reference to
dquot structures in ocfs2_setattr so neither of these functions can be called
while we call dquot_transfer. But this is rather subtle and it is hard to teach
lockdep about it. So provide __dquot_transfer function that can be passed dquot
references directly. OCFS2 can then pass acquired dquot references directly to
__dquot_transfer with proper locking.
Signed-off-by: Jan Kara <jack@suse.cz>
Quota must be initialized if a size or uid/gid change is requested.
But initialization is performed in two different places:
in the case of i_size the file system is responsible for dquot init,
but in the case of uid/gid, init is called internally in
dquot_transfer().
This ambiguity makes the code harder to understand.
Let's move this logic to one common helper function.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
The methods already do these checks, so remove them in the quotactl
implementation to allow non-VFS quota implementations to also support
these calls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Pass the larger struct fs_disk_quota to the ->set_dqblk operation so
that the Q_SETQUOTA and Q_XSETQUOTA operations can be implemented
with a single filesystem operation and we can retire the ->set_xquota
operation. The additional information (RT-subvolume accounting and
warn counts) are left zero for the VFS quota implementation.
Add new fieldmask values for setting the number of blocks and inodes,
which are required for the VFS quota but were not for XFS.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
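The field-mask idea in the ->set_dqblk change above can be sketched as follows: the caller passes one large structure plus a mask saying which members are meaningful, and the implementation updates only those. The `TOY_DQ_*` bits and both structures are hypothetical stand-ins; the kernel's real struct fs_disk_quota layout and FS_DQ_* values are deliberately not reproduced here.

```c
#include <stdint.h>

/* Hypothetical mask bits, one per settable field. */
#define TOY_DQ_BSOFT  (1u << 0)
#define TOY_DQ_BHARD  (1u << 1)
#define TOY_DQ_BCOUNT (1u << 2)

/* Request as passed by the caller: only masked fields are valid. */
struct toy_disk_quota {
    uint32_t fieldmask;
    uint64_t blk_softlimit;
    uint64_t blk_hardlimit;
    uint64_t bcount;
};

/* In-memory quota state being updated. */
struct toy_dquot {
    uint64_t blk_softlimit;
    uint64_t blk_hardlimit;
    uint64_t bcount;
};

/* Apply only the fields selected by the mask; leave everything else alone. */
static void toy_set_dqblk(struct toy_dquot *dq, const struct toy_disk_quota *req)
{
    if (req->fieldmask & TOY_DQ_BSOFT)
        dq->blk_softlimit = req->blk_softlimit;
    if (req->fieldmask & TOY_DQ_BHARD)
        dq->blk_hardlimit = req->blk_hardlimit;
    if (req->fieldmask & TOY_DQ_BCOUNT)
        dq->bcount = req->bcount;
}
```

A single operation of this shape can serve two front ends as long as each one fills in the mask for the fields it actually supplies.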
Pass the larger struct fs_disk_quota to the ->get_dqblk operation so
that the Q_GETQUOTA and Q_XGETQUOTA operations can be implemented
with a single filesystem operation and we can retire the ->get_xquota
operation. The additional information (RT-subvolume accounting and
warn counts) are left zero for the VFS quota implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
ext4 was updated to accept barrier/nobarrier mount options
in addition to the older barrier=0/1. The barrier story
is complex enough, we should help people by making the options
the same at least, even if the defaults are different.
This patch allows the barrier/nobarrier mount options for ext3,
while keeping nobarrier the default.
It also unconditionally displays barrier status in show_options,
and prints a message at mount time if barriers are not enabled,
just as ext4 does.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Quota stats are a write-mostly data structure. Let's allocate a percpu
bucket for each value.
NOTE: the dqstats_read() function is racy against dqstats_{inc,dec}
and may return an inconsistent value. But this is ok since absolute
accuracy is not required.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
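The per-CPU bucket scheme from the quota stats change above can be modelled in plain C as one counter slot per CPU, with the reader summing all slots. This is a single-threaded sketch under stated assumptions: `TOY_NR_CPUS`, the `TOY_STAT_*` indices and the `toy_dqstats_*` helpers are hypothetical, the CPU number is passed in explicitly instead of being taken from the running CPU, and clamping a negative sum to zero is an illustrative choice.

```c
#define TOY_NR_CPUS 4

/* Hypothetical statistic indices. */
enum { TOY_STAT_LOOKUPS, TOY_STAT_DROPS, TOY_STAT_NR };

/* One bucket per CPU per statistic, so writers never share a slot. */
struct toy_dqstats {
    long counter[TOY_NR_CPUS][TOY_STAT_NR];
};

static void toy_dqstats_inc(struct toy_dqstats *s, int cpu, int type)
{
    s->counter[cpu][type]++;
}

static void toy_dqstats_dec(struct toy_dqstats *s, int cpu, int type)
{
    s->counter[cpu][type]--;
}

/* Sum every CPU's bucket; an unsynchronized sum may be transiently off. */
static long toy_dqstats_read(const struct toy_dqstats *s, int type)
{
    long sum = 0;
    int cpu;

    for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        sum += s->counter[cpu][type];
    return sum < 0 ? 0 : sum;   /* never report a negative total */
}
```

Individual buckets may go negative when an increment and its matching decrement land on different CPUs; only the sum is meaningful.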
Suppress compilation warning: "quotatypes" defined but not used.
quotatypes is used only when CONFIG_QUOTA_DEBUG or CONFIG_PRINT_QUOTA_WARNING
is/are defined.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
log_start_commit() returns 1 only when it started a transaction
commit. Thus in case transaction commit is already running, we
fail to wait for the commit to finish. Fix the issue by always
waiting for the commit regardless of the log_start_commit return
value.
Signed-off-by: Jan Kara <jack@suse.cz>
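The fsync problem described above comes down to treating "I did not start the commit" as "there is nothing to wait for". A toy model of the two call patterns, assuming a hypothetical `toy_journal` with just two flags (none of these names come from jbd):

```c
struct toy_journal {
    int commit_in_progress;   /* a commit of our transaction is running */
    int waited;               /* set once the caller waited for completion */
};

/* Returns 1 only when this call is the one that started the commit. */
static int toy_log_start_commit(struct toy_journal *j)
{
    if (j->commit_in_progress)
        return 0;
    j->commit_in_progress = 1;
    return 1;
}

/* Model of waiting until the commit has finished. */
static void toy_log_wait_commit(struct toy_journal *j)
{
    j->commit_in_progress = 0;
    j->waited = 1;
}

/* Buggy pattern: skips the wait when someone else started the commit. */
static void toy_fsync_buggy(struct toy_journal *j)
{
    if (toy_log_start_commit(j))
        toy_log_wait_commit(j);
}

/* Fixed pattern: wait regardless of who started the commit. */
static void toy_fsync_fixed(struct toy_journal *j)
{
    toy_log_start_commit(j);
    toy_log_wait_commit(j);
}
```

With a commit already running, the buggy variant returns without waiting, which is exactly the case the fix covers.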
Provide a function which returns whether a transaction with given tid
will send a barrier to the filesystem device. The function will be used
by ext3 to detect whether fsync needs to send a separate barrier or not.
Signed-off-by: Jan Kara <jack@suse.cz>
Generic setattr is no longer responsible for quota transfer.
Use ufs_setattr for all of ufs's inodes.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
The BKL is still used in ext2_put_super(), ext2_fill_super(), ext2_sync_fs(),
ext2_remount() and ext2_write_inode(). From these calls ext2_put_super(),
ext2_fill_super() and ext2_remount() are protected against each other by
the struct super_block s_umount rw semaphore. The call in ext2_write_inode()
could only protect the modification of the ext2_sb_info through
ext2_update_dynamic_rev() against concurrent ext2_sync_fs() or ext2_remount().
ext2_fill_super() and ext2_put_super() can be left out because you need a
valid filesystem reference in all three cases, which you do not have when
you are in one of these functions.
If the BKL is only protecting the modification of the ext2_sb_info it can
safely be removed since this is protected by the struct ext2_sb_info s_lock.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Add a spinlock that protects against concurrent modifications of
s_mount_state, s_blocks_last, s_overhead_last and the content of the
superblock's buffer pointed to by sbi->s_es. The spinlock is now used in
ext2_xattr_update_super_block() which was setting the
EXT2_FEATURE_COMPAT_EXT_ATTR flag on the superblock without protection
before. Likewise the spinlock is used in ext2_show_options() to have a
consistent view of the mount options.
This is a preparation patch for removing the BKL from ext2 in the next
patch.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
Move ext2_write_super() out of ext2_setup_super() as a preparation for the
next patch that adds a new lock for superblock fields.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Both functions originally did similar things except that ext2_sync_super()
is returning after the call to sync_dirty_buffer(sbh). Therefore this
patch adds a wait flag to tell ext2_sync_super() if it has to call
sync_dirty_buffer() to wait for in-progress I/O to finish.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Depending on the state (valid or unchecked) of the filesystem either
ext2_sync_super() or ext2_commit_super() is called. If the filesystem is
currently valid (it is checked), we first mark it unchecked and afterwards
duplicate the work that ext2_sync_super() is doing later. Therefore this
patch removes the duplicate code and calls ext2_sync_super() directly after
marking the filesystem unchecked.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
This is probably a typo since the write time should actually be updated by
ext2_sync_fs() instead of the mount time.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
ext2_sync_fs() used to duplicate the code from ext2_clear_super_error().
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently block/inode/dir counters are initialized before the journal is
recovered. In fact, after journal recovery this info will probably
change, which results in incorrect numbers returned from statfs(2).
BUG:#15768
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
This patch removes a useless call to brelse(bitmap_bh) since at that
point bitmap_bh is NULL and slightly cleans up bitmap_bh handling.
Signed-off-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
- Skip locking if quota is dirty already.
- Return old quota state to help fs-specific implementations optimize the
case where quota was dirty already.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
There is no point in loading bitmap for groups which are completely full.
This causes noticeable performance problems (and memory pressure) on small
systems with large full filesystem
(http://marc.info/?l=linux-ext4&m=126843108314310&w=2).
Port of the same ext3 patch.
Signed-off-by: Jan Kara <jack@suse.cz>
There is no point in loading bitmap for groups which are completely full.
This causes noticeable performance problems (and memory pressure) on small
systems with large full filesystem
(http://marc.info/?l=linux-ext4&m=126843108314310&w=2).
Jan Kara: Added a comment and changed check to use cpu-endian value.
Signed-off-by: "Frans van de Wiel" <fvdw@fvdw.eu>
Signed-off-by: Jan Kara <jack@suse.cz>
This allows bin_attr->read,write,mmap callbacks to check file specific data
(such as inode owner) as part of any privilege validation.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In Al's latest vfs tree the code is reworked and S_BIAS has been removed.
It turns out that checking to see if a super block is in the
middle of an unmount in sysfs_exit_ns is unnecessary because we
remove the super_block from the s_supers/s_instances list before
struct sysfs_super_info pointed to by sb->s_fs_info is freed.
For now just delete the unnecessary check to see if a superblock is in the
middle of an unmount, it isn't necessary with or without Al's changes
and it just causes a needless conflict.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add some in-line comments to explain the new infrastructure, which
was introduced to support sysfs directory tagging with namespaces.
I think an overall description someplace might be good too, but it
didn't really seem to fit into Documentation/filesystems/sysfs.txt,
which appears more geared toward users, rather than maintainers, of
sysfs.
(Tejun, please let me know if I can make anything clearer or failed
altogether to comment something that should be commented.)
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When removing a symlink sysfs_remove_link does not provide
enough information to figure out which tagged directory the symlink
falls in. So I need sysfs_delete_link which is passed the target
of the symlink to delete.
sysfs_rename_link is updated to call sysfs_delete_link instead
of sysfs_remove_link as we have all of the information necessary
and the callers are interesting.
Both of these functions now have enough information to find a symlink
in a tagged directory. The only restriction is that they must be called
before the target kobject is renamed or deleted. If they are called
later I lose track of which tag the target kobject was marked with
and can no longer find the old symlink to remove it.
This patch was split from an earlier patch.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
I had hoped to avoid this but the bonding driver adds a file
to /sys/class/net/ and the easiest way to handle that file is
to make it untagged and to register it only once.
So relax the rules on tagged directories, and make bonding work.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The problem. When implementing a network namespace I need to be able
to have multiple network devices with the same name. Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.
What this patch does is to add an additional tag field to the
sysfs dirent structure. For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible. Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.
I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.
For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded. Which means I need
a simple race free way to setup those directories as tagged.
To achieve a race free design all tagged directories are created
and managed by sysfs itself.
Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_ns_types with the type and its operations
- call sysfs_exit_ns when an individual tag is no longer valid
- Implement mount_ns() which returns the ns of the calling process
so we can attach it to a sysfs superblock.
- Implement ktype.namespace() which returns the ns of a sysfs kobject.
Everything else is left up to sysfs and the driver layer.
For the network namespace mount_ns and namespace() are essentially
one line functions, and look to remain that.
Tags are currently represented as const void * pointers as that is
both generic, provides enough information for equality comparisons,
and is trivial to create for current users, as it is just the
existing namespace pointer.
The work needed in sysfs is more extensive. At each directory
or symlink creation I need to check if the directory it is being
created in is a tagged directory and if so generate the appropriate
tag to place on the sysfs_dirent. Likewise at each symlink or
directory removal I need to check if the sysfs directory it is
being removed from is a tagged directory and if so figure out
which tag goes along with the name I am deleting.
Currently only directories which hold kobjects, and
symlinks are supported. There is not enough information
in the current file attribute interfaces to give us anything
to discriminate on which makes it useless, and there are
no potential users which makes it an uninteresting problem
to solve.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
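The tagged-directory concept described above (several entries with the same name in one directory, told apart by an opaque tag) can be sketched with a flat array. `toy_dirent` and `toy_find_dirent()` are hypothetical names for this illustration; the real sysfs dirent structure and lookup code are not reproduced here. The tag is a `const void *` compared only for equality, as the commit message describes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical directory entry: a name plus the tag (namespace) it belongs to. */
struct toy_dirent {
    const char *name;
    const void *ns;
};

/* Find an entry by name, but only among entries carrying the caller's tag. */
static const struct toy_dirent *
toy_find_dirent(const struct toy_dirent *entries, size_t n,
                const void *ns, const char *name)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (entries[i].ns == ns && strcmp(entries[i].name, name) == 0)
            return &entries[i];
    }
    return NULL;
}
```

Two namespaces can then each own an "eth0" at the same path, and a lookup made on behalf of one namespace never sees the other's entry.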
Add all of the necessary boilerplate to support
multiple superblocks in sysfs.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Make devtmpfs available on (embedded) configurations without SHMEM/TMPFS,
using ramfs instead.
Saves ~15KB.
Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This is to match ext3 behaviour. We should not allow getting of
xattrs relating to ACLs when ACLs are turned off.
Reported-by: Nate Straz <nstraz@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The previous patch I wrote for reclaiming unlinked dinodes
had some shortcomings and did not prevent all hangs.
This version is much cleaner and more logical, and has
passed very difficult testing. Sorry for the churn.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (23 commits)
nilfs2: disallow remount of snapshot from/to a regular mount
nilfs2: use huge_encode_dev/huge_decode_dev
nilfs2: update comment on deactivate_super at nilfs_get_sb
nilfs2: replace MS_VERBOSE with MS_SILENT
nilfs2: add missing initialization of s_mode
nilfs2: fix misuse of open_bdev_exclusive/close_bdev_exclusive
nilfs2: enlarge s_volume_name member in nilfs_super_block
nilfs2: use checkpoint number instead of timestamp to select super block
nilfs2: add missing endian conversion on super block magic number
nilfs2: make nilfs_sc_*_ops static
nilfs2: add kernel doc comments to persistent object allocator functions
nilfs2: change sc_timer from a pointer to an embedded one in struct nilfs_sc_info
nilfs2: remove nilfs_segctor_init() in segment.c
nilfs2: insert checkpoint number in segment summary header
nilfs2: add a print message after loading nilfs2
nilfs2: cleanup multi kmem_cache_{create,destroy} code
nilfs2: move out checksum routines to segment buffer code
nilfs2: move pointer to super root block into logs
nilfs2: change default of 'errors' mount option to 'remount-ro' mode
nilfs2: Combine nilfs_btree_release_path() and nilfs_btree_free_path()
...
* git://git.infradead.org/mtd-2.6: (154 commits)
mtd: cfi_cmdset_0002: use AMD standard command-set with Winbond flash chips
mtd: cfi_cmdset_0002: Fix MODULE_ALIAS and linkage for new 0701 commandset ID
mtd: mxc_nand: Remove duplicate NAND_CMD_RESET case value
mtd: update gfp/slab.h includes
jffs2: Stop triggering block erases from jffs2_write_super()
jffs2: Rename jffs2_erase_pending_trigger() to jffs2_dirty_trigger()
jffs2: Use jffs2_garbage_collect_trigger() to trigger pending erases
jffs2: Require jffs2_garbage_collect_trigger() to be called with lock held
jffs2: Wake GC thread when there are blocks to be erased
jffs2: Erase pending blocks in GC pass, avoid invalid -EIO return
jffs2: Add 'work_done' return value from jffs2_erase_pending_blocks()
mtd: mtdchar: Do not corrupt backing device of device node inode
mtd/maps/pcmciamtd: Fix printk format for ssize_t in debug messages
drivers/mtd: Use kmemdup
mtd: cfi_cmdset_0002: Fix argument order in bootloc warning
mtd: nand: add Toshiba TC58NVG0 device ID
pcmciamtd: add another ID
pcmciamtd: coding style cleanups
pcmciamtd: fixing obvious errors
mtd: chips: add SST39WF160x NOR-flashes
...
Trivial conflicts due to dev_node removal in drivers/mtd/maps/pcmciamtd.c
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (54 commits)
xfs: mark xfs_iomap_write_ helpers static
xfs: clean up end index calculation in xfs_page_state_convert
xfs: clean up mapping size calculation in __xfs_get_blocks
xfs: clean up xfs_iomap_valid
xfs: move I/O type flags into xfs_aops.c
xfs: kill struct xfs_iomap
xfs: report iomap_bn in block base
xfs: report iomap_offset and iomap_bsize in block base
xfs: remove iomap_delta
xfs: remove iomap_target
xfs: limit xfs_imap_to_bmap to a single mapping
xfs: simplify buffer to transaction matching
xfs: Make fiemap work in query mode.
xfs: kill off l_sectbb_mask
xfs: record log sector size rather than log2(that)
xfs: remove dead XFS_LOUD_RECOVERY code
xfs: removed unused XFS_QMOPT_ flags
xfs: remove a few macro indirections in the quota code
xfs: access quotainfo structure directly
xfs: wait for direct I/O to complete in fsync and write_inode
...
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (47 commits)
ocfs2: Silence a gcc warning.
ocfs2: Don't retry xattr set in case value extension fails.
ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break
ocfs2: Reset xattr value size after xa_cleanup_value_truncate().
fs/ocfs2/dlm: Use kstrdup
fs/ocfs2/dlm: Drop memory allocation cast
Ocfs2: Optimize punching-hole code.
Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public.
Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing.
Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead.
ocfs2: Block signals for mkdir/link/symlink/O_CREAT.
ocfs2: Wrap signal blocking in void functions.
ocfs2/dlm: Increase o2dlm lockres hash size
ocfs2: Make ocfs2_extend_trans() really extend.
ocfs2/trivial: Code cleanup for allocation reservation.
ocfs2: make ocfs2_adjust_resv_from_alloc simple.
ocfs2: Make nointr a default mount option
ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE
o2net: log socket state changes
ocfs2: print node # when tcp fails
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1674 commits)
qlcnic: adding co maintainer
ixgbe: add support for active DA cables
ixgbe: dcb, do not tag tc_prio_control frames
ixgbe: fix ixgbe_tx_is_paused logic
ixgbe: always enable vlan strip/insert when DCB is enabled
ixgbe: remove some redundant code in setting FCoE FIP filter
ixgbe: fix wrong offset to fc_frame_header in ixgbe_fcoe_ddp
ixgbe: fix header len when unsplit packet overflows to data buffer
ipv6: Never schedule DAD timer on dead address
ipv6: Use POSTDAD state
ipv6: Use state_lock to protect ifa state
ipv6: Replace inet6_ifaddr->dead with state
cxgb4: notify upper drivers if the device is already up when they load
cxgb4: keep interrupts available when the ports are brought down
cxgb4: fix initial addition of MAC address
cnic: Return SPQ credit to bnx2x after ring setup and shutdown.
cnic: Convert cnic_local_flags to atomic ops.
can: Fix SJA1000 command register writes on SMP systems
bridge: fix build for CONFIG_SYSFS disabled
ARCNET: Limit com20020 PCI ID matches for SOHARD cards
...
Fix up various conflicts with pcmcia tree drivers/net/
{pcmcia/3c589_cs.c, wireless/orinoco/orinoco_cs.c and
wireless/orinoco/spectrum_cs.c} and feature removal
(Documentation/feature-removal-schedule.txt).
Also fix a non-content conflict due to pm_qos_requirement getting
renamed in the PM tree (now pm_qos_request) in net/mac80211/scan.c
This patch modifies the fs/timerfd.c to use the newly created
wait_event_interruptible_locked_irq() macro. This replaces an open
code implementation with a single macro call.
Signed-off-by: Michal Nazarewicz <m.nazarewicz@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
vlynq: make whole Kconfig-menu dependant on architecture
add descriptive comment for TIF_MEMDIE task flag declaration.
EEPROM: max6875: Header file cleanup
EEPROM: 93cx6: Header file cleanup
EEPROM: Header file cleanup
agp: use NULL instead of 0 when pointer is needed
rtc-v3020: make bitfield unsigned
PCI: make bitfield unsigned
jbd2: use NULL instead of 0 when pointer is needed
cciss: fix shadows sparse warning
doc: inode uses a mutex instead of a semaphore.
uml: i386: Avoid redefinition of NR_syscalls
fix "seperate" typos in comments
cocbalt_lcdfb: correct sections
doc: Change urls for sparse
Powerpc: wii: Fix typo in comment
i2o: cleanup some exit paths
Documentation/: it's -> its where appropriate
UML: Fix compiler warning due to missing task_struct declaration
UML: add kernel.h include to signal.c
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (61 commits)
KEYS: Return more accurate error codes
LSM: Add __init to fixup function.
TOMOYO: Add pathname grouping support.
ima: remove ACPI dependency
TPM: ACPI/PNP dependency removal
security/selinux/ss: Use kstrdup
TOMOYO: Use stack memory for pending entry.
Revert "ima: remove ACPI dependency"
Revert "TPM: ACPI/PNP dependency removal"
KEYS: Do preallocation for __key_link()
TOMOYO: Use mutex_lock_interruptible.
KEYS: Better handling of errors from construct_alloc_key()
KEYS: keyring_serialise_link_sem is only needed for keyring->keyring links
TOMOYO: Use GFP_NOFS rather than GFP_KERNEL.
ima: remove ACPI dependency
TPM: ACPI/PNP dependency removal
selinux: generalize disabling of execmem for plt-in-heap archs
LSM Audit: rename LSM_AUDIT_NO_AUDIT to LSM_AUDIT_DATA_NONE
CRED: Holding a spinlock does not imply the holding of RCU read lock
SMACK: Don't #include Ext2 headers
...
* 'for-2.6.35' of git://linux-nfs.org/~bfields/linux: (45 commits)
Revert "nfsd4: distinguish expired from stale stateids"
nfsd: safer initialization order in find_file()
nfs4: minor callback code simplification, comment
NFSD: don't report compiled-out versions as present
nfsd4: implement reclaim_complete
nfsd4: nfsd4_destroy_session must set callback client under the state lock
nfsd4: keep a reference count on client while in use
nfsd4: mark_client_expired
nfsd4: introduce nfs4_client.cl_refcount
nfsd4: refactor expire_client
nfsd4: extend the client_lock to cover cl_lru
nfsd4: use list_move in move_to_confirmed
nfsd4: fold release_session into expire_client
nfsd4: rename sessionid_lock to client_lock
nfsd4: fix bare destroy_session null dereference
nfsd4: use local variable in nfs4svc_encode_compoundres
nfsd: further comment typos
sunrpc: centralise most calls to svc_xprt_received
nfsd4: fix unlikely race in session replay case
nfsd4: fix filehandle comment
...
* 'nfs-for-2.6.35' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (78 commits)
SUNRPC: Don't spam gssd with upcall requests when the kerberos key expired
SUNRPC: Reorder the struct rpc_task fields
SUNRPC: Remove the 'tk_magic' debugging field
SUNRPC: Move the task->tk_bytes_sent and tk_rtt to struct rpc_rqst
NFS: Don't call iput() in nfs_access_cache_shrinker
NFS: Clean up nfs_access_zap_cache()
NFS: Don't run nfs_access_cache_shrinker() when the mask is GFP_NOFS
SUNRPC: Ensure rpcauth_prune_expired() respects the nr_to_scan parameter
SUNRPC: Ensure memory shrinker doesn't waste time in rpcauth_prune_expired()
SUNRPC: Dont run rpcauth_cache_shrinker() when gfp_mask is GFP_NOFS
NFS: Read requests can use GFP_KERNEL.
NFS: Clean up nfs_create_request()
NFS: Don't use GFP_KERNEL in rpcsec_gss downcalls
NFSv4: Don't use GFP_KERNEL allocations in state recovery
SUNRPC: Fix xs_setup_bc_tcp()
SUNRPC: Replace jiffies-based metrics with ktime-based metrics
ktime: introduce ktime_to_ms()
SUNRPC: RPC metrics and RTT estimator should use same RTT value
NFS: Calldata for nfs4_renew_done()
NFS: Squelch compiler warning in nfs_add_server_stats()
...
* 'bkl/procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
sunrpc: Include missing smp_lock.h
procfs: Kill the bkl in ioctl
procfs: Push down the bkl from ioctl
procfs: Use generic_file_llseek in /proc/vmcore
procfs: Use generic_file_llseek in /proc/kmsg
procfs: Use generic_file_llseek in /proc/kcore
procfs: Kill BKL in llseek on proc base
This is the culmination of this sequence of patches. By moving the block
erasing from jffs2_write_super() into the GC code, we avoid huge
latencies on unmount where it waits for _all_ pending blocks to be
erased, and we allow better control for time-critical tasks by stopping
the GC thread.
Signed-off-by: Joakim Tjernlund <joakim.tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Now that we do erases from GC and trigger the GC thread to do them
instead of using kupdated, this function is misnamed. It's only used
for triggering wbuf flush on NAND flash now. Rename it accordingly.
Signed-off-by: Joakim Tjernlund <joakim.tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
We're about to call this from a bunch of places which already hold
c->erase_completion_lock, so add an assertion and change its existing
callers to do the same.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Now that we trigger block erases from jffs2_garbage_collect_pass(),
adjust jffs2_thread_should_wake() to return 1 when there are blocks to
erase.
Signed-off-by: Joakim Tjernlund <joakim.tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
jffs2_garbage_collect_pass() would previously return -EAGAIN if it
couldn't find anything to garbage collect from, and there were blocks on
the erase_pending_list. If the blocks were actually in the process of
being erased, though, then they wouldn't be on that list. Check for
nr_erasing_blocks being non-zero instead.
Fix jffs2_reserve_space() to wait for the in-progress erases to
complete, when jffs2_garbage_collect_pass() returns -EAGAIN.
And fix jffs2_erase_succeeded() to actually wake up the erase_wait wq
that jffs2_reserve_space() is now using.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
We're about to start calling this from the jffs2_garbage_collect_pass(), and
we'll want to know whether it actually did anything or not.
Signed-off-by: Joakim Tjernlund <joakim.tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Rename all iomap_valid identifiers to imap_valid to fit the new
world order, and clean up xfs_iomap_valid to convert the passed in
offset to blocks instead of the imap values to bytes. Use the
simpler inode->i_blkbits instead of the XFS macros for this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
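To illustrate the imap_valid cleanup described above, here is a minimal userspace sketch of a validity check that converts the byte offset to blocks rather than converting the mapping to bytes. The struct, its field names, and the demo_ function are hypothetical stand-ins, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the mapping record; field names are illustrative. */
struct demo_imap {
	uint64_t startoff;	/* first file block covered by the mapping */
	uint64_t blockcount;	/* length of the mapping in blocks */
};

/* Return true if the byte offset falls inside the mapping. */
static bool demo_imap_valid(unsigned int blkbits, const struct demo_imap *imap,
			    uint64_t offset)
{
	uint64_t block;

	assert(blkbits < 64);
	block = offset >> blkbits;	/* byte offset -> block number */
	return block >= imap->startoff &&
	       block < imap->startoff + imap->blockcount;
}
```

Shifting the offset once keeps the comparison in the mapping's own units, which is the simplification the commit message describes.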
The IOMAP_ flags are now only used inside xfs_aops.c for extent
probing and I/O completion tracking, so move them here, and rename
them to IO_* as there's no mapping involved at all.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Now that struct xfs_iomap contains exactly the same units as struct
xfs_bmbt_irec we can just use the latter directly in the aops code.
Replace the missing IOMAP_NEW flag with a new boolean output
parameter to xfs_iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Report the iomap_bn field of struct xfs_iomap in terms of filesystem
blocks instead of in terms of bytes. Shift the byte conversions
into the caller, and replace the IOMAP_DELAY and IOMAP_HOLE flag
checks with checks for HOLESTARTBLOCK and DELAYSTARTBLOCK.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Report the iomap_offset and iomap_bsize fields of struct xfs_iomap
in terms of fsblocks instead of in terms of disk blocks. Shift the
byte conversions into the callers temporarily, but they will
disappear or get cleaned up later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The iomap_delta field in struct xfs_iomap just contains the
difference between the offset passed to xfs_iomap and the
iomap_offset. Just calculate it in the only caller that cares.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Instead of using the iomap_target field in struct xfs_iomap
and the IOMAP_REALTIME flag just use the already existing
xfs_find_bdev_for_inode helper. There's some fallout as we
need to pass the inode in a few more places, which we also
use to sanitize some calling conventions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We only call xfs_iomap for single mappings anyway, so remove all
code dealing with multiple mappings from xfs_imap_to_bmap and add
asserts that we never get results that we do not expect.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We currently have a routine xfs_trans_buf_item_match_all which checks
if any log item in a transaction contains a given buffer, and a
second one that only does this check for the first, embedded chunk
of log items. We only use the second routine if we know we only
have that log item chunk, so get rid of the limited routine and
always use the more complete one.
Also rename the old xfs_trans_buf_item_match_all to
xfs_trans_buf_item_match and update various surrounding comments,
and move the remaining xfs_trans_buf_item_match on top of the file
to avoid a forward prototype.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
According to Documentation/filesystems/fiemap.txt, if fm_extent_count
is zero, then the fm_extents[] array is ignored (no extents will be
returned), and the fm_mapped_extents count will hold the number of
extents needed.
However, commit 97db39a1f6 changed bmv_count to the size of the caller's
input buffer, so this extent count query no longer works. That commit was
written to stop using MAXEXTNUM for bmv_count, which caused ENOMEM.
This patch just tries to set bm.bmv_count to something sane.
Thanks to Dave Chinner <david@fromorbit.com> for the suggestion.
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
There remains only one user of the l_sectbb_mask field in the log
structure. Just kill it off and compute the mask where needed from
the power-of-2 sector size.
(Only update from last post is to accommodate the changes in the
previous patch in the series.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Change struct log so it keeps track of the size (in basic blocks) of
a log sector in l_sectBBsize rather than the log-base-2 of that
value (previously, l_sectbb_log). The name was chosen for
consistency with the other fields in the structure that represent
a number of basic blocks.
(Updated so that a variable used in computing and verifying a log's
sector size is named "log2_size". Also added the "BB" to the
structure field name, based on feedback from Eric Sandeen. Also
dropped some superfluous parentheses.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
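The two commits above store the sector size itself rather than its log-base-2, and derive the mask on demand. A minimal sketch of that arrangement follows, assuming a hypothetical demo_log structure with a single field; the real structure and field names differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical miniature of the log structure; only the field discussed. */
struct demo_log {
	uint32_t sect_bbsize;	/* sector size in basic blocks, a power of two */
};

/* Record the sector size from its log-base-2, as verified at mount time. */
static void demo_set_sector_size(struct demo_log *log, unsigned int log2_size)
{
	assert(log2_size < 32);
	log->sect_bbsize = (uint32_t)1 << log2_size;
}

/* Mask computed where needed, so no separate mask field has to be stored. */
static uint32_t demo_sector_mask(const struct demo_log *log)
{
	return log->sect_bbsize - 1;
}
```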
This can't be enabled through the build system and has been dead for
ages. Note that the CRC patches add back log checksumming, but the
code is quite different from the version removed here anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Access fields in m_quotainfo directly instead of hiding them behind the
XFS_QI_* macros. Add local variables for the quotainfo pointer in places
where we have lots of them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We need to wait for all pending direct I/O requests before taking care of
metadata in fsync and write_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
fs/xfs/linux-2.6/xfs_trace.c: xfs_attr_sf.h is included more than once.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
Odds and ends in "xfs_log_recover.c". This patch just contains some
minor things that didn't seem to warrant their own individual
patches:
- In xlog_bread_noalign(), drop an assertion that a pointer is
  non-null (the crash will tell us it was a bad pointer).
- Add a more descriptive header comment for xlog_find_verify_cycle().
- Make a few additions to the comments in xlog_find_head(). Also
  rearrange some expressions in a few spots to produce the same
  result, but in a way that seems more clear what's being computed.
(Updated in response to Dave's review comments. Note I did not
split this patch like I said I would.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In xlog_find_cycle_start() use a local variable for some repeated
operations rather than constantly accessing the memory location
whose address is passed in.
(This version drops an assertion that a pointer is non-null.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Rename a label used in xlog_find_head() that I thought was poorly
chosen. Also combine two adjacent labels in xlog_find_tail() into a
single label, and give it a more generic name.
(Now using Dave's suggested "validate_head" name for first label.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_bwrite is used with the intention of synchronously writing out
buffers, but currently it does not actually clear the async flag if
that's left from previous writes but instead implements async
behaviour if it finds it. Remove the code handling asynchronous
writes as we've got rid of those entirely outside of the log and
delwri buffers, and make sure that we clear the async and read flags
before writing the buffer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
All modifications to the superblock are done transactionally through
xfs_trans_log_buf, so there is no reason to initiate periodic
asynchronous writeback. This only removes the superblock from the
delwri list and will lead to sub-optimal I/O scheduling.
Cut down xfs_sync_fsdata now that it's only used for synchronous
superblock writes and move the log coverage checks into the two
callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The transaction ID that is written to the log for a transaction is
currently set by taking the lower 32 bits of the memory address of
the ticket structure. This is not guaranteed to be unique as
tickets come from a slab and slots can be reallocated immediately
after being freed. As a result, there is no guarantee of uniqueness
in the ticket ID value.
Fix this by assigning a random number to the ticket ID field so that
it is extremely unlikely that duplicates will occur and remove the
possibility of transactions being mixed up during recovery due to
duplicate IDs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
There are a number of places where a log sector size of 1 uses
special case code. The round_up() and round_down() macros
produce the correct result even when the log sector size is 1, and
this eliminates the need for treating this as a special case.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Define a function that encapsulates checking the validity of a log
block count.
(Updated from previous version--no longer includes error reporting in the
encapsulated validation function.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
XLOG_SECTOR_ROUNDUP_BBCOUNT() and XLOG_SECTOR_ROUNDDOWN_BLKNO()
are now fairly simple macro translations. Just get rid of them in
favor of the round_up() and round_down() macro calls they represent.
Also, in spots in xlog_get_bp() and xlog_write_log_records(),
round_up() was being called with value 1, which just evaluates
to the macro's second argument; so just use that instead.
In the latter case, make use of that value, as long as it's
already been computed.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
XLOG_SECTOR_ROUNDUP_BBCOUNT() is defined in "fs/xfs/xfs_log_recover.c"
in an overly-complicated way. It is basically roundup(), but that
is not at all clear from its definition. (Actually, there is
another macro round_up() that applies for power-of-two-based masks
which I'll be using here.)
The operands in XLOG_SECTOR_ROUNDUP_BBCOUNT() are basically the
block number (bbs) and the log sector basic block mask
(log->l_sectbb_mask). I'll call them B and M for this discussion.
The macro computes its value this way:
M && (B & M) ? (B + M + 1) & ~M : B
Put another way, we can break it into 3 cases:
  1) ! M            ->  B                   # 0 mask, no effect
  2) ! (B & M)      ->  B                   # sector aligned
  3) M && (B & M)   ->  (B + M + 1) & ~M    # round up otherwise
The round_up() macro is cleverly defined using a value, v, and a
power-of-2, p, and the result is the nearest multiple of p greater
than or equal to v. Its value is computed something like this:
((v - 1) | (p - 1)) + 1
Let's consider using this in the context of the 3 cases above.
When p = 2^0 = 1, the result boils down to ((v - 1) | 0) + 1, so it
just translates any value v to itself. That handles case (1) above.
When p = 2^n, n > 0, we know that (p - 1) will be a mask with all n
bits 0..n-1 set. The condition in this case occurs when none of
those mask bits is set in the value v provided. If that is the
case, subtracting 1 from v will have 1's in all those lower bits (at
least). Therefore, OR-ing the mask with that decremented value has
no effect, so adding the 1 back again will just translate the v to
itself. This handles case (2).
Otherwise, the value v is greater than some multiple of p, and
decrementing it will produce a result greater than or equal to that
multiple. OR-ing in the mask will produce a value 1 less than the
next multiple of p, so finally adding 1 back will result in the
desired rounded-up value. This handles case (3).
Hopefully this is convincing.
While I was at it, I converted XLOG_SECTOR_ROUNDDOWN_BLKNO() to use
the round_down() macro.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
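The case analysis above can also be checked by brute force. The sketch below restates both expressions from the commit message as functions and compares them over a range of values; the demo_ names are illustrative, and unsigned arithmetic is assumed so that v = 0 wraps as intended.

```c
#include <assert.h>
#include <stdint.h>

/* The old macro's logic, with b = block count and m = sector mask. */
static uint32_t demo_old_roundup(uint32_t b, uint32_t m)
{
	return (m && (b & m)) ? ((b + m + 1) & ~m) : b;
}

/* round_up() as described: v rounded up to a multiple of the power of two p. */
static uint32_t demo_round_up(uint32_t v, uint32_t p)
{
	assert(p != 0 && (p & (p - 1)) == 0);	/* p must be a power of two */
	return ((v - 1) | (p - 1)) + 1;
}

/* Return 1 if both forms agree for every b < limit and p = 1..2^max_log2. */
static int demo_forms_agree(uint32_t limit, unsigned int max_log2)
{
	assert(max_log2 < 32);
	for (unsigned int n = 0; n <= max_log2; n++) {
		uint32_t p = (uint32_t)1 << n;

		for (uint32_t b = 0; b < limit; b++)
			if (demo_old_roundup(b, p - 1) != demo_round_up(b, p))
				return 0;
	}
	return 1;
}
```

For example, demo_forms_agree(4096, 8) walks all three cases for sector sizes from 1 to 256 basic blocks.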
This fixes a bug in two places that I found by inspection. In
xlog_find_verify_cycle() and xlog_write_log_records(), the code
attempts to allocate a buffer to hold as many blocks as possible.
It gives up if the number of blocks to be allocated gets too small.
Right now it uses log->l_sectbb_log as that lower bound, but I'm
sure it's supposed to be the actual log sector size instead. That
is, the lower bound should be (1 << log->l_sectbb_log).
Also define a simple macro xlog_sectbb(log) to represent the number
of basic blocks in a sector for the given log.
(No change from original submission; I have implemented Christoph's
suggestion about storing l_sectsize rather than l_sectbb_log in
a new, separate patch in this series.)
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
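The effect of the lower bound described above can be seen in a small model of the halving loop. This is a sketch under assumptions: demo_alloc_ok() is a hypothetical allocator stand-in that succeeds only below a given size, and the loop shape is a simplification of what the commit message describes.

```c
#include <assert.h>

/* Hypothetical allocator: succeeds only for requests of at most max_ok blocks. */
static int demo_alloc_ok(unsigned int nblks, unsigned int max_ok)
{
	return nblks <= max_ok;
}

/*
 * Halve the request until the allocation succeeds; give up once the
 * request would drop below one log sector. Returns the block count
 * obtained, or 0 on failure.
 */
static unsigned int demo_pick_bufblks(unsigned int want,
				      unsigned int sectbb_log,
				      unsigned int max_ok)
{
	unsigned int min_blks;
	unsigned int bufblks = want;

	assert(sectbb_log < 32);
	min_blks = 1u << sectbb_log;	/* the sector size, not its log2 */
	while (!demo_alloc_ok(bufblks, max_ok)) {
		bufblks >>= 1;
		if (bufblks < min_blks)
			return 0;
	}
	return bufblks;
}
```

With a sector of 8 basic blocks (sectbb_log = 3), comparing against 3 instead of 8 would let the loop settle on a buffer smaller than a single sector.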
Change the tag and file name arguments to xfs_error_report() and
xfs_corruption_error() to use a const qualifier.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
The xfs_dqmarker structure does not need to exist anymore. Move the
remaining flags field out of it and remove the structure altogether.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Convert the dquot free list on the filesystem to use listhead
infrastructure rather than the roll-your-own in the quota code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Convert the dquot hash list on the filesystem to use listhead
infrastructure rather than the roll-your-own in the quota code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The dquot shaker and the free-list reclaim code use exactly the same
algorithm but the code is duplicated and slightly different in each
case. Make the shaker code use the single dquot reclaim code to
remove the code duplication.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Convert the dquot list on the filesystem to use listhead
infrastructure rather than the roll-your-own in the quota code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently there is no tracing in log recovery, so it is difficult to
determine what is going on when something goes wrong.
Add tracing for log item recovery to provide visibility into the log
recovery process. The tracing added shows regions being extracted
from the log transactions and added to the transaction hash forming
recovery items, followed by the reordering, cancelling and finally
recovery of the items.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Replace the awkward xlog_write_adv_cnt with an inline helper that makes
it more obvious that it's modifying its parameters, and replace the use
of an integer type for "ptr" with a real void pointer. Also move
xlog_write_adv_cnt to xfs_log_priv.h as it will be used outside of
xfs_log.c in the delayed logging series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
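An inline helper of the kind described above makes the modification of its arguments explicit by taking pointers. The sketch below is a userspace approximation; the parameter list and the demo_ name are assumptions, and a cast is used because standard C has no arithmetic on void pointers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Advance the write pointer and the offset by bytes, and shrink the
 * remaining length by the same amount. All three are passed by address,
 * so the call site shows that they are modified.
 */
static inline void demo_write_adv_cnt(void **ptr, int *len, int *off,
				      size_t bytes)
{
	assert(*len >= 0 && bytes <= (size_t)*len);
	*ptr = (char *)*ptr + bytes;
	*len -= (int)bytes;
	*off += (int)bytes;
}
```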
The current log IO vector structure is a flat array and not
extensible. To make it possible to keep separate log IO vectors for
individual log items, we need a method of chaining log IO vectors
together.
Introduce a new log vector type that can be used to wrap the
existing log IO vectors and use that internally to the log. This
means that the existing external interface (xfs_log_write) does not
change and hence no changes to the transaction commit code are
required.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
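The chaining idea above can be sketched as a small wrapper type that points at an existing flat array and links to the next wrapper. The type and field names below are hypothetical and are not the structures introduced by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* One region of data to be written (illustrative field names). */
struct demo_iovec {
	const void *addr;
	size_t len;
};

/* Wrapper that lets separate iovec arrays be chained together. */
struct demo_log_vec {
	struct demo_log_vec *next;		/* next vector, or NULL */
	size_t niovecs;				/* entries in iovecs */
	const struct demo_iovec *iovecs;	/* the existing flat array */
};

/* Wrap a flat array in a single-element chain, as an internal adapter might. */
static struct demo_log_vec demo_wrap(const struct demo_iovec *iov, size_t n)
{
	struct demo_log_vec lv = { NULL, n, iov };

	assert(iov != NULL || n == 0);
	return lv;
}

/* Total payload bytes across every vector in the chain. */
static size_t demo_chain_bytes(const struct demo_log_vec *lv)
{
	size_t total = 0;

	for (; lv != NULL; lv = lv->next)
		for (size_t i = 0; i < lv->niovecs; i++)
			total += lv->iovecs[i].len;
	return total;
}
```

Because the wrapper only points at the caller's array, an external interface that takes a flat array can stay unchanged while the internals walk chains.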
Reindent xlog_write to normal one tab indents and move all variable
declarations into the closest enclosing block.
Split from a bigger patch by Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xlog_write is a mess that takes a lot of effort to understand. It is
a mass of nested loops with 4 space indents to get it to fit in 80 columns
and lots of funky variables that aren't obvious what they mean or do.
Break it down into understandable chunks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When allocating a ticket for a transaction, the ticket is initialised with the
worst case log space usage based on the number of bytes the transaction may
consume. Part of this calculation is the number of log headers required for the
iclog space used up by the transaction.
This calculation makes an undocumented assumption that if the transaction uses
the log header space reservation on an iclog, then it consumes either the
entire iclog or it completes. That is - the transaction that is first in an
iclog is the transaction that the log header reservation is accounted to. If
the transaction is larger than the iclog, then it will use the entire iclog
itself. Document this assumption.
Further, the current calculation uses the rule that we can fit iclog_size bytes
of transaction data into an iclog. This is incorrect - the amount of space
available in an iclog for transaction data is the size of the iclog minus the
space used for log record headers. This means that the calculation is out by
512 bytes per 32k of log space the transaction can consume. This is rarely an
issue because maximally sized transactions are extremely uncommon, and for 4k
block size filesystems maximal transaction reservations are about 400kb. Hence
the error in this case is less than the size of an iclog, so that makes it even
harder to hit.
However, anyone using larger directory blocks (16k directory blocks push the
maximum transaction size to approx. 900k on a 4k block size filesystem) or
larger block size (e.g. 64k blocks push transactions to the 3-4MB size) could
see the error grow to more than an iclog and at this point the transaction is
guaranteed to get a reservation underrun and shutdown the filesystem.
Fix this by adjusting the calculation to calculate the correct number of iclogs
required and account for them all up front.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
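The difference between the two calculations described above can be sketched as follows. This assumes both versions round up to whole iclogs and uses hypothetical function names; it is not the kernel's reservation code.

```c
#include <assert.h>

/* Headers counted as if every iclog held iclog_size bytes of data. */
static unsigned int demo_headers_old(unsigned int unit_bytes,
				     unsigned int iclog_size)
{
	assert(iclog_size > 0);
	return (unit_bytes + iclog_size - 1) / iclog_size;
}

/* Headers counted with the record header subtracted from the usable space. */
static unsigned int demo_headers_new(unsigned int unit_bytes,
				     unsigned int iclog_size,
				     unsigned int hdr_size)
{
	unsigned int space;

	assert(iclog_size > hdr_size);
	space = iclog_size - hdr_size;	/* room left for transaction data */
	return (unit_bytes + space - 1) / space;
}
```

With a 32k iclog and a 512 byte header, a transaction of exactly 32768 bytes fits in one iclog by the old count but needs two by the corrected one.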
Now that the code has been factored, clean up all the remaining
style cruft, simplify the code and re-order functions so that it
doesn't need forward declarations.
Also move the remaining functions that require forward declarations
(xfs_trans_uncommit, xfs_trans_free) so that all the forward
declarations can be removed from the file.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The function header to xfs_trans_committed has long had this
comment:
* THIS SHOULD BE REWRITTEN TO USE xfs_trans_next_item()
To prepare for different methods of committing items, convert the
code to use xfs_trans_next_item() and factor the code into smaller,
more digestible chunks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
> +shut_us_down:
> + shutdown = XFS_FORCED_SHUTDOWN(mp) ? EIO : 0;
> + if (!(tp->t_flags & XFS_TRANS_DIRTY) || shutdown) {
> + xfs_trans_unreserve_and_mod_sb(tp);
> + /*
This whole area in _xfs_trans_commit is still a complete mess.
So while touching this code, unravel this mess as well to make the
whole flow of the function simpler and clearer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Split the part of xfs_trans_commit() that deals with writing the
transaction into the iclog into a separate function. This isolates the
physical commit process from the logical commit operation and makes
it easier to insert different transaction commit paths without affecting
the existing algorithm adversely.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_bmap_add_attrfork() passes XFS_TRANS_PERM_LOG_RES to xfs_trans_commit()
to indicate that the commit should release the permanent log reservation
as part of the commit. This is wrong - the correct flag is
XFS_TRANS_RELEASE_LOG_RES - and it is only by chance that both these
flags have the value of 0x4 that the code is doing the right thing.
Fix it by changing to use the correct flag.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The staleness of an object being unpinned can be directly derived
from the object itself - there is no need to extract it from the
object then pass it as a parameter into IOP_UNPIN().
This means we can kill the XFS_LID_BUF_STALE flag - it is set,
checked and cleared in the same places as the XFS_BLI_STALE flag in the
xfs_buf_log_item so it is now redundant and hence safe to remove.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We don't record pin counts in inode events right now, and this makes
it difficult to track down problems related to pinning inodes. Add
the pin count to the inode trace class and add trace events for
pinning and unpinning inodes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Each log item type does manual initialisation of the log item.
Delayed logging introduces new fields that need initialisation, so
factor all the open coded initialisation into a common function
first.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This makes it possible to see in `ps` and similar tools which kthreads are
allotted to which block device/filesystem, similar to what jbd2
does. As the process name is a fixed 16-char array, no extra
space is needed in tasks.
  PID TTY      STAT   TIME COMMAND
    2 ?        S      0:00 [kthreadd]
  197 ?        S      0:00  \_ [jbd2/sda2-8]
  198 ?        S      0:00  \_ [ext4-dio-unwrit]
  204 ?        S      0:00  \_ [flush-8:0]
 2647 ?        S      0:00  \_ [xfs_mru_cache]
 2648 ?        S      0:00  \_ [xfslogd/0]
 2649 ?        S      0:00  \_ [xfsdatad/0]
 2650 ?        S      0:00  \_ [xfsconvertd/0]
 2651 ?        S      0:00  \_ [xfsbufd/ram0]
 2652 ?        S      0:00  \_ [xfsaild/ram0]
 2653 ?        S      0:00  \_ [xfssyncd/ram0]
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
The am_hreq.opcount field in the xfs_attrmulti_by_handle() interface
is not bounded correctly. The opcount is used to determine the size
of the buffer required. The size is bounded, but can overflow and so
the size checks may not be sufficient to catch invalid opcounts.
Fix it by catching opcount values that would cause overflows before
calculating the size.
Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
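The ordering of checks described above, rejecting the count before multiplying, can be sketched like this. The record layout, the INT_MAX bound, and the demo_ names are assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-operation record; its layout only matters for sizeof. */
struct demo_multiop {
	int opcode;
	int error;
	void *name;
	void *value;
	unsigned int length;
	unsigned int flags;
};

/*
 * Compute the buffer size needed for opcount operations.
 * Returns 0 on success, or -1 if the count would overflow or the
 * resulting size is zero or larger than max_bytes.
 */
static int demo_ops_size(unsigned int opcount, size_t max_bytes,
			 size_t *size_out)
{
	size_t size;

	assert(size_out != NULL);
	/* Catch counts whose byte size cannot be represented, before multiplying. */
	if (opcount >= INT_MAX / sizeof(struct demo_multiop))
		return -1;
	size = (size_t)opcount * sizeof(struct demo_multiop);
	if (size == 0 || size > max_bytes)
		return -1;
	*size_out = size;
	return 0;
}
```

Without the first check, a very large count could wrap to a small size in a narrow integer type and slip past the bound on the size.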
ocfs2_block_group_claim_bits() is never called with min_bits=0, but we
shouldn't leave status undefined if it ever is.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In a normal xattr set, the sequence is inode, then xattr block,
and finally xattr bucket if we hit ENOSPC. But there
is a corner case.
Suppose we set an xattr whose value will be stored in
a cluster, and there is no xattr block yet. We
reserve 1 xattr block and 1 cluster for setting it. If
value extension then fails (for example, the volume is almost full and
we can't allocate the cluster because of the check in
ocfs2_test_bg_bit_allocatable), ENOSPC is returned. We
then try to create a bucket (this time there is a chance that
the reserved cluster will be used), and when we try value extension
again, a kernel bug happens. We did hit this. See the bug below.
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1251
This patch avoids this by adding set_abort to
ocfs2_xattr_set_ctxt. When ENOSPC happens in value extension,
we check whether it is caused by a real ENOSPC or just by the
inode or xattr block being full. In the first case, we set set_abort
so that we don't try any further. It is safe to exit directly here
since it really is ENOSPC.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Currently we process a dirty lockres with lockres->spinlock held. During
that processing, we may need to take dlm->ast_lock. This violates the lock
ordering of dlm->ast_lock (taken first) and lockres->spinlock (taken second).
This patch fixes the problem.
Since we can't release lockres->spinlock, we have to take dlm->ast_lock
just before taking lockres->spinlock and release it after lockres->spinlock
is released. We also use __dlm_queue_bast()/__dlm_queue_ast(), the nolock
versions, in dlm_shuffle_lists(). There are not many locks on a lockres, so
there is no performance harm.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2_prepare_xattr_entry, if we fail to grow an existing value,
xa_cleanup_value_truncate() will leave the old entry in place. Thus, we
reset its value size. However, if we were allocating a new value, we
must not reset the value size or we will BUG(). This resolves
oss.oracle.com bug 1247.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This reverts commit 78155ed75f.
We're depending here on the boot time that we use to generate the
stateid being monotonic, but get_seconds() is not necessarily.
We still depend at least on boot_time being different every time, but
that is a safer bet.
We have a few reports of errors that might be explained by this problem,
though we haven't been able to confirm any of them.
But the minor gain of distinguishing expired from stale errors seems not
worth the risk.
Conflicts:
fs/nfsd/nfs4state.c
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Use kstrdup when the goal of an allocation is to copy a string into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to;
expression flag,E1,E2;
statement S;
@@
- to = kmalloc(strlen(from) + 1,flag);
+ to = kstrdup(from, flag);
... when != \(from = E1 \| to = E1 \)
if (to==NULL || ...) S
... when != \(from = E2 \| to = E2 \)
- strcpy(to, from);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
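In plain C terms, the transformation made by the semantic patch above looks like the sketch below. It is a userspace analogy: malloc() stands in for kmalloc(), and demo_strdup() is a hypothetical stand-in for kstrdup() without the allocation flags argument.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Before: size the buffer by hand, check it, then copy. */
static char *demo_copy_open_coded(const char *from)
{
	char *to = malloc(strlen(from) + 1);

	if (to == NULL)
		return NULL;
	strcpy(to, from);
	return to;
}

/* Stand-in for kstrdup(): allocate and copy in one helper. */
static char *demo_strdup(const char *from)
{
	size_t len;
	char *to;

	assert(from != NULL);
	len = strlen(from) + 1;
	to = malloc(len);
	if (to != NULL)
		memcpy(to, from, len);
	return to;
}

/* After: a single call replaces the allocate-and-copy pair. */
static char *demo_copy_with_helper(const char *from)
{
	return demo_strdup(from);
}
```

Both versions return NULL on allocation failure, so callers keep their existing error check.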
Drop cast on the result of kmalloc and similar functions.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
@@
- (T *)
(\(kmalloc\|kzalloc\|kcalloc\|kmem_cache_alloc\|kmem_cache_zalloc\|
kmem_cache_alloc_node\|kmalloc_node\|kzalloc_node\)(...))
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch simplifies the logic of handling existing holes and
skipping extent blocks and removes some confusing comments.
The patch survived the fill_verify_holes testcase in ocfs2-test.
It also passed my manual sanity check and stress tests with enormous
extent records.
Currently, punching a hole in a file with an extent tree depth of 3 or
more is a performance disaster. It can even take several hours,
though we may not hit this in real life with such a huge number of
extents.
One simple way to improve the performance is quite straightforward.
From the logic of truncate, we can punch the hole from hole_end to
hole_start, which reduces the overhead of btree operations in a
significant way, such as tree rotation and moving.
The following is the test result when punching a hole from 0 to the end
of a 1G file. The file consists of 256k extent records, each covering
4k of data (just one cluster; the cluster size is 4k):
===========================================================================
* Original punching-hole mechanism:
===========================================================================
I waited 1 hour for its completion, unfortunately it's still ongoing.
===========================================================================
* Patched punching-hole mechanism:
===========================================================================
real 0m2.518s
user 0m0.000s
sys 0m2.445s
That means we've gained up to a 1000 times performance improvement in this
case, whee! It's fairly cool, and it looks like the performance gain will
grow as the number of extent records grows.
The patch is based on my 2 earlier patches, which optimized the truncate
code and fixed CoW handling when punching holes.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The original idea behind pulling ocfs2_find_cpos_for_left_leaf() out of
alloc.c was to benefit the hole-punching optimization patch. However, it
can also be used by other functions in the future that need to do the
same job.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Based on the previous patch optimizing truncate, the bugfix for
refcount trees when punching holes is fairly easy
and straightforward, since most of the work needed for
refcounting has already been done in ocfs2_remove_btree_range().
This patch performs CoW for refcounted extents when a hole is punched
whose start or end offset is in the middle of a cluster, which means
the cluster will soon be partially zeroed.
The patch has been tested and fixes the following bug:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1216
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Truncate is just a special case of punching holes (from the new i_size to
the end), so we can take advantage of the existing
ocfs2_remove_btree_range() to reduce the complexity and redundancy in
alloc.c. The goal here is to make truncate more generic and
straightforward.
Several functions only used by ocfs2_commit_truncate() will simply be
removed.
ocfs2_remove_btree_range() was originally used by the hole punching
code, which didn't take refcount trees into account (definitely a bug).
We therefore need to change that func a bit to handle refcount trees.
It must take the refcount lock, calculate and reserve blocks for
refcount tree changes, and decrease refcounts at the end. We replace
ocfs2_lock_allocators() here by adding a new func
ocfs2_reserve_blocks_for_rec_trunc() which accepts some extra blocks to
reserve. This will not hurt any other code using
ocfs2_remove_btree_range() (such as dir truncate and hole punching).
I merged the following steps into one patch since they are
logically doing one thing, though I know it makes the patch a little
fat to review.
1). Remove redundant code used by ocfs2_commit_truncate(), since we're
    moving to ocfs2_remove_btree_range anyway.
2). Add a new func ocfs2_reserve_blocks_for_rec_trunc() for the purpose of
    accepting some extra blocks to reserve.
3). Change ocfs2_prepare_refcount_change_for_del() a bit to fit our
    needs. It's safe to do this since it's only being called by
    truncate.
4). Change ocfs2_remove_btree_range() a bit to take the refcount case into
    account.
5). Finally, we change ocfs2_commit_truncate() to call
    ocfs2_remove_btree_range() in a proper way.
The patch has passed normal sanity checks. Stress tests
with heavier workloads are still to be done.
Based on this patch, fixing the punching holes bug will be fairly easy.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The alloc_init_file() first adds a file to the hash and then
initializes its fi_inode, fi_id and fi_had_conflict.
The uninitialized fi_inode could thus be erroneously checked by
the find_file(), so move the hash insertion lower.
The client_mutex should prevent this race in practice; however, we
eventually hope to make less use of the client_mutex, so the ordering
here is an accident waiting to happen.
I didn't find whether the same can be true for two other fields,
but common sense tells me it's better to initialize an object
before putting it into a global hash table :)
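
The ordering argued for above can be sketched in plain userspace C. This is
only an illustration: struct file_entry, file_hash and the field names are
hypothetical stand-ins, not the nfsd definitions, and the sketch is
single-threaded, so it shows the ordering only. Concurrent code still needs
a lock or barrier around the insertion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define FILE_HASH_SIZE 8

/* Hypothetical stand-in for the per-file state; not the nfsd structure. */
struct file_entry {
	struct file_entry *next;	/* hash chain link */
	unsigned long ino;		/* lookup key, playing the role of fi_inode */
	int id;				/* plays the role of fi_id */
	int had_conflict;		/* plays the role of fi_had_conflict */
};

static struct file_entry *file_hash[FILE_HASH_SIZE];

/* Walk one hash chain and compare keys, as a lookup would. */
static struct file_entry *find_file(unsigned long ino)
{
	struct file_entry *fe;

	for (fe = file_hash[ino % FILE_HASH_SIZE]; fe; fe = fe->next)
		if (fe->ino == ino)
			return fe;
	return NULL;
}

static struct file_entry *alloc_init_file(unsigned long ino, int id)
{
	struct file_entry *fe = malloc(sizeof(*fe));

	if (!fe)
		return NULL;

	/* Initialize every field first ... */
	fe->ino = ino;
	fe->id = id;
	fe->had_conflict = 0;

	/* ... and only then make the object reachable from the table. */
	fe->next = file_hash[ino % FILE_HASH_SIZE];
	file_hash[ino % FILE_HASH_SIZE] = fe;
	return fe;
}
```

With this ordering, anything find_file() can reach already has a valid key,
so a lookup never compares against uninitialized memory.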
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Note the position in the version array doesn't have to match the actual
rpc version number--to me it seems clearer to maintain the distinction.
Also document choice of rpc callback version number, as discussed in
e.g. http://www.ietf.org/mail-archive/web/nfsv4/current/msg07985.html
and followups.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback()
sched, wait: Use wrapper functions
sched: Remove a stale comment
ondemand: Make the iowait-is-busy time a sysfs tunable
ondemand: Solve a big performance issue by counting IOWAIT time as busy
sched: Intoduce get_cpu_iowait_time_us()
sched: Eliminate the ts->idle_lastupdate field
sched: Fold updating of the last_update_time_info into update_ts_time_stats()
sched: Update the idle statistics in get_cpu_idle_time_us()
sched: Introduce a function to update the idle statistics
sched: Add a comment to get_cpu_idle_time_us()
cpu_stop: add dummy implementation for UP
sched: Remove rq argument to the tracepoints
rcu: need barrier() in UP synchronize_sched_expedited()
sched: correctly place paranioa memory barriers in synchronize_sched_expedited()
sched: kill paranoia check in synchronize_sched_expedited()
sched: replace migration_thread with cpu_stop
stop_machine: reimplement using cpu_stop
cpu_stop: implement stop_cpu[s]()
sched: Fix select_idle_sibling() logic in select_task_rq_fair()
...
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (23 commits)
cifs: fix noserverino handling when unix extensions are enabled
cifs: don't update uniqueid in cifs_fattr_to_inode
cifs: always revalidate hardlinked inodes when using noserverino
[CIFS] drop quota operation stubs
cifs: propagate cifs_new_fileinfo() error back to the caller
cifs: add comments explaining cifs_new_fileinfo behavior
cifs: remove unused parameter from cifs_posix_open_inode_helper()
[CIFS] Remove unused cifs_oplock_cachep
cifs: have decode_negTokenInit set flags in server struct
cifs: break negotiate protocol calls out of cifs_setup_session
cifs: eliminate "first_time" parm to CIFS_SessSetup
[CIFS] Fix lease break for writes
cifs: save the dialect chosen by server
cifs: change && to ||
cifs: rename "extended_security" to "global_secflags"
cifs: move tcon find/create into separate function
cifs: move SMB session creation code into separate function
cifs: track local_nls in volume info
[CIFS] Allow null nd (as nfs server uses) on create
[CIFS] Fix losing locks during fork()
...
This is essential, as for the rados block device we'll need
to run in different contexts that would need flags that
are other than GFP_NOFS.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Specify max bytes in request to bound size of reply. Add associated
mount option with default value of 512 KB.
Signed-off-by: Sage Weil <sage@newdream.net>
Use kzalloc rather than the combination of kmalloc and memset.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression x,size,flags;
statement S;
@@
-x = kmalloc(size,flags);
+x = kzalloc(size,flags);
if (x == NULL) S
-memset(x, 0, size);
// </smpl>
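
For readers unfamiliar with the transformation, here is a userspace analogue
using malloc()/memset() and calloc() in place of kmalloc()/memset() and
kzalloc(). The function names are made up for the illustration; the kernel
calls also take a GFP flags argument, which has no equivalent here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Before: allocate, check for failure, then clear by hand. */
static void *alloc_zeroed_old(size_t size)
{
	void *x = malloc(size);

	if (x == NULL)
		return NULL;
	memset(x, 0, size);
	return x;
}

/* After: a single call that returns zeroed memory or NULL. */
static void *alloc_zeroed_new(size_t size)
{
	return calloc(1, size);
}
```

Both helpers return either NULL or a fully zeroed buffer; the second simply
cannot forget the memset().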
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
We invalidate I_COMPLETE and dentry leases in two places: on aborted mds
request and on request replay. Use common helper to avoid duplicate code.
Signed-off-by: Sage Weil <sage@newdream.net>
The remove_session_caps() helper is called when an MDS closes out our
session (either normally, or as a result of a failed reconnect), and when
we tear down state for umount. If we remove the last cap, and there are
no cap migrations in progress, then there is little hope of us flushing
out that data to the mds (without heroic efforts to reconnect and flush).
So, to avoid leaving inodes pinned (due to dirty state) and crashing after
umount, throw out dirty caps state and unpin the inodes. Print a warning
to the console so we know something was lost.
NOTE: Although we drop wrbuffer refs, we don't actually mark pages clean;
maybe a truncate should be queued?
Signed-off-by: Sage Weil <sage@newdream.net>
Currently, if our session is closed (due to a timeout, or explicit close,
or whatever), we just sit there doing nothing unless/until the MDS
restarts, at which point we try to reconnect.
Change client to attempt an immediate reconnect if our session is closed.
Note that currently the MDS doesn't support this, and our attempt will
fail. We'll get a session CLOSE, our caps and dirty cap state will be
dropped, and the client will be free to attempt to reconnect. That's
clearly not as nice as a successful reconnect, but it at least allows us
to try to carry on, and in the future the MDS will support a reconnect
and we will fare better.
Signed-off-by: Sage Weil <sage@newdream.net>
Pass a ceph_mds_session, since the caller has it.
Remove the dead code for sending empty reconnects. It used to be used
when the MDS contacted _us_ to solicit a reconnect, and we could reply
saying "go away, I have no session." Now we only send reconnects based
on the mds map, and only when we do in fact have an open session.
Signed-off-by: Sage Weil <sage@newdream.net>
We used to infer reconnect success by watching the MDS state, essentially
assuming that hearing nothing meant things were ok. That wasn't
particularly reliable. Instead, the MDS replies with an explicit OPEN
message to indicate success.
Strictly speaking, this is a protocol change, but it is a backwards
compatible one that does not break new clients + old servers or old
clients + new servers. At least not yet.
Drop unused @all argument from kick_requests while we're at it.
Signed-off-by: Sage Weil <sage@newdream.net>
On OPENING we shouldn't have any caps (or releases).
On CLOSING, we should wait until we succeed (and throw it all out), or
don't (and are OPEN again).
On RECONNECTING we can wait until we are OPEN.
Signed-off-by: Sage Weil <sage@newdream.net>
If the MDS restarts, the expire caps state is no longer shared, and can be
thrown out. Caps state will be rebuilt on the MDS during the reconnect
process that follows. Zero out any release messages and adjust the
release counter accordingly.
Signed-off-by: Sage Weil <sage@newdream.net>
This is being done so that we could reuse the statfs
infrastructure with other requests that return values.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The CEPH_FEATURE_NOSRCADDR protocol feature avoids putting the full source
address in each message header (twice). This patch switches the client to
the new scheme, and _requires_ this feature on the server. The server
will support both the old and new schemes. That means an old client will
work with a new server, but a new client will not work with an old server.
Signed-off-by: Sage Weil <sage@newdream.net>
The bdi_setup_and_register() helper doesn't help us since we bdi_init() in
create_client() and bdi_register() only when sget() succeeds.
Signed-off-by: Sage Weil <sage@newdream.net>
We want to assign an offset when the dentry goes from null to linked, which
is always done by splice_dentry(). Notably, we should NOT assign an
offset when a dentry is first created and is still null.
BUG if we try to splice a non-null dentry (we shouldn't).
Signed-off-by: Sage Weil <sage@newdream.net>
If the version hasn't changed, don't rebuild the index.
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If we use the xattr_blob, clear the pointer so we don't release the memory
at the bottom of the function.
Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Simplify messenger locking, and close race between ceph_con_close() setting
the CLOSED bit and con_work() checking the bit, then taking the mutex.
Signed-off-by: Sage Weil <sage@newdream.net>
d_obtain_alias() doesn't return NULL, it returns an ERR_PTR().
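
Why a NULL check misses this kind of failure can be shown with a small
userspace re-creation of the error-pointer helpers. The three helpers below
imitate the kernel's ERR_PTR(), PTR_ERR() and IS_ERR() for illustration
only, and obtain_object() is a hypothetical function standing in for a
callee that reports errors this way.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel helpers, for illustration only. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical callee that encodes its failure in the pointer value. */
static void *obtain_object(int fail)
{
	static int object;

	return fail ? ERR_PTR(-ENOMEM) : &object;
}
```

A caller that tests only `if (!ptr)` treats the failed result as valid,
because the encoded error is a non-NULL value; the test has to be IS_ERR().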
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
We only need to pass in front_len. Callers can attach any other payload
pieces (middle, data) as they see fit.
Signed-off-by: Sage Weil <sage@newdream.net>
Returning ERR_PTR(-ENOMEM) is useless extra work. Return NULL on failure
instead, and fix up the callers (about half of which were wrong anyway).
Signed-off-by: Sage Weil <sage@newdream.net>
Since we don't need to maintain large pools of messages, we can just
use the standard mempool_t. We maintain a msgpool 'wrapper' because we
need the mempool_t* in the alloc function, and mempool gives us only
pool_data.
Signed-off-by: Sage Weil <sage@newdream.net>
ceph_sb_to_client and ceph_client are identical, so we need to drop
one. The function name ceph_client is easily confused with "struct
ceph_client", while ceph_sb_to_client's name is clearer, so switch all
callers to ceph_sb_to_client.
-static inline struct ceph_client *ceph_client(struct super_block *sb)
-{
- return sb->s_fs_info;
-}
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
This would only trigger if we bailed out before resetting r_con_filling_msg
because the server reply was corrupt (oversized).
Signed-off-by: Sage Weil <sage@newdream.net>
"xattr" is never NULL here. We took care of that in the previous
if statement block.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Following Nick Piggin's patches in btrfs, pagecache pages should be
allocated with __page_cache_alloc, so they obey pagecache memory
policies.
Also, use add_to_page_cache_lru instead of a private
pagevec where applicable.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
The uniqueid field sent by the server when unix extensions are enabled
is currently used sometimes when it shouldn't be. The readdir codepath
is correct, but most others are not. Fix it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We use this value to find an inode within the hash bucket, so we can't
change this without re-hashing the inode. For now, treat this value
as immutable.
Eventually, we should probably use an inode number change on a path
based operation to indicate that the lookup cache is invalid, but that's
a bit more code to deal with.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The old cifs_revalidate logic always revalidated hardlinked inodes.
This hack allowed CIFS to pass some connectathon tests when server inode
numbers aren't used (basic test7, in particular).
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Sparse does not like inline functions declared without a body,
because that is not part of standard kernel practice.
The xattr_handler tables can be declared static.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Sparse detected that an unsigned pointer was being passed as an int pointer.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
[fixed up to deal with code refactoring]
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Add new extended inode types that store the xattr_id field.
Also add the necessary code changes to make xattrs visible.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
This patch adds support for mapping xattr ids (stored in inodes)
into the on-disk location of the xattrs themselves.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
If we abort a request, we return to caller, but the request may still
complete. And if we hold the dir FILE_EXCL bit, we may not release a
lease when sending a request. A simple un-tar, control-c, un-tar again
will reproduce the bug (manifested as a 'Cannot open: File exists').
Ensure we invalidate affected dentry leases (as well as dir I_COMPLETE) so
we don't have valid (but incorrect) leases. Do the same, consistently, at
other sites where I_COMPLETE is similarly cleared.
Signed-off-by: Sage Weil <sage@newdream.net>
When we abort requests we need to prevent fill_trace et al from doing
anything that relies on locks held by the VFS caller. This fixes a race
between the reply handler and the abort code, ensuring that we continue
holding the dir mutex until the reply handler completes.
Signed-off-by: Sage Weil <sage@newdream.net>
We would occasionally BUG out in the reply handler because r_reply was
nonzero, due to a race with ceph_mdsc_do_request temporarily setting
r_reply to an ERR_PTR value. This is unnecessary, messy, and also wrong
in the EIO case.
Clean up by consistently using r_err for errors and r_reply for messages.
Also fix the abort logic to trigger consistently for all errors that return
to the caller early (e.g., EIO from timeout case). If an abort races with
a reply, use the result from the reply.
Also fix locking for r_err, r_reply update in the reply handler.
Signed-off-by: Sage Weil <sage@newdream.net>
Add a new ext4 state to tell us when a file has been newly created; use
that state in ext4_sync_file in no-journal mode to tell us when we need
to sync the parent directory as well as the inode and data itself. This
fixes a problem in which a panic or power failure may lose the entire
file even when using fsync, since the parent directory entry is lost.
Addresses-Google-Bug: #2480057
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Filesystems with delalloc support may dirty the inode during writepages.
As a result the inode will have dirty metadata flags even after write_inode.
In fact we have two dedicated functions for proper data and metadata
writeback. It is reasonable to separate flags updates in two stages.
https://bugzilla.kernel.org/show_bug.cgi?id=15906
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
writeback to kick off writeback of pending dirty inodes, then follow
that up with a WB_SYNC_ALL to wait for it. Since umount already holds
the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
since WB_SYNC_ALL writeback is a data integrity operation and thus
a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems
it's a lot slower.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Prior to 2.6.32, setting /proc/sys/vm/dirty_writeback_centisecs to zero
disabled periodic dirty writeback from kupdate. This got broken and now causes
excessive sys CPU usage if set to zero, as we'll keep beating on
schedule().
Cc: stable@kernel.org
Reported-by: Justin Maggard <jmaggard10@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For kmap_atomic() we call kunmap_atomic() on the returned pointer.
That's different from kmap() and kunmap() and so it's easy to get them
backwards.
Cc: Stable <stable@kernel.org>
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
All vectors of address_space_operations should be initialized
by the filesystem. Add the missing parts.
This is actually an optimization, by using
__set_page_dirty_nobuffers. The default, in case of NULL,
would be __set_page_dirty_buffers which has these extra if(s).
.releasepage && .invalidatepage should both not be called
because page_private() is NULL in exofs. Put a WARN_ON if
they are called, to indicate the Kernel has changed in this
regard, if and when it does.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
struct ext4_new_group_input needs to be converted because u64 has
only 32-bit alignment on some 32-bit architectures, notably i386.
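
The layout difference can be made concrete with a small sketch. The type
and struct names below are invented for the illustration, the field list is
an assumption modelled on the structure named above, and the reduced
alignment relies on the GCC/Clang aligned attribute applied to a typedef.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 64-bit integer with only 4-byte alignment, as on i386 (GCC/Clang). */
typedef uint64_t __attribute__((aligned(4))) compat_u64_t;

/* Layout seen by a 64-bit kernel: uint64_t is naturally aligned. */
struct group_input_native {
	uint32_t group;
	uint64_t block_bitmap;
	uint64_t inode_bitmap;
	uint64_t inode_table;
	uint32_t blocks_count;
	uint16_t reserved_blocks;
	uint16_t unused;
};

/* Layout produced by 32-bit i386 userspace: no padding after 'group'. */
struct group_input_compat {
	uint32_t group;
	compat_u64_t block_bitmap;
	compat_u64_t inode_bitmap;
	compat_u64_t inode_table;
	uint32_t blocks_count;
	uint16_t reserved_blocks;
	uint16_t unused;
};

/* Field-by-field conversion from the 32-bit layout to the native one. */
static void group_input_from_compat(struct group_input_native *dst,
				    const struct group_input_compat *src)
{
	dst->group = src->group;
	dst->block_bitmap = src->block_bitmap;
	dst->inode_bitmap = src->inode_bitmap;
	dst->inode_table = src->inode_table;
	dst->blocks_count = src->blocks_count;
	dst->reserved_blocks = src->reserved_blocks;
	dst->unused = src->unused;
}
```

Because the two layouts place block_bitmap at different offsets on a 64-bit
build, the compat path cannot reinterpret the user buffer in place and has
to copy each field across.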
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
It is unnecessary, and in general impossible, to define the compat
ioctl numbers except when building the filesystem with CONFIG_COMPAT
defined.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If i_data_sem was internally dropped due to a transaction restart, it is
necessary to restart the path look-up because the extent tree may have
been modified by ext4_get_block().
https://bugzilla.kernel.org/show_bug.cgi?id=15827
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Dmitry Monakhov discovered an edge case where it was possible for the
EXT4_EOFBLOCKS_FL flag to be cleared unnecessarily. This is true;
I have a test case that can be exercised via downloading and
decompressing the file:
wget ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-testcases/eofblocks-fl-test-case.img.bz2
bunzip2 eofblocks-fl-test-case.img
dd if=/dev/zero of=eofblocks-fl-test-case.img bs=1k seek=17925 bs=1k count=1 conv=notrunc
However, triggering it in real life is highly unlikely since it
requires an extremely fragmented sparse file with a hole in exactly
the right place in the extent tree. (It actually took quite a bit of
work to generate this test case.) Still, it's nice to get even
extreme corner cases to be correct, so this patch makes sure that we
don't clear the EXT4_EOFBLOCKS_FL incorrectly even in this corner
case.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Convert ncp_ioctl to an unlocked_ioctl and push down the bkl into it.
Signed-off-by: John Kacur <jkacur@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Petr Vandrovec <vandrove@vc.cvut.cz>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Convert coda_pioctl to an unlocked_ioctl pushing down the BKL
into it.
Signed-off-by: John Kacur <jkacur@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Converting from ->ioctl to ->unlocked_ioctl with explicit
lock_kernel lets us kill the ioctl operation.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[fixed inode reference in smb_ioctl]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
The ioctl function returns constant results, so it obviously
does not need the BKL and can be converted to unlocked_ioctl.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
HFS is one of the remaining users of the ->ioctl function, convert it
blindly to unlocked_ioctl by pushing down the BKL.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
If the EOFBLOCK_FL flag is set when it should not be and the inode is
zero length, then eh_entries is zero, and ex is NULL, so dereferencing
ex to print ex->ee_block causes a kernel OOPS in
ext4_ext_map_blocks().
On top of that, the error message which is printed isn't very helpful.
So we fix this by printing something more explanatory which doesn't
involve trying to print ex->ee_block.
Addresses-Google-Bug: #2655740
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
At several places we modify EXT4_I(inode)->i_flags without holding
i_mutex (ext4_do_update_inode, ...). These modifications are racy and
we can lose updates to i_flags. So convert handling of i_flags to use
bitops which are atomic.
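
A userspace sketch of the same idea, using C11 atomics instead of the
kernel's set_bit()/clear_bit()/test_bit(). The structure, helper names and
flag numbers are hypothetical and do not correspond to the ext4 values.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit numbers; not the ext4 values. */
enum { FLAG_SYNC = 0, FLAG_EOFBLOCKS = 1 };

struct inode_info {
	atomic_ulong flags;
};

/* Atomic read-modify-write: a concurrent update to another bit survives. */
static void inode_set_flag(struct inode_info *ei, int bit)
{
	atomic_fetch_or(&ei->flags, 1UL << bit);
}

static void inode_clear_flag(struct inode_info *ei, int bit)
{
	atomic_fetch_and(&ei->flags, ~(1UL << bit));
}

static bool inode_test_flag(struct inode_info *ei, int bit)
{
	return (atomic_load(&ei->flags) >> bit) & 1UL;
}
```

A plain `flags |= mask` is a separate load, OR and store, so two unlocked
writers can overwrite each other's bit; the atomic forms perform each
update as one indivisible step.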
https://bugzilla.kernel.org/show_bug.cgi?id=15792
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There are no more users of procfs that implement the ioctl
callback. Drop the bkl from this path and warn on any use
of this callback.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
EXT4_ERROR_INODE() tends to provide better error information and in a
more consistent format. Some errors were not even identifying the inode
or directory which was corrupted, which made them not very useful.
Addresses-Google-Bug: #2507977
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This saves a huge amount of stack space by avoiding unnecessary struct
buffer_head allocations on the stack.
In addition, to make the code easier to understand, collapse and
refactor ext4_get_block(), ext4_get_block_write(),
noalloc_get_block_write(), into a single function.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
FAT does not require the BKL in its ioctl function, which is already serialized
through a mutex. Since we're already touching the ioctl code, also fix the
missing handling of FAT_IOCTL_GET_ATTRIBUTES in the compat code.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Jack up ext4_get_blocks() and add a new function, ext4_map_blocks()
which uses a much smaller structure, struct ext4_map_blocks which is
20 bytes, as opposed to a struct buffer_head, which is nearly 5 times
bigger on an x86_64 machine. By switching callers to
ext4_map_blocks(), we save stack space, since we can avoid
allocating a struct buffer_head on the stack.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Make a copy of write_cache_pages() for the benefit of
ext4_da_writepages(). This allows us to simplify the code some, and
will allow us to further customize the code in future patches.
There are some nasty hacks in write_cache_pages(), which Linus has
(correctly) characterized as vile. I've just copied it into
write_cache_pages_da(), without trying to clean those bits up lest I
break something in the ext4's delalloc implementation, which is a bit
fragile right now. This will allow Dave Chinner to clean up
write_cache_pages() in mm/page-writeback.c, without worrying about
breaking ext4. Eventually write_cache_pages_da() will go away when I
rewrite ext4's delayed allocation and create a general
ext4_writepages() which is used for all of ext4's writeback. Until
then this is the lowest risk way to clean up the core
write_cache_pages() function.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
We failed to show journal_checksum option in /proc/mounts. Fix it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fix ext4_mb_collect_stats() to use the correct test for s_bal_success; it
should be testing "best-extent.fe_len >= orig-extent.fe_len" , not
"orig-extent.fe_len >= goal-extent.fe_len" .
Signed-off-by: Curt Wohlgemuth <curtw@google.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This adds a new field in ext4_group_info to cache the largest available
block range in a block group; and don't load the buddy pages until *after*
we've done a sanity check on the block group.
With large allocation requests (e.g., fallocate(), 8MiB) and relatively full
partitions, it's easy to have no block groups with a block extent large
enough to satisfy the input request length. This currently causes the loop
during cr == 0 in ext4_mb_regular_allocator() to load the buddy bitmap pages
for EVERY block group. That can be a lot of pages. The patch below allows
us to call ext4_mb_good_group() BEFORE we load the buddy pages (although we
have to check again after we lock the block group).
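
The check-before-load pattern can be sketched as follows. Everything here
is a simplified, hypothetical model: struct group_info, load_buddy() and
find_group() are illustrative names, and a counter stands in for the cost
of reading the buddy bitmap pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group summary; only the cached field matters here. */
struct group_info {
	unsigned int largest_free;	/* cached largest free extent, in blocks */
	unsigned int buddy_loads;	/* counts simulated expensive loads */
};

/* Cheap check that needs only the in-memory summary. */
static bool group_is_good(const struct group_info *grp, unsigned int want)
{
	return grp->largest_free >= want;
}

/* Stand-in for reading the buddy bitmap pages. */
static void load_buddy(struct group_info *grp)
{
	grp->buddy_loads++;
}

/* Returns the index of the first usable group, or -1 if none qualifies. */
static int find_group(struct group_info *groups, size_t ngroups,
		      unsigned int want)
{
	for (size_t i = 0; i < ngroups; i++) {
		if (!group_is_good(&groups[i], want))
			continue;	/* skipped without loading anything */
		load_buddy(&groups[i]);
		/* Re-check, mirroring the second test done under the lock. */
		if (group_is_good(&groups[i], want))
			return (int)i;
	}
	return -1;
}
```

With a request that no group can satisfy, the loop finishes without a
single simulated load, which is the saving the commit describes.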
Addresses-Google-Bug: #2578108
Addresses-Google-Bug: #2704453
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently using posix_fallocate one can bypass an RLIMIT_FSIZE limit
and create a file larger than the limit. Add a check for that.
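
The kind of check being added can be sketched in userspace as below. The
helper names are hypothetical, and returning -EFBIG is an assumption made
for the illustration rather than something stated in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/resource.h>

/* Returns 0 if [offset, offset + len) fits under limit, else -EFBIG. */
static int check_fsize_limit(uint64_t offset, uint64_t len, uint64_t limit)
{
	if (len > UINT64_MAX - offset)
		return -EFBIG;		/* offset + len would overflow */
	if (offset + len > limit)
		return -EFBIG;
	return 0;
}

/* Apply the check against the calling process's RLIMIT_FSIZE soft limit. */
static int check_against_rlimit(uint64_t offset, uint64_t len)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_FSIZE, &rl) != 0)
		return -errno;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 0;		/* no limit configured */
	return check_fsize_limit(offset, len, (uint64_t)rl.rlim_cur);
}
```

The point is that the end of the preallocated range, offset + len, is
compared against the limit before any blocks are allocated.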
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Amit Arora <aarora@in.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>