Currently only root can trigger a referral automount because only root
can access rpc_pipefs directories. Enabling read access to non-root
should be harmless (they can still not access the pipes themselves)
and will suffice to fix this problem.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The tk_pid field is an unsigned short. The proper print format specifier for
that type is %5u, not %4d.
Also clean up some miscellaneous print formatting nits.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The tk_pid field is an unsigned short. The proper print format specifier for
that type is %5u, not %4d.
Also clean up some miscellaneous print formatting nits.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The error values are already propagated through task->tk_status, and
none of the callers check one without checking the other, so we can
drop the return value.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also remove {NFSD,RPC}_PARANOIA as having the defines doesn't really add
anything.
The printks covered by RPC_PARANOIA were triggered by badly formatted
packets and so should be ratelimited.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFSd assumes that largest number of pages that will be needed for a
request+response is 2+N where N pages is the size of the largest permitted
read/write request. The '2' are 1 for the non-data part of the request, and 1
for the non-data part of the reply.
However, when a read request is not page-aligned, and we choose to use
->sendfile to send it directly from the page cache, we may need N+1 pages to
hold the whole reply. This can overflow and array and cause an Oops.
This patch increases size of the array for holding pages by one and makes sure
that entry is NULL when it is not in use.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to silly typos, if the nfs versions are explicitly set, no NFSACL versions
get enabled.
Also improve an error message that would have made this bug a little easier to
find.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the Oops in http://bugzilla.linux-nfs.org/show_bug.cgi?id=138
We shouldn't be calling rpc_release_task() for tasks that are not active.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Return error and prevent from loading module when gss_mech_register()
failed.
Cc: Andy Adamson <andros@citi.umich.edu>
Cc: J. Bruce Fields <bfields@citi.umich.edu>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There's no point deferring something just to immediately fail the deferral,
especially now that we can do something more useful in the failure case by
returning an error.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To avoid tying up server threads when nfsd makes an upcall (to mountd, to get
export options, to idmapd, for nfsv4 name<->id mapping, etc.), we temporarily
"drop" the request and save enough information so that we can revisit it
later.
Certain failures during the deferral process can cause us to really drop the
request and never revisit it.
This is often less than ideal, and is unacceptable in the NFSv4 case--rfc 3530
forbids the server from dropping a request without also closing the
connection.
As a first step, we modify the deferral code to return -ETIMEDOUT (which is
translated to nfserr_jukebox in the v3 and v4 cases, and remains a drop in the
v2 case).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The memory leak here is embarassingly obvious.
This fixes a problem that causes the kernel to leak a small amount of memory
every time it receives a integrity-protected request.
Thanks to Aim Le Rouzic for the bug report.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All kcalloc() calls of the form "kcalloc(1,...)" are converted to the
equivalent kzalloc() calls, and a few kcalloc() calls with the incorrect
ordering of the first two arguments are fixed.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Name some of the remaning 'old_style_spin_init' locks
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Stick NFS sockets in their own class to avoid some lockdep warnings. NFS
sockets are never exposed to user-space, and will hence not trigger certain
code paths that would otherwise pose deadlock scenarios.
[akpm@osdl.org: cleanups]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Dickson <SteveD@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Acked-by: Neil Brown <neilb@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
[ Fixed patch corruption by quilt, pointed out by Peter Zijlstra ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.
[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:
#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#
set -e
for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done
The script was run like this
sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_KERNEL is an alias of GFP_KERNEL.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These appear to be deprecated. Removing them also gets rid of some sparse
noise.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up:
The RPC client currently creates some sysctls that are specific to the
socket transport. Move those entirely into xprtsock.c.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Over time we will want to add some specific init and cleanup logic for the
xprtsock implementation. Add stub routines for initialization and exit
processing.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up: hch suggested that the RPC client shouldn't pollute the name
space used by the generic skb manipulation routines in net/core/skbuff.c.
Rename a couple of types in xdr.h to adhere to this convention.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up: eliminate xs_tcp_copy_data -- it's exactly the same logic as the
common routine skb_read_bits. The UDP and TCP socket read code now share
the same routine for copying data into an xdr_buf.
Now that skb_read_bits() is exported, rename it to avoid confusing it with
a generic skb_* function. As these functions are XDR-specific, they should
not have names that suggest they are of generic use. Also rename
skb_read_and_csum_bits() to be consistent.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For now we will assume that all transports will use the address format
buffers in the rpc_xprt struct to store their addresses. Change
rpc_peer2str() to be a generic routine to handle this, and get rid of the
print_address() op in the rpc_xprt_ops vector.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the three fields for saving socket callback functions out of the
rpc_xprt structure and into a private data structure maintained in
net/sunrpc/xprtsock.c.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the socket-specific buffer size parameters for UDP sockets to a
private data structure maintained in net/sunrpc/xprtsock.c.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the socket-specific connection management fields out of the generic
rpc_xprt structure into a private data structure maintained in
net/sunrpc/xprtsock.c.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move "XPRT_LAST_FRAG" and friends from xprt.h into xprtsock.c, and rename
them to use the naming scheme in use in xprtsock.c.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the TCP receive state variables from the generic rpc_xprt structure to
a private structure maintained inside net/sunrpc/xprtsock.c.
Also rename a function/variable pair to refer to RPC fragment headers
instead of record markers, to be consistent with types defined in
sunrpc/*.h.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The "sock" and "inet" fields are socket-specific. Move them to a private
data structure maintained entirely within net/sunrpc/xprtsock.c
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When setting up a new transport instance, allocate enough memory for an
rpc_xprt and a private area. As part of the same memory allocation, it
will be easy to find one, given a pointer to the other.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're currently not actually using seed or seed_init.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The sealalg is checked in several places, giving the impression it could be
either SEAL_ALG_NONE or SEAL_ALG_DES. But in fact SEAL_ALG_NONE seems to
be sufficient only for making mic's, and all the contexts we get must be
capable of wrapping as well. So the sealalg must be SEAL_ALG_DES. As
with signalg, just check for the right value on the downcall and ignore it
otherwise. Similarly, tighten expectations for the sealalg on incoming
tokens, in case we do support other values eventually.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Remove some unnecessary goto labels; clean up some return values; etc.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're doing some pointless translation between krb5 constants and kernel
crypto string names.
Also clean up some related spkm3 code as necessary.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Previous changes reveal some obvious cruft.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We also only ever receive one value of the signalg, so let's not pretend
otherwise
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We designed the krb5 context import without completely understanding the
context. Now it's clear that there are a number of fields that we ignore,
or that we depend on having one single value.
In particular, we only support one value of signalg currently; so let's
check the signalg field in the downcall (in case we decide there's
something else we could support here eventually), but ignore it otherwise.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This updates the spkm3 code to bring it up to date with our current
understanding of the spkm3 spec.
In doing so, we're changing the downcall format used by gssd in the spkm3 case,
which will cause an incompatilibity with old userland spkm3 support. Since the
old code a) didn't implement the protocol correctly, and b) was never
distributed except in the form of some experimental patches from the citi web
site, we're assuming this is OK.
We do detect the old downcall format and print warning (and fail). We also
include a version number in the new downcall format, to be used in the
future in case any further change is required.
In some more detail:
- fix integrity support
- removed dependency on NIDs. instead OIDs are used
- known OID values for algorithms added.
- fixed some context fields and types
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since process_xdr_buf() is useful outside of the kerberos-specific code, we
move it to net/sunrpc/xdr.c, export it, and rename it in keeping with xdr_*
naming convention of xdr.c.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This code is never called from interrupt context; it's always run by either
a user thread or rpciod. So KM_SKB_SUNRPC_DATA is inappropriate here.
Thanks to Aimé Le Rouzic for capturing an oops which showed the kernel
taking an interrupt while we were in this piece of code, resulting in a
nested kmap_atomic(.,KM_SKB_SUNRPC_DATA) call from
xdr_partial_copy_from_skb().
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Dumping all this data to the logs is wasteful (even when debugging is turned
off), and creates too much output to be useful when it's turned on.
Fix a minor style bug or two while we're at it.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't wake up bind waiters if a task finds that another task is already
trying to bind.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Change the location where the rpc_xprt structure is allocated so each
transport implementation can allocate a private area from the same
chunk of memory.
Note also that xprt->ops->destroy, rather than xprt_destroy, is now
responsible for freeing rpc_xprt when the transport is destroyed.
Test plan:
Connectathon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All internal RPC client operations should no longer depend on the BKL,
however lockd and NFS callbacks may still require it.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use RCU to ensure that we can safely call rpc_finish_wakeup after we've
called __rpc_do_wake_up_task. If not, there is a theoretical race, in which
the rpc_task finishes executing, and gets freed first.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The sunrpc scheduler contains a race condition that can let an RPC
task end up being neither running nor on any wait queue. The race takes
place between rpc_make_runnable (called from rpc_wake_up_task) and
__rpc_execute under the following condition:
First __rpc_execute calls tk_action which puts the task on some wait
queue. The task is dequeued by another process before __rpc_execute
continues its execution. While executing rpc_make_runnable exactly after
setting the task `running' bit and before clearing the `queued' bit
__rpc_execute picks up execution, clears `running' and subsequently
both functions fall through, both under the false assumption somebody
else took the job.
Swapping rpc_test_and_set_running with rpc_clear_queued in
rpc_make_runnable fixes that hole. This introduces another possible
race condition that can be handled by checking for `queued' after
setting the `running' bit.
Bug noticed on a 4-way x86_64 system under XEN with an NFSv4 server
on the same physical machine, apparently one of the few ways to hit
this race condition at all.
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Christophe Saout <christophe@saout.de>
Signed-off-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Conflicts:
drivers/infiniband/core/iwcm.c
drivers/net/chelsio/cxgb2.c
drivers/net/wireless/bcm43xx/bcm43xx_main.c
drivers/net/wireless/prism54/islpci_eth.c
drivers/usb/core/hub.h
drivers/usb/input/hid-core.c
net/core/netpoll.c
Fix up merge failures with Linus's head and fix new compilation failures.
Signed-Off-By: David Howells <dhowells@redhat.com>
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.
For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.
To make this work, an extra flag is introduced into the management side of the
work_struct. This governs auto-release of the structure upon execution.
Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function. This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated.. This is a
problem if the auxiliary data is taken away (as done by the last patch).
However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems. But then the work function must itself release the
work_struct by calling work_release().
In most cases, automatic release is fine, so this is the default. Special
initiators exist for the non-auto-release case (ending in _NAR).
Signed-Off-By: David Howells <dhowells@redhat.com>
Separate delayable work items from non-delayable work items be splitting them
into a separate structure (delayed_work), which incorporates a work_struct and
the timer_list removed from work_struct.
The work_struct struct is huge, and this limits it's usefulness. On a 64-bit
architecture it's nearly 100 bytes in size. This reduces that by half for the
non-delayable type of event.
Signed-Off-By: David Howells <dhowells@redhat.com>
auth_domain_put() forgot to unlock acquired spinlock.
Cc: Olaf Kirch <okir@monad.swb.de>
Cc: Andy Adamson <andros@citi.umich.edu>
Cc: J. Bruce Fields <bfields@citi.umich.edu>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A recent patch fixed a problem which would occur when the refcount on an
auth_domain reached zero. This problem has not been reported in practice
despite existing in two major kernel releases because the refcount can
never reach zero.
This patch fixes the problems that stop the refcount reaching zero.
1/ We were adding to the refcount when inserting in the hash table,
but only removing from the hashtable when the refcount reached zero.
Obviously it never would. So don't count the implied reference of
being in the hash table.
2/ There are two paths on which a socket can be destroyed. One called
svcauth_unix_info_release(). The other didn't. So when the other was
taken, we can lose a reference to an ip_map which in-turn holds a
reference to an auth_domain
So unify the exit paths into svc_sock_put. This highlights the fact
that svc_delete_socket has slightly odd semantics - it does not drop
a reference but probably should. Fixing this need a bit more
thought and testing.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is suitable for just about any 2.6 kernel. It should go in
2.6.19 and 2.6.18.2 and possible even the .17 and .16 stable series.
This is a long standing bug that seems to have only recently become
apparent, presumably due to increasing use of NFS over TCP - many
distros seem to be making it the default.
The SK_CONN bit gets set when a listening socket may be ready
for an accept, just as SK_DATA is set when data may be available.
It is entirely possible for svc_tcp_accept to be called with neither
of these set. It doesn't happen often but there is a small race in
svc_sock_enqueue as SK_CONN and SK_DATA are tested outside the
spin_lock. They could be cleared immediately after the test and
before the lock is gained.
This normally shouldn't be a problem. The sockets are non-blocking so
trying to read() or accept() when ther is nothing to do is not a problem.
However: svc_tcp_recvfrom makes the decision "Should I accept() or
should I read()" based on whether SK_CONN is set or not. This usually
works but is not safe. The decision should be based on whether it is
a TCP_LISTEN socket or a TCP_CONNECTED socket.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Adrian Bunk <bunk@stusta.de>
Cc: <stable@kernel.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Yes, this actually passed tests the way it was.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When submitting a request to a fast portmapper (such as the local rpcbind
daemon), the request can complete before the parent task is even queued up on
xprt->binding. Fix this by queuing before submitting the rpcbind request.
Test plan:
Connectathon locking test with UDP.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is possible for the ->fopen callback from lockd into nfsd to find that an
answer cannot be given straight away (an upcall is needed) and so the request
has to be 'dropped', to be retried later. That error status is not currently
propagated back.
So:
Change nlm_fopen to return nlm error codes (rather than a private
protocol) and define a new nlm_drop_reply code.
Cause nlm_drop_reply to cause the rpc request to get rpc_drop_reply
when this error comes back.
Cause svc_process to drop a request which returns a status of
rpc_drop_reply.
[akpm@osdl.org: fix warning storm]
Cc: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is some confusion about the meaning of 'bufsz' for a sunrpc server.
In some cases it is the largest message that can be sent or received. In
other cases it is the largest 'payload' that can be included in a NFS
message.
In either case, it is not possible for both the request and the reply to be
this large. One of the request or reply may only be one page long, which
fits nicely with NFS.
So we remove 'bufsz' and replace it with two numbers: 'max_payload' and
'max_mesg'. Max_payload is the size that the server requests. It is used
by the server to check the max size allowed on a particular connection:
depending on the protocol a lower limit might be used.
max_mesg is the largest single message that can be sent or received. It is
calculated as the max_payload, rounded up to a multiple of PAGE_SIZE, and
with PAGE_SIZE added to overhead. Only one of the request and reply may be
this size. The other must be at most one page.
Cc: Greg Banks <gnb@sgi.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The rpc reply has multiple levels of error returns. The code here contributes
to the confusion by using "accept_statp" for a pointer to what the rfc (and
wireshark, etc.) refer to as the "reply_stat". (The confusion is compounded
by the fact that the rfc also has an "accept_stat" which follows the
reply_stat in the succesful case.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the request is denied after gss_accept was called, we shouldn't try to wrap
the reply. We were checking the accept_stat but not the reply_stat.
To check the reply_stat in _release, we need a pointer to before (rather than
after) the verifier, so modify body_start appropriately.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Factor out some common code from the integrity and privacy cases.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The NFSACL patches introduced support for multiple RPC services listening on
the same transport. However, only the first of these services was registered
with portmapper. This was perfectly fine for nfsacl, as you traditionally do
not want these to show up in a portmapper listing.
The patch below changes the default behavior to always register all services
listening on a given transport, but retains the old behavior for nfsacl
services.
Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Speed up high call-rate workloads by caching the struct ip_map for the peer on
the connected struct svc_sock instead of looking it up in the ip_map cache
hashtable on every call. This helps workloads using AUTH_SYS authentication
over TCP.
Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480 regular files
in 10841 directories) on the server. That tree is small enough to fill in the
server's RAM so no disk traffic was involved. This setup gives a sustained
call rate in excess of 60000 calls/sec before being CPU-bound on the server.
Profiling showed strcmp(), called from ip_map_match(), was taking 4.8% of each
CPU, and ip_map_lookup() was taking 2.9%. This patch drops both contribution
into the profile noise.
Note that the above result overstates this value of this patch for most
workloads. The synthetic clients are all using separate IP addresses, so
there are 64 entries in the ip_map cache hash. Because the kernel measured
contained the bug fixed in commit
commit 1f1e030bf7
and was running on 64bit little-endian machine, probably all of those 64
entries were on a single chain, thus increasing the cost of ip_map_lookup().
With a modern kernel you would need more clients to see the same amount of
performance improvement. This patch has helped to scale knfsd to handle a
deployment with 2000 NFS clients.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The limit over UDP remains at 32K. Also, make some of the apparently
arbitrary sizing constants clearer.
The biggest change here involves replacing NFSSVC_MAXBLKSIZE by a function of
the rqstp. This allows it to be different for different protocols (udp/tcp)
and also allows it to depend on the servers declared sv_bufsiz.
Note that we don't actually increase sv_bufsz for nfs yet. That comes next.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
.. by allocating the array of 'kvec' in 'struct svc_rqst'.
As we plan to increase RPCSVC_MAXPAGES from 8 upto 256, we can no longer
allocate an array of this size on the stack. So we allocate it in 'struct
svc_rqst'.
However svc_rqst contains (indirectly) an array of the same type and size
(actually several, but they are in a union). So rather than waste space, we
move those arrays out of the separately allocated union and into svc_rqst to
share with the kvec moved out of svc_tcp_recvfrom (various arrays are used at
different times, so there is no conflict).
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We are planning to increase RPCSVC_MAXPAGES from about 8 to about 256. This
means we need to be a bit careful about arrays of size RPCSVC_MAXPAGES.
struct svc_rqst contains two such arrays. However the there are never more
that RPCSVC_MAXPAGES pages in the two arrays together, so only one array is
needed.
The two arrays are for the pages holding the request, and the pages holding
the reply. Instead of two arrays, we can simply keep an index into where the
first reply page is.
This patch also removes a number of small inline functions that probably
server to obscure what is going on rather than clarify it, and opencode the
needed functionality.
Also remove the 'rq_restailpage' variable as it is *always* 0. i.e. if the
response 'xdr' structure has a non-empty tail it is always in the same pages
as the head.
check counters are initilised and incr properly
check for consistant usage of ++ etc
maybe extra some inlines for common approach
general review
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Magnus Maatta <novell@kiruna.se>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Arrgg.. We cannot 'lockd_up' before 'svc_addsock' as we don't know the
protocol yet.... So switch it around again and save the name of the created
sockets so that it can be closed if lock_up fails.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The refcount that nfsd holds on lockd is based on the number of open sockets.
So when we close a socket, we should decrement the ref (with lockd_down).
Currently when a socket is closed via writing to the portlist file, that
doesn't happen.
So: make sure we get an error return if the socket that was requested does is
not found, and call lockd_down if it was.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Many files include the filename at the beginning, serveral used a wrong one.
Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Replace references to system_utsname to the per-process uts namespace
where appropriate. This includes things like uname.
Changes: Per Eric Biederman's comments, use the per-process uts namespace
for ELF_PLATFORM, sunrpc, and parts of net/ipv4/ipconfig.c
[jdike@addtoit.com: UML fix]
[clg@fr.ibm.com: cleanup]
[akpm@osdl.org: build fix]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Actually implement multiple pools. On NUMA machines, allocate a svc_pool per
NUMA node; on SMP a svc_pool per CPU; otherwise a single global pool. Enqueue
sockets on the svc_pool corresponding to the CPU on which the socket bh is run
(i.e. the NIC interrupt CPU). Threads have their cpu mask set to limit them
to the CPUs in the svc_pool that owns them.
This is the patch that allows an Altix to scale NFS traffic linearly
beyond 4 CPUs and 4 NICs.
Incorporates changes and feedback from Neil Brown, Trond Myklebust, and
Christoph Hellwig.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently knfsd keeps its own list of all nfsd threads in nfssvc.c; add a new
way of managing the list of all threads in a svc_serv. Add
svc_create_pooled() to allow creation of a svc_serv whose threads are managed
by the sunrpc code. Add svc_set_num_threads() to manage the number of threads
in a service, either per-pool or globally across the service.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Split out the list of idle threads and pending sockets from svc_serv into a
new svc_pool structure, and allocate a fixed number (in this patch, 1) of
pools per svc_serv. The new structure contains a lock which takes over
several of the duties of svc_serv->sv_lock, which is now relegated to
protecting only sv_tempsocks, sv_permsocks, and sv_tmpcnt in svc_serv.
The point is to move the hottest fields out of svc_serv and into svc_pool,
allowing a following patch to arrange for a svc_pool per NUMA node or per CPU.
This is a major step towards making the NFS server NUMA-friendly.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The SK_BUSY bit in svc_sock->sk_flags ensures that we do not attempt to
enqueue a socket twice. Currently, setting and clearing the bit is protected
by svc_serv->sv_lock. As I intend to reduce the data that the lock protects
so it's not held when svc_sock_enqueue() tests and sets SK_BUSY, that test and
set needs to be atomic.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Convert the svc_sock->sk_reserved variable from an int protected by
svc_serv->sv_lock, to an atomic. This reduces (by 1) the number of places we
need to take the (effectively global) svc_serv->sv_lock.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Protect the svc_sock->sk_deferred list with a new lock svc_sock->sk_defer_lock
instead of svc_serv->sv_lock. Using the more fine-grained lock reduces the
number of places we need to take the svc_serv lock.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Convert the svc_sock->sk_inuse counter from an int protected by
svc_serv->sv_lock, to an atomic. This reduces the number of places we need to
take the (effectively global) svc_serv->sv_lock.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Following are 11 patches from Greg Banks which combine to make knfsd more
Numa-aware. They reduce hitting on 'global' data structures, and create some
data-structures that can be node-local.
knfsd threads are bound to a particular node, and the thread to handle a new
request is chosen from the threads that are attach to the node that received
the interrupt.
The distribution of threads across nodes can be controlled by a new file in
the 'nfsd' filesystem, though the default approach of an even spread is
probably fine for most sites.
Some (old) numbers that show the efficacy of these patches: N == number of
NICs == number of CPUs == nmber of clients. Number of NUMA nodes == N/2
N Throughput, MiB/s CPU usage, % (max=N*100)
Before After Before After
--- ------ ---- ----- -----
4 312 435 350 228
6 500 656 501 418
8 562 804 690 589
This patch:
Move the aging of RPC/TCP connection sockets from the main svc_recv() loop to
a timer which uses a mark-and-sweep algorithm every 6 minutes. This reduces
the amount of work that needs to be done in the main RPC loop and the length
of time we need to hold the (effectively global) svc_serv->sv_lock.
[akpm@osdl.org: cleanup]
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It isn't needed as it is available in rqstp->rq_server, and dropping it allows
some local vars to be dropped.
[akpm@osdl.org: build fix]
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Userspace should create and bind a socket (but not connectted) and write the
'fd' to portlist. This will cause the nfs server to listen on that socket.
To close a socket, the name of the socket - as read from 'portlist' can be
written to 'portlist' with a preceding '-'.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This file will list all ports that nfsd has open.
Default when TCP enabled will be
ipv4 udp 0.0.0.0 2049
ipv4 tcp 0.0.0.0 2049
Later, the list of ports will be settable.
'portlist' chosen rather than 'ports', to avoid unnecessary confusion with
non-mainline patches which created 'ports' with different semantics.
[akpm@osdl.org: cleanups, build fix]
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
nfsd has some cleanup that it wants to do when the last thread exits, and
there will shortly be some more. So collect this all into one place and
define a callback for an rpc service to call when the service is about to be
destroyed.
[akpm@osdl.org: cleanups, build fix]
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is mostly included for parity with dec_nlink(), where we will have some
more hooks. This one should stay pretty darn straightforward for now.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
coverity spotted this one as possible dereference in the dprintk(),
but since there is only one caller of svc_create_socket(), which always
passes a valid sin, we dont need this check.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
pure s/u32/__be32/
[AV: large part based on Alexey's patches]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
* add svc_getnl():
Take network-endian value from buffer, convert to host-endian
and return it.
* add svc_putnl():
Take host-endian value, convert to network-endian and put it
into a buffer.
* annotate svc_getu32()/svc_putu32() as dealing with network-endian.
* convert to svc_getnl(), svc_putnl().
[AV: in large part it's a carved-up Alexey's patch]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This eliminates the i_blksize field from struct inode. Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.
Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.
[bunk@stusta.de: cleanup]
[akpm@osdl.org: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Rougly half of callers already do it by not checking return value
* Code in drivers/acpi/osl.c does the following to be sure:
(void)kmem_cache_destroy(cache);
* Those who check it printk something, however, slab_error already printed
the name of failed cache.
* XFS BUGs on failed kmem_cache_destroy which is not the decision
low-level filesystem driver should make. Converted to ignore.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (74 commits)
NFS: unmark NFS direct I/O as experimental
NFS: add comments clarifying the use of nfs_post_op_update()
NFSv4: rpc_mkpipe creating socket inodes w/out sk buffers
NFS: Use SEEK_END instead of hardcoded value
NFSv4: When mounting with a port=0 argument, substitute port=2049
NFSv4: Poll more aggressively when handling NFS4ERR_DELAY
NFSv4: Handle the condition NFS4ERR_FILE_OPEN
NFSv4: Retry lease recovery if it failed during a synchronous operation.
NFS: Don't invalidate the symlink we just stuffed into the cache
NFS: Make read() return an ESTALE if the file has been deleted
NFSv4: It's perfectly legal for clp to be NULL here....
NFS: nfs_lookup - don't hash dentry when optimising away the lookup
SUNRPC: Fix Oops in pmap_getport_done
SUNRPC: Add refcounting to the struct rpc_xprt
SUNRPC: Clean up soft task error handling
SUNRPC: Handle ENETUNREACH, EHOSTUNREACH and EHOSTDOWN socket errors
SUNRPC: rpc_delay() should not clobber the rpc_task->tk_status
Fix a referral error Oops
NFS: NFS_ROOT should use the new rpc_create API
NFS: Fix up compiler warnings on 64-bit platforms in client.c
...
Manually resolved conflict in net/sunrpc/xprtsock.c
This patch stop rpc_mkpipe from create S_IFSOCK nodes what don't
have associated sk buffers attached (which causes SELinux to oops
during NFSv4 mounts). Instead the S_IFIFO mode bit is set which
probably make more sense and seems to work just fine during
my connectathon and fsx testing...
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There is no guarantee that the parent task still exists when we exit from
the portmapper. Save the xprt instead.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In a subsequent patch, this will allow the portmapper to take a reference
to the rpc_xprt for which it is updating the port number, fixing an Oops.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
- Ensure that the task aborts the RPC call only when it has actually timed out.
- Ensure that req->rq_majortimeo is initialised correctly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In case of any of the above errors occuring, delay for 3 seconds, then
handle as if it were a timeout error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch is optional.
It has been suggested that the RPC client internal functions used by upper
layer protocols (such as NFS) be exported via EXPORT_SYMBOL_GPL. This
patch does that.
Test plan:
Compile kernel with CONFIG_NFS enabled as a module.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The two function call API for creating a new RPC client is now obsolete.
Remove it.
Also, remove an unnecessary check to see whether the caller is capable of
using privileged network services. The kernel RPC client always uses a
privileged ephemeral port by default; callers are responsible for checking
the authority of users to make use of any RPC service, or for specifying
that a nonprivileged port is acceptable.
Test plan:
Repeated runs of Connectathon locking suite. Check network trace to ensure
correctness of NLM requests and replies.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace xprt_create_proto/rpc_create_client calls in pmap_clnt.c with new
rpc_create() API.
Test plan:
Repeated runs of Connectathon locking suite. Check network trace for
proper PMAP calls and replies.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Prepare for more generic transport endpoint handling needed by transports
that might use different forms of addressing, such as IPv6.
Introduce a single function call to replace the two-call
xprt_create_proto/rpc_create_client API. Define a new rpc_create_args
structure that allows callers to pass in remote endpoint addresses of
varying length.
Test-plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
IPv6 addresses are big (128 bytes). Now that no RPC client consumers treat
the addr field in rpc_xprt structs as an opaque, and access it only via the
API calls, we can safely widen the field in the rpc_xprt struct to
accomodate larger addresses.
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hide the details of how the RPC client stores remote peer addresses from
the RPC pipefs implementation.
Test plan:
Connectathon with Kerberos 5 authentication.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Provide an API for formatting the remote peer address for printing without
exposing its internal structure. The address could be dynamic, so we
support a function call to get the address rather than reading it straight
out of a structure.
Test-plan:
Destructive testing (unplugging the network temporarily). Probably need
to rig a server where certain services aren't running, or that returns an
error for some typical operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a new method to the transport switch API to provide a way to convert
the opaque contents of xprt->addr to a human-readable string.
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
include/linux/sunrpc/clnt.h already includes include/linux/sunrpc/xprt.h.
We can remove xprt.h from source files that already include clnt.h.
Likewise include/linux/sunrpc/timer.h.
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hide the details of how the RPC client stores remote peer addresses from
the RPC portmapper.
Test plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Provide an API for retrieving the remote peer address without allowing
direct access to the rpc_xprt struct.
Test-plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce a clean transport switch API for plugging in different types of
rpcbind mechanisms. For instance, rpcbind can cleanly replace the
existing portmapper client, or a transport can choose to implement RPC
binding any way it likes.
Test plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The previous patches removed the last user of RPC child tasks, so we can
remove support for child tasks from net/sunrpc/sched.c now.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add comments for external functions, use modern function definition style,
and fix up dprintk formatting.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move connection and bind state that was maintained in the rpc_clnt
structure to the rpc_xprt structure. This will allow the creation of
a clean API for plugging in different types of bind mechanisms.
This brings improvements such as the elimination of a single spin lock to
control serialization for all in-kernel RPC binding. A set of per-xprt
bitops is used to serialize tasks during RPC binding, just like it now
works for making RPC transport connections.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hide the contents and format of xprt->addr by eliminating direct uses
of the xprt->addr.sin_port field. This change is required to support
alternate RPC host address formats (eg IPv6).
Test-plan:
Destructive testing (unplugging the network temporarily). Repeated runs of
Connectathon locking suite with UDP and TCP.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose
checksum still needs to be completed) and CHECKSUM_COMPLETE (for
incoming packets, device supplied full checksum).
Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check the bounds of length specifiers more thoroughly in the XDR decoding of
NFS4 readdir reply data.
Currently, if the server returns a bitmap or attr length that causes the
current decode point pointer to wrap, this could go undetected (consider a
small "negative" length on a 32-bit machine).
Also add a check into the main XDR decode handler to make sure that the amount
of data is a multiple of four bytes (as specified by RFC-1014). This makes
sure that we can do u32* pointer subtraction in the NFS client without risking
an undefined result (the result is undefined if the pointers are not correctly
aligned with respect to one another).
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from 5861fddd64a7eaf7e8b1a9997455a24e7f688092 commit)
rpc_unlink() and rpc_rmdir() will dput the dentry reference for you.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from a05a57effa71a1f67ccbfc52335c10c8b85f3f6a commit)
A prior call to rpc_depopulate() by rpc_rmdir() on the parent directory may
have already called simple_unlink() on this entry.
Add the same check to rpc_rmdir(). Also remove a redundant call to
rpc_close_pipes() in rpc_rmdir.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from 0bbfb9d20f6437c4031aa3bf9b4d311a053e58e3 commit)
Make it take a dentry argument instead of a path
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from 648d4116eb2509f010f7f34704a650150309b3e7 commit)
If we don't find the item we are lookng for, we allocate a new one, and
then grab the lock again and search to see if it has been added while we
did the alloc. If it had been added we need to 'cache_put' the newly
created item that we are never going to use. But as it hasn't been
initialised properly, putting it can cause an oops.
So move the ->init call earlier to that it will always be fully initilised
if we have to put it.
Thanks to Philipp Matthias Hahn <pmhahn@svs.Informatik.Uni-Oldenburg.de>
for reporting the problem.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If we're part way through transmitting a TCP request, and the client
errors, then we need to disconnect and reconnect the TCP socket in order to
avoid confusing the server.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from 031a50c8b9ea82616abd4a4e18021a25848941ce commit)
* the client is ia64 or any platform that actually implements
flush_dcache_page(), and
* the server returns fsinfo.dtpref >= client's PAGE_SIZE, and
* the server does *not* return post-op attributes for the directory
in the READDIR reply.
Problem diagnosed by Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add i_mutex ordering annotations to the sunrpc rpc_pipe code. This code has 3
levels of i_mutex hierarchy in some cases: parent dir, client dir and file
inside client dir; the i_mutex ordering is I_MUTEX_PARENT -> I_MUTEX_CHILD ->
I_MUTEX_NORMAL
This patch applies this ordering annotation to the various functions. This is
in line with the VFS expected ordering where it is always OK to lock a child
after locking a parent; the sunrpc code is very diligent in doing this
correctly.
Has no effect on non-lockdep kernels.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Server-side implementation of rpcsec_gss privacy, which enables encryption of
the payload of every rpc request and response.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a rq_sendfile_ok flag to svc_rqst which will be cleared in the privacy
case so that the wrapping code will get copies of the read data instead of
real page cache pages. This makes life simpler when we encrypt the response.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pull out some of the integrity code into its own function, otherwise
svcauth_gss_release() is going to become very ungainly after the addition of
privacy code.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adopt a simpler convention for gss_mech_put(), to simplify rsc_parse().
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
locking init cleanups:
- convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK()
- convert rwlocks in a similar manner
this patch was generated automatically.
Motivation:
- cleanliness
- lockdep needs control of lock initialization, which the open-coded
variants do not give
- it's also useful for -rt and for lock debugging in general
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.
The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).
The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.
This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.
The patch also makes the following changes:
(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.
(*) If one of the convenience function is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().
(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().
This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees which results in
dentries being left unculled.
However, with the way NFS superblock sharing are currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.
[*] Anonymous until discovered from another tree.
(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.
[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace all module uses with the new vfs_kern_mount() interface, and fix up
simple_pin_fs().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The XID generator uses get_random_bytes to generate an initial XID.
NFS_ROOT starts up before the random driver, though, so get_random_bytes
doesn't set a random XID for NFS_ROOT. This causes NFS_ROOT mount points
to reuse XIDs every time the client is booted. If the client boots often
enough, the server will start serving old replies out of its DRC.
Use net_random() instead.
Test plan:
I/O intensive workloads should perform well and generate no errors. Traces
taken during client reboots should show that NFS_ROOT mounts use unique
XIDs after every reboot.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the RPC client select privileged ephemeral source ports at
random. This improves DRC behavior on the server by using the
same port when reconnecting for the same mount point, but using
a different port for fresh mounts.
The Linux TCP implementation already does this for nonprivileged
ports. Note that TCP sockets in TIME_WAIT will prevent quick reuse
of a random ephemeral port number by leaving the port INUSE until
the connection transitions out of TIME_WAIT.
Test plan:
Connectathon against every known server implementation using multiple
mount points. Locking especially.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Both cause the 'entries' count in the export cache to be non-zero at module
removal time, so unregistering that cache fails and results in an oops.
1/ exp_pseudoroot (used for NFSv4 only) leaks a reference to an export
entry.
2/ sunrpc_cache_update doesn't increment the entries count when it adds
an entry.
Thanks to "david m. richter" <richterd@citi.umich.edu> for triggering the
problem and finding one of the bugs.
Cc: "david m. richter" <richterd@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hi,
the coverity checker spotted that cred is always NULL
when we jump to out_err ( there is just one case, when
we fail to allocate the memory for cred )
This is Coverity ID #79
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I was sloppy when generating a previous patch; I modified the callers of
krb5_make_checksum() to allocate memory for the buffer where the result is
returned, then forgot to modify krb5_make_checksum to stop allocating that
memory itself. The result is a per-packet memory leak. This fixes the
problem by removing the now-superfluous kmalloc().
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're using svc_take_page here to get another page for the tail in case one
wasn't already allocated. But there isn't always guaranteed to be another
page available.
Also fix a typo that made us check the tail buffer for space when we meant to
be checking the head buffer.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark the f_ops members of inodes as const, as well as fix the
ripple-through this causes by places that copy this f_ops and then "do
stuff" with it.
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We shouldn't really compare &new->h with anything when new ==NULL, and gather
three different if statements that all start
if (rv ...
into one large if.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We can now make some code static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
.. it makes some of the code nicer.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Cache_fresh is now only used in cache.c, so unexport it.
Part of cache_fresh (setting CACHE_VALID) should really be done under the
lock, while part (calling cache_revisit_request etc) must be done outside the
lock. So we split it up appropriately.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- in cache_check, h must be non-NULL as it has been de-referenced,
so don't bother checking for NULL.
- When a cache-item is updated, we need to call cache_revisit_request to see
if there is a pending request waiting for that item. We were using
a transition to CACHE_VALID to see if that was needed, however that is
wrong as an expired entry will still be marked 'valid' (as the data is valid
and will need to be released). So instead use an off transition for
CACHE_PENDING which is exactly the right thing to test.
- Add a little bit more debugging info.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The C++-like 'template' approach proves to be too ugly and hard to work with.
The old 'template' won't go away until all users are updated.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These were an unnecessary wart. Also only have one 'DefineSimpleCache..'
instead of two.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The 'auth_domain's are simply handles on internal data structures. They do
not cache information from user-space, and forcing them into the mold of a
'cache' misrepresents their true nature and causes confusion.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Modify well over a dozen mempool users to call mempool_create_slab_pool()
rather than calling mempool_create() with extra arguments, saving about 30
lines of code and increasing readability.
Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (103 commits)
SUNRPC,RPCSEC_GSS: spkm3--fix config dependencies
SUNRPC,RPCSEC_GSS: spkm3: import contexts using NID_cast5_cbc
LOCKD: Make nlmsvc_traverse_shares return void
LOCKD: nlmsvc_traverse_blocks return is unused
SUNRPC,RPCSEC_GSS: fix krb5 sequence numbers.
NFSv4: Dont list system.nfs4_acl for filesystems that don't support it.
SUNRPC,RPCSEC_GSS: remove unnecessary kmalloc of a checksum
SUNRPC: Ensure rpc_call_async() always calls tk_ops->rpc_release()
SUNRPC: Fix memory barriers for req->rq_received
NFS: Fix a race in nfs_sync_inode()
NFS: Clean up nfs_flush_list()
NFS: Fix a race with PG_private and nfs_release_page()
NFSv4: Ensure the callback daemon flushes signals
SUNRPC: Fix a 'Busy inodes' error in rpc_pipefs
NFS, NLM: Allow blocking locks to respect signals
NFS: Make nfs_fhget() return appropriate error values
NFSv4: Fix an oops in nfs4_fill_super
lockd: blocks should hold a reference to the nlm_file
NFSv4: SETCLIENTID_CONFIRM should handle NFS4ERR_DELAY/NFS4ERR_RESOURCE
NFSv4: Send the delegation stateid for SETATTR calls
...
Rewrap the overly long source code lines resulting from the previous
patch's addition of the slab cache flag SLAB_MEM_SPREAD. This patch
contains only formatting changes, and no function change.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark file system inode and similar slab caches subject to SLAB_MEM_SPREAD
memory spreading.
If a slab cache is marked SLAB_MEM_SPREAD, then anytime that a task that's
in a cpuset with the 'memory_spread_slab' option enabled goes to allocate
from such a slab cache, the allocations are spread evenly over all the
memory nodes (task->mems_allowed) allowed to that task, instead of favoring
allocation on the node local to the current cpu.
The following inode and similar caches are marked SLAB_MEM_SPREAD:
file cache
==== =====
fs/adfs/super.c adfs_inode_cache
fs/affs/super.c affs_inode_cache
fs/befs/linuxvfs.c befs_inode_cache
fs/bfs/inode.c bfs_inode_cache
fs/block_dev.c bdev_cache
fs/cifs/cifsfs.c cifs_inode_cache
fs/coda/inode.c coda_inode_cache
fs/dquot.c dquot
fs/efs/super.c efs_inode_cache
fs/ext2/super.c ext2_inode_cache
fs/ext2/xattr.c (fs/mbcache.c) ext2_xattr
fs/ext3/super.c ext3_inode_cache
fs/ext3/xattr.c (fs/mbcache.c) ext3_xattr
fs/fat/cache.c fat_cache
fs/fat/inode.c fat_inode_cache
fs/freevxfs/vxfs_super.c vxfs_inode
fs/hpfs/super.c hpfs_inode_cache
fs/isofs/inode.c isofs_inode_cache
fs/jffs/inode-v23.c jffs_fm
fs/jffs2/super.c jffs2_i
fs/jfs/super.c jfs_ip
fs/minix/inode.c minix_inode_cache
fs/ncpfs/inode.c ncp_inode_cache
fs/nfs/direct.c nfs_direct_cache
fs/nfs/inode.c nfs_inode_cache
fs/ntfs/super.c ntfs_big_inode_cache_name
fs/ntfs/super.c ntfs_inode_cache
fs/ocfs2/dlm/dlmfs.c dlmfs_inode_cache
fs/ocfs2/super.c ocfs2_inode_cache
fs/proc/inode.c proc_inode_cache
fs/qnx4/inode.c qnx4_inode_cache
fs/reiserfs/super.c reiser_inode_cache
fs/romfs/inode.c romfs_inode_cache
fs/smbfs/inode.c smb_inode_cache
fs/sysv/inode.c sysv_inode_cache
fs/udf/super.c udf_inode_cache
fs/ufs/super.c ufs_inode_cache
net/socket.c sock_inode_cache
net/sunrpc/rpc_pipe.c rpc_inode_cache
The choice of which slab caches to so mark was quite simple. I marked
those already marked SLAB_RECLAIM_ACCOUNT, except for fs/xfs, dentry_cache,
inode_cache, and buffer_head, which were marked in a previous patch. Even
though SLAB_RECLAIM_ACCOUNT is for a different purpose, it marks the same
potentially large file system i/o related slab caches as we need for memory
spreading.
Given that the rule now becomes "wherever you would have used a
SLAB_RECLAIM_ACCOUNT slab cache flag before (usually the inode cache), use
the SLAB_MEM_SPREAD flag too", this should be easy enough to maintain.
Future file system writers will just copy one of the existing file system
slab cache setups and tend to get it right without thinking.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Import the NID_cast5_cbc from the userland context. Not used.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use a spinlock to ensure unique sequence numbers when creating krb5 gss tokens.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Remove unnecessary kmalloc of temporary space to hold the md5 result; it's
small enough to just put on the stack.
This code may be called to process rpc's necessary to perform writes, so
there's a potential deadlock whenever we kmalloc() here. After this a
couple kmalloc()'s still remain, to be removed soon.
This also fixes a rare double-free on error noticed by coverity.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently this will not happen if we exit before rpc_new_task() was called.
Also fix up rpc_run_task() to do the same (for consistency).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need to ensure that all writes to the XDR buffers are done before
req->rq_received is visible to other processors.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduced by NFS metrics patch.
Test plan:
Compile kernel with CONFIG_NFS enabled on a 64-bit platform.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RPC_DEBUG_DATA no longer needed in net/sunrpc/xprt.c.
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up: replace rpc_call() helper with direct call to rpc_call_sync.
This makes NFSv2 and NFSv3 synchronous calls more computationally
efficient, and reduces stack consumption in functions that used to
invoke rpc_call more than once.
Test plan:
Compile kernel with CONFIG_NFS enabled. Connectathon on NFS version 2,
version 3, and version 4 mount points.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add fields to the rpc_procinfo struct that allow the display of a
human-readable name for each procedure in the rpc_iostats output.
Also fix it so that the NFSv4 stats are broken up correctly by
sub-procedure number. NFSv4 uses only two real RPC procedures:
NULL, and COMPOUND.
Test plan:
Mount with NFSv2, NFSv3, and NFSv4, and do "cat /proc/self/mountstats".
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a simple mechanism for collecting stats in the RPC client. Stats are
tabulated during xprt_release. Note that per_cpu shenanigans are not
required here because the RPC client already serializes on the transport
write lock.
Test plan:
Compile kernel with CONFIG_NFS enabled. Basic performance regression
testing with high-speed networking and high performance server.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Account for various things that occur while an RPC task is executed.
Separate timers for RPC round trip and RPC execution time show how
long RPC requests wait in queue before being sent. Eventually these
will be accumulated at xprt_release time in one place where they can
be viewed from userland.
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Monitor generic transport events. Add a transport switch callout to
format transport counters for export to user-land.
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RPC wait queue length will eventually be exported to userland via the RPC
iostats interface.
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds server ip address to be printed out when "server
requires stronger authentication" error occured.
Signed-off-by: Levent Serinol <lserinol@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If not, we cannot guarantee that idmap->idmap_dentry, gss_auth->dentry and
clnt->cl_dentry are valid dentries.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds a request_module call to rpcauth_create which will try
to auto-load the kernel module for the requested authentication flavor.
For kernels with modular sunrpc, this reduces the admin overhead for
the user.
Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In rpc_wake_up() and rpc_wake_up_status(), it is possible for the call to
__rpc_wake_up_task() to fail if another thread happens to be calling
rpc_wake_up_task() on the same rpc_task.
Problem noticed by Bruno Faccini.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The Coverity checker spotted this possible NULL pointer dereference in
rpc_new_client().
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes a bug whereby if two processes try to look up the same auth_gss
credential, they may end up creating two creds, and triggering two upcalls
because the upcall is performed before the credential is added to the
credcache.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The function rpc_timeout_upcall_queue runs from a workqueue, and hence
sleeping is not recommended. Convert the protection of the upcall queue
from being mutex-based to being spinlock-based.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we look up a new cred in the auth_gss downcall so that we can stuff
the credcache, we do not want that lookup to queue up an upcall in order
to initialise it. To do an upcall here not only redundant, but since we
are already holding the inode->i_mutex, it will trigger a lock recursion.
This patch allows rpcauth cache searches to indicate that they can cope
with uninitialised credentials.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix the syntax of some kernel-doc comments
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow mechanisms to return more varied errors on the context creation
downcall.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We require the server's gssd to create a completed context before asking the
kernel to send a final context init reply. However, gssd could be buggy, or
under some bizarre circumstances we might purge the context from our cache
before we get the chance to use it here.
Handle this case by returning GSS_S_NO_CONTEXT to the client.
Also move the relevant code here to a separate function rather than nesting
excessively.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Kerberos context initiation is handled in a single round trip, but other
mechanisms (including spkm3) may require more, so we need to handle the
GSS_S_CONTINUE case in svcauth_gss_accept. Send a null verifier.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The server code currently keeps track of the destination address on every
request so that it can reply using the same address. However we forget to do
that in the case of a deferred request. Remedy this oversight. >From folks
at PolyServe.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This removes more unneeded casts on the return value for kmalloc(),
sock_kmalloc(), and vmalloc().
Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of needless casting of kmalloc() return value in net/
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to wait for the cl_users to go down to zero, not for it to stay
positive. Quoth Trond (who wasn't even the author, but acked the wrong
version): "Argh! I need to increase my daily caffeine dosages."
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Convert sleep_on() to wait_event_timeout(). Probably safe with the BKL but
could be racy once BKL use in NFS-client is gone.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.
Modified-by: Ingo Molnar <mingo@elte.hu>
(finished the conversion)
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This changes some simple "if (x) BUG();" statements to "BUG_ON(x);"
Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some long time ago, dentry struct was carefully tuned so that on 32 bits
UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple
of memory cache lines.
Then RCU was added and dentry struct enlarged by two pointers, with nice
results for SMP, but not so good on UP, because breaking the above tuning
(128 + 8 = 136 bytes)
This patch reverts this unwanted side effect, by using an union (d_u),
where d_rcu and d_child are placed so that these two fields can share their
memory needs.
At the time d_free() is called (and d_rcu is really used), d_child is known
to be empty and not touched by the dentry freeing.
Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so
the previous content of d_child is not needed if said dentry was unhashed
but still accessed by a CPU because of RCU constraints)
As dentry cache easily contains millions of entries, a size reduction is
worth the extra complexity of the ugly C union.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Maneesh Soni <maneesh@in.ibm.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Paul Jackson <pj@sgi.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@epoch.ncsc.mil>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Print messages when an unsupported encrytion algorthm is requested or
there is an error locating a supported algorthm.
Signed-off-by: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Print messages when an unsupported encrytion algorthm is requested or
there is an error locating a supported algorthm.
Signed-off-by: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also update the tokenlen calculations to accomodate g_token_size().
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We ought never to be calling xprt_destroy() if there are still active
rpc_tasks. Optimise away the broken code that attempts to "fix" that case.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server decides to close the RPC socket, we currently don't actually
respond until either another RPC call is scheduled, or until xprt_autoclose()
gets called by the socket expiry timer (which may be up to 5 minutes
later).
This patch ensures that xprt_autoclose() is called much sooner if the
server closes the socket.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Every ULP that uses the in-kernel RPC client, except the NLM
client, sets cl_chatty. There's no reason why NLM shouldn't set it, so
just get rid of cl_chatty and always be verbose.
Test-plan:
Compile with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
At some point, transport endpoint addresses will no longer be IPv4. To hide
the structure of the rpc_xprt's address field from ULPs and port mappers,
add an API for setting the port number during an RPC bind operation.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We'd like to hide fields in rpc_xprt and rpc_clnt from upper layer protocols.
Start by creating an API to force RPC rebind, replacing logic that simply
sets cl_port to zero.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add RPC client transport switch support for replacing buffer management
on a per-transport basis.
In the current IPv4 socket transport implementation, RPC buffers are
allocated as needed for each RPC message that is sent. Some transport
implementations may choose to use pre-allocated buffers for encoding,
sending, receiving, and unmarshalling RPC messages, however. For
transports capable of direct data placement, the buffers can be carved
out of a pre-registered area of memory rather than from a slab cache.
Test-plan:
Millions of fsx operations. Performance characterization with "sio" and
"iozone". Use oprofile and other tools to look for significant regression
in CPU utilization.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch removes ths unused function xdr_decode_string().
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Charles Lever <Charles.Lever@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...and make sure that the "intr" flag also enables SIGHUP and SIGTERM to
interrupt RPC calls too (as per the Solaris implementation).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4 model requires us to complete all RPC calls that might
establish state on the server whether or not the user wants to
interrupt it. We may also need to schedule new work (including
new RPC calls) in order to cancel the new state.
The asynchronous RPC model will allow us to ensure that RPC calls
always complete, but in order to allow for "synchronous" RPC, we
want to add the ability to wait for completion.
The waits are, of course, interruptible.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Shrink the RPC task structure. Instead of storing separate pointers
for task->tk_exit and task->tk_release, put them in a structure.
Also pass the user data pointer as a parameter instead of passing it via
task->tk_calldata. This enables us to nest callbacks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I submitted this one previously - svc_tcp_recvfrom currently returns
any errors to the caller, including ECONNRESET and the like.
This is something svc_recv isn't able to deal with:
len = svsk->sk_recvfrom(rqstp);
[...]
if (len == 0 || len == -EAGAIN) {
[...]
return -EAGAIN;
}
[...]
return len;
The nfsd main loop will exit when it sees an error code other than
EAGAIN.
The following patch fixes this problem
svc_recv is not equipped to deal with error codes other than EAGAIN,
and will propagate anything else (such as ECONNRESET) up to nfsd,
causing it to exit.
Signed-off-by: Olaf Kirch <okir@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The hash.h hash_long function, when used on a 64 bit machine, ignores many
of the middle-order bits. (The prime chosen it too bit-sparse).
IP addresses for clients of an NFS server are very likely to differ only in
the low-order bits. As addresses are stored in network-byte-order, these
bits become middle-order bits in a little-endian 64bit 'long', and so do
not contribute to the hash. Thus you can have the situation where all
clients appear on one hash chain.
So, until hash_long is fixed (or maybe forever), us a hash function that
works well on IP addresses - xor the bytes together.
Thanks to "Iozone" <capps@iozone.org> for identifying this problem.
Cc: "Iozone" <capps@iozone.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I noticed that some of 'struct proto_ops' used in the kernel may share
a cache line used by locks or other heavily modified data. (default
linker alignement is 32 bytes, and L1_CACHE_LINE is 64 or 128 at
least)
This patch makes sure a 'struct proto_ops' can be declared as const,
so that all cpus can share all parts of it without false sharing.
This is not mandatory : a driver can still use a read/write structure
if it needs to (and eventually a __read_mostly)
I made a global stubstitute to change all existing occurences to make
them const.
This should reduce the possibility of false sharing on SMP, and
speedup some socket system calls.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gss_create_upcall() should not error just because rpc.gssd closed the
pipe on its end. Instead, it should requeue the pending requests and then
retry.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The elements on rpci->in_upcall are tracked by the filp->private_data,
which will ensure that they get released when the file is closed.
The exception is if rpc_close_pipes() gets called first, since that
sets rpci->ops to NULL.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In __rpc_purge_upcall (net/sunrpc/rpc_pipe.c), the newer code to clean up
the in_upcall list has a typo.
Thanks to Vince Busam <vbusam@google.com> for spotting this!
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>