We currently have a deadlock in which the state recovery thread
ends up blocking due to one of the locks which it is trying to
recover holding the nfs_inode->rwsem.
The situation is as follows: the state recovery thread is
scheduled in order to recover from a reboot. It immediately
drains the session, forcing all ordinary NFSv4.1 calls to
nfs41_setup_sequence() to be put to sleep. This includes the
file locking process that holds the nfs_inode->rwsem.
When the thread gets to nfs4_reclaim_locks(), it tries to
grab a write lock on nfs_inode->rwsem, and boom...
Fix is to have the lock drop the nfs_inode->rwsem while it is
doing RPC calls. We use a sequence lock in order to signal to
the locking process whether or not a state recovery thread has
run on that inode, in which case it should retry the lock.
Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit 324d003b0c.
The deadlock turned out to be caused by a workqueue limitation that has
now been worked around in the RPC code (see comment in rpc_free_task).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_open_permission_mask() should only check MAY_EXEC for files that
are opened with __FMODE_EXEC.
Also fix NFSv4 access-in-open path in a similar way -- openflags must be
used because fmode will not always have FMODE_EXEC set.
This patch fixes https://bugzilla.kernel.org/show_bug.cgi?id=49101
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Use the new FS-Cache invalidation facility from NFS to deal with foreign
changes being detected on the server rather than attempting to retire the old
cookie and get a new one.
The problem with the old method was that NFS did not wait for all outstanding
storage and retrieval ops on the cache to complete. There was no automatic
wait between the calls to ->readpages() and calls to invalidate_inode_pages2()
as the latter can only wait on locked pages that have been added to the
pagecache (which they haven't yet on entry to ->readpages()).
This was leading to oopses like the one below when an outstanding read got cut
off from its cookie by a premature release.
BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
IP: [<ffffffffa0075118>] __fscache_read_or_alloc_pages+0x1dd/0x315 [fscache]
PGD 15889067 PUD 15890067 PMD 0
Oops: 0000 [#1] SMP
CPU 0
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc
Pid: 4544, comm: tar Not tainted 3.1.0-rc4-fsdevel+ #1064 /DG965RY
RIP: 0010:[<ffffffffa0075118>] [<ffffffffa0075118>] __fscache_read_or_alloc_pages+0x1dd/0x315 [fscache]
RSP: 0018:ffff8800158799e8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800070d41e0 RCX: ffff8800083dc1b0
RDX: 0000000000000000 RSI: ffff880015879960 RDI: ffff88003e627b90
RBP: ffff880015879a28 R08: 0000000000000002 R09: 0000000000000002
R10: 0000000000000001 R11: ffff880015879950 R12: ffff880015879aa4
R13: 0000000000000000 R14: ffff8800083dc158 R15: ffff880015879be8
FS: 00007f671e9d87c0(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000000a8 CR3: 000000001587f000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tar (pid: 4544, threadinfo ffff880015878000, task ffff880015875040)
Stack:
ffffffffa00b1759 ffff8800070dc158 ffff8800000213da ffff88002a286508
ffff880015879aa4 ffff880015879be8 0000000000000001 ffff88002a2866e8
ffff880015879a88 ffffffffa00b20be 00000000000200da ffff880015875040
Call Trace:
[<ffffffffa00b1759>] ? nfs_fscache_wait_bit+0xd/0xd [nfs]
[<ffffffffa00b20be>] __nfs_readpages_from_fscache+0x7e/0x13f [nfs]
[<ffffffff81095fe7>] ? __alloc_pages_nodemask+0x156/0x662
[<ffffffffa0098763>] nfs_readpages+0xee/0x187 [nfs]
[<ffffffff81098a5e>] __do_page_cache_readahead+0x1be/0x267
[<ffffffff81098942>] ? __do_page_cache_readahead+0xa2/0x267
[<ffffffff81098d7b>] ra_submit+0x1c/0x20
[<ffffffff8109900a>] ondemand_readahead+0x28b/0x29a
[<ffffffff810990ce>] page_cache_sync_readahead+0x38/0x3a
[<ffffffff81091d8a>] generic_file_aio_read+0x2ab/0x67e
[<ffffffffa008cfbe>] nfs_file_read+0xa4/0xc9 [nfs]
[<ffffffff810c22c4>] do_sync_read+0xba/0xfa
[<ffffffff810a62c9>] ? might_fault+0x4e/0x9e
[<ffffffff81177a47>] ? security_file_permission+0x7b/0x84
[<ffffffff810c25dd>] ? rw_verify_area+0xab/0xc8
[<ffffffff810c29a4>] vfs_read+0xaa/0x13a
[<ffffffff810c2a79>] sys_read+0x45/0x6c
[<ffffffff813ac37b>] system_call_fastpath+0x16/0x1b
Reported-by: Mark Moseley <moseleymark@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
If an RPC call is interrupted, assume that the server hasn't processed
the RPC call so that the next time we use the slot, we know that if we
get a NFS4ERR_SEQ_MISORDERED or NFS4ERR_SEQ_FALSE_RETRY, we just have
to bump the sequence number.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Shave a few bytes off the slot table size by moving the RPC timestamp
into the sequence results.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server returns NFS4ERR_SEQ_MISORDERED, it could be a sign
that the slot was retired at some point. Retry the attempt after
reinitialising the slot sequence number to 1.
Also add a handler for NFS4ERR_SEQ_FALSE_RETRY. Just bump the slot
sequence number and retry...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, when an RPCSEC_GSS context has expired or is non-existent
and the users (Kerberos) credentials have also expired or are non-existent,
the client receives the -EKEYEXPIRED error and tries to refresh the context
forever. If an application is performing I/O, or other work against the share,
the application hangs, and the user is not prompted to refresh/establish their
credentials. This can result in a denial of service for other users.
Users are expected to manage their Kerberos credential lifetimes to mitigate
this issue.
Move the -EKEYEXPIRED handling into the RPC layer. Try tk_cred_retry number
of times to refresh the gss_context, and then return -EACCES to the application.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Most (all) NFS4ERR_BADSLOT errors are due to the client failing to
respect the server's sr_highest_slotid limit. This mainly happens
due to reordered RPC requests.
The way to handle it is simply to drop the slot that we're using,
and retry using the new highest_slotid limits.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 1f1ea6c "NFSv4: Fix buffer overflow checking in
__nfs4_get_acl_uncached" accidently dropped the checking for too small
result buffer length.
If someone uses getxattr on "system.nfs4_acl" on an NFSv4 mount
supporting ACLs, the ACL has not been cached and the buffer suplied is
too short, we still copy the complete ACL, resulting in kernel and user
space memory corruption.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, we see a lot of bouncing for the value of highest_used_slotid
due to the fact that slots are getting freed, instead of getting instantly
transmitted to the next waiting task.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to preserve the rpc_task priority for things like writebacks,
that may have differing levels of urgency.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All it does is pass its arguments through to another function. Let's
cut out the middleman...
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Privileged rpc calls are those that are run by the state recovery thread,
in cases where we're trying to recover the system after a server reboot
or a network partition. In those cases, we want to fence off all other
rpc calls (see nfs4_begin_drain_session()) so that they don't end up
using stateids or clientids that are in the process of being recovered.
Prior to this patch, we had to set up special callback functions in
order to declare an rpc call as being privileged.
By adding a new field to the sequence arguments, this patch simplifies
things considerably, and allows us to declare the rpc call as privileged
before it is run.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is more important to preserve the task priority behaviour, which ensures
that things like reclaim writes take precedence over background and kupdate
writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We shouldn't need to pass the 'cache_reply' parameter if we
initialise the sequence_args/sequence_res in the caller.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Nobody calls nfs4_setup_sequence or nfs41_setup_sequence without
also calling rpc_call_start() on success. This commit therefore
folds the rpc_call_start call into nfs41_setup_sequence().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There is no point in using nfs4_setup_sequence or nfs4_sequence_done
in pure NFSv4.1 functions. We already know that those have sessions...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server requests a lower target_highest_slotid, then ensure
that we ping it with at least one RPC call containing an
appropriate SEQUENCE op. This ensures that the server won't need to
send a recall callback in order to shrink the slot table.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_wait_clnt_recover and nfs4_client_recover_expired_lease are both
generic state related functions. As such, they belong in nfs4state.c,
and not nfs4proc.c
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Coalesce nfs4_check_drain_bc_complete and nfs4_check_drain_fc_complete
into a single function that can be called when the slot table is known
to be empty, then change nfs4_callback_free_slot() and nfs4_free_slot()
to use it.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the NFSv4.1 session slot allocation fails due to an ENOMEM condition,
then set the task->tk_timeout to 1/4 second to ensure that we do retry
the slot allocation more quickly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The state manager no longer needs any special machinery to stop the
session flow and resize the slot table. It is all done on the fly by
the SEQUENCE op code now.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of an array of slots, use a singly linked list of slots that
can be dynamically appended to or shrunk.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allow the server to control the size of the session slot table
by adjusting the value of sr_target_max_slots in the reply to the
SEQUENCE operation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that the NFSv4.1 CB_RECALL_SLOT callback updates the slot table
target max slotid safely.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When the server tells us that it is dynamically resizing the session
replay cache, we should reset the sequence number for those slots
that have been deallocated.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Dynamic slot allocation in NFSv4.1 depends on the client being able to
track the server's target value for the highest slotid in the
slot table. See the reference in Section 2.10.6.1 of RFC5661.
To avoid ordering problems in the case where 2 SEQUENCE replies contain
conflicting updates to this target value, we also introduce a generation
counter, to track whether or not an RPC containing a SEQUENCE operation
was launched before or after the last update.
Also rename the nfs4_slot_table target_max_slots field to
'target_highest_slotid' to avoid confusion with a slot
table size or number of slots.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Change the argument to take the pointer to the slot, instead of
just the slotid.
We know that the new value of highest_used_slot must be less than
the current value. No need to scan the whole table.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up the NFSv4.1 slot allocation by replacing nfs_find_slot() with
a function nfs_alloc_slot() that returns a pointer to the nfs4_slot
instead of an offset into the slot table.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of doing slot table pointer gymnastics every time we want to
know which slot we're using.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the session pointer into the slot table, then have struct nfs4_slot
point to that slot table.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We must always bump the clientid sequence number after a successful
call to CREATE_SESSION on the server. The result of
nfs4_verify_channel_attrs() is irrelevant to that requirement.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we're mounting a new filesystem, ensure that the session has negotiated
large enough request and reply sizes to match the wsize and rsize mount
arguments.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't store the target request and response sizes in the same
variables used to store the server's replies to those targets.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If I mount an NFS v4.1 server to a single client multiple times and then
run xfstests over each mountpoint I usually get the client into a state
where recovery deadlocks. The server informs the client of a
cb_path_down sequence error, the client then does a
bind_connection_to_session and checks the status of the lease.
I found that bind_connection_to_session sets the NFS4_SESSION_DRAINING
flag on the client, but this flag is never unset before
nfs4_check_lease() reaches nfs4_proc_sequence(). This causes the client
to deadlock, halting all NFS activity to the server. nfs4_proc_sequence()
is only called by the state manager, so I can change it to run in privileged
mode to bypass the NFS4_SESSION_DRAINING check and avoid the deadlock.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Return errno - not an NFS4ERR_. This worked because NFS4ERR_ACCESS == EACCES.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use nfs_sb_deactive_async instead of nfs_sb_deactive when in a workqueue
context. This avoids a deadlock where rpc_shutdown_client loops forever
in a workqueue kworker context, trying to kill all RPC tasks associated with
the client, while one or more of these tasks have already been assigned to the
same kworker (and will never run rpc_exit_task).
This approach is needed because RPC tasks that have already been assigned
to a kworker by queue_work cannot be canceled, as explained in the comment
for workqueue.c:insert_wq_barrier.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
[Trond: add module_get/put.]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the state recovery machinery is triggered by the call to
nfs4_async_handle_error() then we can deadlock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
If we do not release the sequence id in cases where we fail to get a
session slot, then we can deadlock if we hit a recovery scenario.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Currently, we will schedule session recovery and then return to the
caller of nfs4_handle_exception. This works for most cases, but causes
a hang on the following test case:
Client Server
------ ------
Open file over NFS v4.1
Write to file
Expire client
Try to lock file
The server will return NFS4ERR_BADSESSION, prompting the client to
schedule recovery. However, the client will continue placing lock
attempts and the open recovery never seems to be scheduled. The
simplest solution is to wait for session recovery to run before retrying
the lock.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
nfs4_open_recover_helper zeros the nfs4_opendata result structures, removing
the result access_request information which leads to an XDR decode error.
Move the setting of the result access_request field to nfs4_init_opendata_res
which sets all the other required nfs4_opendata result fields and is shared
between the open and recover open paths.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We currently make no distinction in attribute requests between normal OPENs
and OPEN with CLAIM_PREVIOUS. This offers more possibility of failures in
the GETATTR response which foils OPEN reclaim attempts.
Reduce the requested attributes to the bare minimum needed to update the
reclaim open stateid and split nfs4_opendata_to_nfs4_state processing
accordingly.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't put an ACCESS op in OPEN compound if O_EXCL, because ACCESS
will return permission denied for all bits until close.
Fixes a regression due to commit 6168f62c (NFSv4: Add ACCESS operation to
OPEN compound)
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't check MAY_WRITE as a newly created file may not have write mode bits,
but POSIX allows the creating process to write regardless.
This is ok because NFSv4 OPEN ops handle write permissions correctly -
the ACCESS in the OPEN compound is to differentiate READ v EXEC permissions.
Fixes a regression due to commit 6168f62c (NFSv4: Add ACCESS operation to
OPEN compound)
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the layoutget call returns a stateid error, we want to invalidate the
layout stateid, and/or recover the open stateid.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
After commit e38eb650 (NFS: set_pnfs_layoutdriver() from
nfs4_proc_fsinfo()), set_pnfs_layoutdriver() is called inside
nfs4_proc_fsinfo(), but pnfs_blksize is not set. It causes setting
blocklayoutdriver failure and pnfsblock mount failure.
Cc: stable <stable@vger.kernel.org> [since v3.5]
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An optional boot parameter is introduced to allow client
administrators to specify a string that the Linux NFS client can
insert into its nfs_client_id4 id string, to make it both more
globally unique, and to ensure that it doesn't change even if the
client's nodename changes.
If this boot parameter is not specified, the client's nodename is
used, as before.
Client installation procedures can create a unique string (typically,
a UUID) which remains unchanged during the lifetime of that client
instance. This works just like creating a UUID for the label of the
system's root and boot volumes.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
"Server trunking" is a fancy named for a multi-homed NFS server.
Trunking might occur if a client sends NFS requests for a single
workload to multiple network interfaces on the same server. There
are some implications for NFSv4 state management that make it useful
for a client to know if a single NFSv4 server instance is
multi-homed. (Note this is only a consideration for NFSv4, not for
legacy versions of NFS, which are stateless).
If a client cares about server trunking, no NFSv4 operations can
proceed until that client determines who it is talking to. Thus
server IP trunking discovery must be done when the client first
encounters an unfamiliar server IP address.
The nfs_get_client() function walks the nfs_client_list and matches
on server IP address. The outcome of that walk tells us immediately
if we have an unfamiliar server IP address. It invokes
nfs_init_client() in this case. Thus, nfs4_init_client() is a good
spot to perform trunking discovery.
Discovery requires a client to establish a fresh client ID, so our
client will now send SETCLIENTID or EXCHANGE_ID as the first NFS
operation after a successful ping, rather than waiting for an
application to perform an operation that requires NFSv4 state.
The exact process for detecting trunking is different for NFSv4.0 and
NFSv4.1, so a minorversion-specific init_client callout method is
introduced.
CLID_INUSE recovery is important for the trunking discovery process.
CLID_INUSE is a sign the server recognizes the client's nfs_client_id4
id string, but the client is using the wrong principal this time for
the SETCLIENTID operation. The SETCLIENTID must be retried with a
series of different principals until one works, and then the rest of
trunking discovery can proceed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, when identifying itself to NFS servers, the Linux NFS
client uses a unique nfs_client_id4.id string for each server IP
address it talks with. For example, when client A talks to server X,
the client identifies itself using a string like "AX". The
requirements for these strings are specified in detail by RFC 3530
(and bis).
This form of client identification presents a problem for Transparent
State Migration. When client A's state on server X is migrated to
server Y, it continues to be associated with string "AX." But,
according to the rules of client string construction above, client
A will present string "AY" when communicating with server Y.
Server Y thus has no way to know that client A should be associated
with the state migrated from server X. "AX" is all but abandoned,
interfering with establishing fresh state for client A on server Y.
To support transparent state migration, then, NFSv4.0 clients must
instead use the same nfs_client_id4.id string to identify themselves
to every NFS server; something like "A".
Now a client identifies itself as "A" to server X. When a file
system on server X transitions to server Y, and client A identifies
itself as "A" to server Y, Y will know immediately that the state
associated with "A," whether it is native or migrated, is owned by
the client, and can merge both into a single lease.
As a pre-requisite to adding support for NFSv4 migration to the Linux
NFS client, this patch changes the way Linux identifies itself to NFS
servers via the SETCLIENTID (NFSv4 minor version 0) and EXCHANGE_ID
(NFSv4 minor version 1) operations.
In addition to removing the server's IP address from nfs_client_id4,
the Linux NFS client will also no longer use its own source IP address
as part of the nfs_client_id4 string. On multi-homed clients, the
value of this address depends on the address family and network
routing used to contact the server, thus it can be different for each
server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The OPEN operation has no way to differentiate an open for read and an
open for execution - both look like read to the server. This allowed
users to read files that didn't have READ access but did have EXEC access,
which is obviously wrong.
This patch adds an ACCESS call to the OPEN compound to handle the
difference between OPENs for reading and execution. Since we're going
through the trouble of calling ACCESS, we check all possible access bits
and cache the results hopefully avoiding an ACCESS call in the future.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I put the client into an open recovery loop by:
Client: Open file
read half
Server: Expire client (echo 0 > /sys/kernel/debug/nfsd/forget_clients)
Client: Drop vm cache (echo 3 > /proc/sys/vm/drop_caches)
finish reading file
This causes a loop because the client never updates the nfs4_state after
discovering that the delegation is invalid. This means it will keep
trying to read using the bad delegation rather than attempting to re-open
the file.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
CC: stable@vger.kernel.org [3.4+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we are reading through a delegation, and the delegation is OK then
state->stateid will still point to a delegation stateid and not an open
stateid.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently it does not do so if the RPC call failed to start. Fix is to
move the decrement of plh_block_lgets into nfs4_layoutreturn_release.
Also remove a redundant test of task->tk_status in nfs4_layoutreturn_done:
if lrp->res.lrs_present is set, then obviously the RPC call succeeded.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we sleep after dropping the inode->i_lock, then we are no longer
atomic with respect to the rpc_wake_up() call in pnfs_layout_remove_lseg().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we do return errors from nfs4_proc_layoutget() and that we
don't mark the layout as having failed if the error was due to a
signal or resource problem on the client side.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to be able to pass on the information that the page was not
dirtied under a lock. Instead of adding a flag parameter, do this
by passing a pointer to a 'struct nfs_lock_owner' that may be NULL.
Also reuse this structure in struct nfs_lock_context to carry the
fl_owner_t and pid_t.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In case of error, the function rpcauth_create() returns ERR_PTR()
and never returns NULL pointer. The NULL test in the return value
check should be replaced with IS_ERR().
dpatch engine is used to auto generated this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pass the checks made by decode_getacl back to __nfs4_get_acl_uncached
so that it knows if the acl has been truncated.
The current overflow checking is broken, resulting in Oopses on
user-triggered nfs4_getfacl calls, and is opaque to the point
where several attempts at fixing it have failed.
This patch tries to clean up the code in addition to fixing the
Oopses by ensuring that the overflow checks are performed in
a single place (decode_getacl). If the overflow check failed,
we will still be able to report the acl length, but at least
we will no longer attempt to cache the acl or copy the
truncated contents to user space.
Reported-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Sachin Prabhu <sprabhu@redhat.com>
Ensure that the user supplied buffer size doesn't cause us to overflow
the 'pages' array.
Also fix up some confusion between the use of PAGE_SIZE and
PAGE_CACHE_SIZE when calculating buffer sizes. We're not using
the page cache for anything here.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When the NFS_COOKIEVERF helper macro was converted into a static
inline function in commit 99fadcd764 (nfs: convert NFS_*(inode)
helpers to static inline), we broke the initialisation of the
readdir cookies, since that depended on doing a memset with an
argument of 'sizeof(NFS_COOKIEVERF(inode))' which therefore
changed from sizeof(be32 cookieverf[2]) to sizeof(be32 *).
At this point, NFS_COOKIEVERF seems to be more of an obfuscation
than a helper, so the best thing would be to just get rid of it.
Also see: https://bugzilla.kernel.org/show_bug.cgi?id=46881
Reported-by: Andi Kleen <andi@firstfloor.org>
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Currently, we do not take into account the size of the 16 byte
struct nfs4_cached_acl header, when deciding whether or not we should
cache the acl data. Consequently, we will end up allocating an
8k buffer in order to fit a maximum size 4k acl.
This patch adjusts the calculation so that we limit the cache size
to 4k for the acl header+data.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Resetting the cursor xdr->p to a previous value is not a safe
practice: if the xdr_stream has crossed out of the initial iovec,
then a bunch of other fields would need to be reset too.
Fix this issue by using xdr_enter_page() so that the buffer gets
page aligned at the bitmap _before_ we decode it.
Also fix the confusion of the ACL length with the page buffer length
by not adding the base offset to the ACL length...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
disconnected data server) we've been sending layoutreturn calls
while there is potentially still outstanding I/O to the data
servers. The reason we do this is to avoid races between replayed
writes to the MDS and the original writes to the DS.
When this happens, the BUG_ON() in nfs4_layoutreturn_done can
be triggered because it assumes that we would never call
layoutreturn without knowing that all I/O to the DS is
finished. The fix is to remove the BUG_ON() now that the
assumptions behind the test are obsolete.
Reported-by: Boaz Harrosh <bharrosh@panasas.com>
Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>=3.5]
since the only user of nfs4_proc_layoutget is send_layoutget, which
ignores its return value, there is no reason to return any value.
Signed-off-by: Idan Kedar <idank@tonian.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
we have encountered a bug whereby reading a lot of files (copying
fedora's /bin) from a pNFS mount and hitting Ctrl+C in the middle caused
a general protection fault in xdr_shrink_bufhead. this function is
called when decoding the response from LAYOUTGET. the decoding is done
by a worker thread, and the caller of LAYOUTGET waits for the worker
thread to complete.
hitting Ctrl+C caused the synchronous wait to end and the next thing the
caller does is to free the pages, so when the worker thread calls
xdr_shrink_bufhead, the pages are gone. therefore, the cleanup of these
pages has been moved to nfs4_layoutget_release.
Signed-off-by: Idan Kedar <idank@tonian.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
v2 and v4 don't use it, so I create two new nfs_rpc_ops functions to
initialize the ACL client only when we are using v3.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I'm already looking up the nfs subversion in nfs_fs_mount(), so I have
easy access to rpc_ops that used to be difficult to reach. This allows
me to set up a different mount path for NFS v2/3 and NFS v4.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
fl_type is not a bitmap.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFS v4 file inode operations are already already in nfs4proc.c, so
this patch just needs to move the directory operations to the same file.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For NFSv4 minor version 0, currently the cl_id_uniquifier allows the
Linux client to generate a unique nfs_client_id4 string whenever a
server replies with NFS4ERR_CLID_INUSE.
This implementation seems to be based on a flawed reading of RFC
3530. NFS4ERR_CLID_INUSE actually means that the client has presented
this nfs_client_id4 string with a different principal at some time in
the past, and that lease is still in use on the server.
For a Linux client this might be rather difficult to achieve: the
authentication flavor is named right in the nfs_client_id4.id
string. If we change flavors, we change strings automatically.
So, practically speaking, NFS4ERR_CLID_INUSE means there is some other
client using our string. There is not much that can be done to
recover automatically. Let's make it a permanent error.
Remove the recovery logic in nfs4_proc_setclientid(), and remove the
cl_id_uniquifier field from the nfs_client data structure. And,
remove the authentication flavor from the nfs_client_id4 string.
Keeping the authentication flavor in the nfs_client_id4.id string
means that we could have a separate lease for each authentication
flavor used by mounts on the client. But we want just one lease for
all the mounts on this client.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv4 state recovery is not always successful. Failure is signalled
by setting the nfs_client.cl_cons_state to a negative (errno) value,
then waking waiters.
Currently this can happen only during mount processing. I'm about to
add an explicit case where state recovery failure during normal
operation should force all NFS requests waiting on that state recovery
to exit.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The gss_mech_list_pseudoflavors() function provides a list of
currently registered GSS pseudoflavors. This list does not include
any non-GSS flavors that have been registered with the RPC client.
nfs4_find_root_sec() currently adds these extra flavors by hand.
Instead, nfs4_find_root_sec() should be looking at the set of flavors
that have been explicitly registered via rpcauth_register(). And,
other areas of code will soon need the same kind of list that
contains all flavors the kernel currently knows about (see below).
Rather than cloning the open-coded logic in nfs4_find_root_sec() to
those new places, introduce a generic RPC function that generates a
full list of registered auth flavors and pseudoflavors.
A new rpc_authops method is added that lists a flavor's
pseudoflavors, if it has any. I encountered an interesting module
loader loop when I tried to get the RPC client to invoke
gss_mech_list_pseudoflavors() by name.
This patch is a pre-requisite for server trunking discovery, and a
pre-requisite for fixing up the in-kernel mount client to do better
automatic security flavor selection.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Squelch compiler warnings:
fs/nfs/nfs4proc.c: In function ‘__nfs4_get_acl_uncached’:
fs/nfs/nfs4proc.c:3811:14: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
fs/nfs/nfs4proc.c:3818:15: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
Introduced by commit bf118a34 "NFSv4: include bitmap in nfsv4 get
acl data", Dec 7, 2011.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
As a finishing touch, add appropriate documenting comments and some
debugging printk's.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Instead of open-coded flag manipulation, use test_bit() and
clear_bit() just like all other accessors of the state->flag field.
This also eliminates several unnecessary implicit integer type
conversions.
To make it absolutely clear what is going on, a number of comments
are introduced.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The "state->flags & flags" test in nfs41_check_expired_stateid()
allows the state manager to squelch a TEST_STATEID operation when
it is known for sure that a state ID is no longer valid. If the
lease was purged, for example, the client already knows that state
ID is now defunct.
But open recovery is still needed for that inode.
To force a call to nfs4_open_expired(), change the default return
value for nfs41_check_expired_stateid() to force open recovery, and
the default return value for nfs41_check_locks() to force lock
recovery, if the requested flags are clear. Fix suggested by Bryan
Schumaker.
Also, the presence of a delegation state ID must not prevent normal
open recovery. The delegation state ID must be cleared if it was
revoked, but once cleared I don't think it's presence or absence has
any bearing on whether open recovery is still needed. So the logic
is adjusted to ignore the TEST_STATEID result for the delegation
state ID.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The result of a TEST_STATEID operation can indicate a few different
things:
o If NFS_OK is returned, then the client can continue using the
state ID under test, and skip recovery.
o RFC 5661 says that if the state ID was revoked, then the client
must perform an explicit FREE_STATEID before trying to re-open.
o If the server doesn't recognize the state ID at all, then no
FREE_STATEID is needed, and the client can immediately continue
with open recovery.
Let's err on the side of caution: if the server clearly tells us the
state ID is unknown, we skip the FREE_STATEID. For any other error,
we issue a FREE_STATEID. Sometimes that FREE_STATEID will be
unnecessary, but leaving unused state IDs on the server needlessly
ties up resources.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The TEST_STATEID and FREE_STATEID operations can return
-NFS4ERR_BAD_STATEID, -NFS4ERR_OLD_STATEID, or -NFS4ERR_DEADSESSION.
nfs41_{test,free}_stateid() should not pass these errors to
nfs4_handle_exception() during state recovery, since that will
recursively kick off state recovery again, resulting in a deadlock.
In particular, when the TEST_STATEID operation returns NFS4_OK,
res.status can contain one of these errors. _nfs41_test_stateid()
replaces NFS4_OK with the value in res.status, which is then returned
to callers.
But res.status is not passed through nfs4_stat_to_errno(), and thus is
a positive NFS4ERR value. Currently callers are only interested in
!NFS4_OK, and nfs4_handle_exception() ignores positive values.
Thus the res.status values are currently ignored by
nfs4_handle_exception() and won't cause the deadlock above. Thanks to
this missing negative, it is only when these operations fail (which
is very rare) that a deadlock can occur.
Bryan agrees the original intent was to return res.status as a
negative NFS4ERR value to callers of nfs41_test_stateid().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't pass nfs_open_context() to ->create(). Only the NFS4 implementation
needed that and only because it wanted to return an open file using open
intents. That task has been replaced by ->atomic_open so it is not necessary
anymore to pass the context to the create rpc operation.
Despite nfs4_proc_create apparently being okay with a NULL context it Oopses
somewhere down the call chain. So allocate a context here.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pNFS needs to select a write function based on the layout driver
currently in use, so I let each NFS version decide how to best handle
initializing writes.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
pNFS needs to select a read function based on the layout driver
currently in use, so I let each NFS version decide how to best handle
initializing reads.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>