AFS server records get removed from the net->fs_servers tree when
they're deleted, but not from the net->fs_addresses{4,6} lists, which
can lead to an oops in afs_find_server() when a server record has been
removed, for instance during rmmod.
Fix this by deleting the record from the by-address lists before posting
it for RCU destruction.
The reason this hasn't been noticed before is that the fileserver keeps
probing the local cache manager, thereby keeping the service record
alive, so the oops would only happen when a fileserver eventually gets
bored and stops pinging or if the module gets rmmod'd and a call comes
in from the fileserver during the window between the server records
being destroyed and the socket being closed.
The oops looks something like:
BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
...
Workqueue: kafsd afs_process_async_call [kafs]
RIP: 0010:afs_find_server+0x271/0x36f [kafs]
...
Call Trace:
afs_deliver_cb_init_call_back_state3+0x1f2/0x21f [kafs]
afs_deliver_to_call+0x1ee/0x5e8 [kafs]
afs_process_async_call+0x5b/0xd0 [kafs]
process_one_work+0x2c2/0x504
worker_thread+0x1d4/0x2ac
kthread+0x11f/0x127
ret_from_fork+0x24/0x30
Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix warnings raised by checker, including:
(*) Warnings raised by unequal comparison for the purposes of sorting,
where the endianness doesn't matter:
fs/afs/addr_list.c:246:21: warning: restricted __be16 degrades to integer
fs/afs/addr_list.c:246:30: warning: restricted __be16 degrades to integer
fs/afs/addr_list.c:248:21: warning: restricted __be32 degrades to integer
fs/afs/addr_list.c:248:49: warning: restricted __be32 degrades to integer
fs/afs/addr_list.c:283:21: warning: restricted __be16 degrades to integer
fs/afs/addr_list.c:283:30: warning: restricted __be16 degrades to integer
(*) afs_set_cb_interest() is not actually used and can be removed.
(*) afs_cell_gc_delay() should be provided with a sysctl.
(*) afs_cell_destroy() needs to use rcu_access_pointer() to read
cell->vl_addrs.
(*) afs_init_fs_cursor() should be static.
(*) struct afs_vnode::permit_cache needs to be marked __rcu.
(*) afs_server_rcu() needs to use rcu_access_pointer().
(*) afs_destroy_server() should use rcu_access_pointer() on
server->addresses as the server object is no longer accessible.
(*) afs_find_server() casts __be16/__be32 values to int in order to
directly compare them for the purpose of finding a match in a list,
but is should also annotate the cast with __force to avoid checker
warnings.
(*) afs_check_permit() accesses vnode->permit_cache outside of the RCU
readlock, though it doesn't then access the value; the extraneous
access is deleted.
False positives:
(*) Conditional locking around the code in xdr_decode_AFSFetchStatus. This
can be dealt with in a separate patch.
fs/afs/fsclient.c:148:9: warning: context imbalance in 'xdr_decode_AFSFetchStatus' - different lock contexts for basic block
(*) Incorrect handling of seq-retry lock context balance:
fs/afs/inode.c:455:38: warning: context imbalance in 'afs_getattr' - different
lock contexts for basic block
fs/afs/server.c:52:17: warning: context imbalance in 'afs_find_server' - different lock contexts for basic block
fs/afs/server.c:128:17: warning: context imbalance in 'afs_find_server_by_uuid' - different lock contexts for basic block
Errors:
(*) afs_lookup_cell_rcu() needs to break out of the seq-retry loop, not go
round again if it successfully found the workstation cell.
(*) Fix UUID decode in afs_deliver_cb_probe_uuid().
(*) afs_cache_permit() has a missing rcu_read_unlock() before one of the
jumps to the someone_else_changed_it label. Move the unlock to after
the label.
(*) afs_vl_get_addrs_u() is using ntohl() rather than htonl() when
encoding to XDR.
(*) afs_deliver_yfsvl_get_endpoints() is using htonl() rather than ntohl()
when decoding from XDR.
Signed-off-by: David Howells <dhowells@redhat.com>
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
YFS VL servers offer an upgraded Volume Location service that can return
IPv6 addresses to fileservers and volume servers in addition to IPv4
addresses using the YFSVL.GetEndpoints operation which we should use if
it's available.
To this end:
(1) Make rxrpc_kernel_recv_data() return the call's current service ID so
that the caller can detect service upgrade and see what the service
was upgraded to.
(2) When we see a VL server address we haven't seen before, send a
VL.GetCapabilities operation to it with the service upgrade bit set.
If we get an upgrade to the YFS VL service, change the service ID in
the address list for that address to use the upgraded service and set
a flag to note that this appears to be a YFS-compatible server.
(3) If, when a server's addresses are being looked up, we note that we
previously detected a YFS-compatible server, then send the
YFSVL.GetEndpoints operation rather than VL.GetAddrsU.
(4) Build a fileserver address list from the reply of YFSVL.GetEndpoints,
including both IPv4 and IPv6 addresses. Volume server addresses are
discarded.
(5) The address list is sorted by address and port now, instead of just
address. This allows multiple servers on the same host sitting on
different ports.
Signed-off-by: David Howells <dhowells@redhat.com>
The current code assumes that volumes and servers are per-cell and are
never shared, but this is not enforced, and, indeed, public cells do exist
that are aliases of each other. Further, an organisation can, say, set up
a public cell and a private cell with overlapping, but not identical, sets
of servers. The difference is purely in the database attached to the VL
servers.
The current code will malfunction if it sees a server in two cells as it
assumes global address -> server record mappings and that each server is in
just one cell.
Further, each server may have multiple addresses - and may have addresses
of different families (IPv4 and IPv6, say).
To this end, the following structural changes are made:
(1) Server record management is overhauled:
(a) Server records are made independent of cell. The namespace keeps
track of them, volume records have lists of them and each vnode
has a server on which its callback interest currently resides.
(b) The cell record no longer keeps a list of servers known to be in
that cell.
(c) The server records are now kept in a flat list because there's no
single address to sort on.
(d) Server records are now keyed by their UUID within the namespace.
(e) The addresses for a server are obtained with the VL.GetAddrsU
rather than with VL.GetEntryByName, using the server's UUID as a
parameter.
(f) Cached server records are garbage collected after a period of
non-use and are counted out of existence before purging is allowed
to complete. This protects the work functions against rmmod.
(g) The servers list is now in /proc/fs/afs/servers.
(2) Volume record management is overhauled:
(a) An RCU-replaceable server list is introduced. This tracks both
servers and their coresponding callback interests.
(b) The superblock is now keyed on cell record and numeric volume ID.
(c) The volume record is now tied to the superblock which mounts it,
and is activated when mounted and deactivated when unmounted.
This makes it easier to handle the cache cookie without causing a
double-use in fscache.
(d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
to get the server UUID list.
(e) The volume name is updated if it is seen to have changed when the
volume is updated (the update is keyed on the volume ID).
(3) The vlocation record is got rid of and VLDB records are no longer
cached. Sufficient information is stored in the volume record, though
an update to a volume record is now no longer shared between related
volumes (volumes come in bundles of three: R/W, R/O and backup).
and the following procedural changes are made:
(1) The fileserver cursor introduced previously is now fleshed out and
used to iterate over fileservers and their addresses.
(2) Volume status is checked during iteration, and the server list is
replaced if a change is detected.
(3) Server status is checked during iteration, and the address list is
replaced if a change is detected.
(4) The abort code is saved into the address list cursor and -ECONNABORTED
returned in afs_make_call() if a remote abort happened rather than
translating the abort into an error message. This allows actions to
be taken depending on the abort code more easily.
(a) If a VMOVED abort is seen then this is handled by rechecking the
volume and restarting the iteration.
(b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
handled by sleeping for a short period and retrying and/or trying
other servers that might serve that volume. A message is also
displayed once until the condition has cleared.
(c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
moment.
(d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
see if it has been deleted; if not, the fileserver is probably
indicating that the volume couldn't be attached and needs
salvaging.
(e) If statfs() sees one of these aborts, it does not sleep, but
rather returns an error, so as not to block the umount program.
(5) The fileserver iteration functions in vnode.c are now merged into
their callers and more heavily macroised around the cursor. vnode.c
is removed.
(6) Operations on a particular vnode are serialised on that vnode because
the server will lock that vnode whilst it operates on it, so a second
op sent will just have to wait.
(7) Fileservers are probed with FS.GetCapabilities before being used.
This is where service upgrade will be done.
(8) A callback interest on a fileserver is set up before an FS operation
is performed and passed through to afs_make_call() so that it can be
set on the vnode if the operation returns a callback. The callback
interest is passed through to afs_iget() also so that it can be set
there too.
In general, record updating is done on an as-needed basis when we try to
access servers, volumes or vnodes rather than offloading it to work items
and special threads.
Notes:
(1) Pre AFS-3.4 servers are no longer supported, though this can be added
back if necessary (AFS-3.4 was released in 1998).
(2) VBUSY is retried forever for the moment at intervals of 1s.
(3) /proc/fs/afs/<cell>/servers no longer exists.
Signed-off-by: David Howells <dhowells@redhat.com>
Add an RCU replaceable address list structure to hold a list of server
addresses. The list also holds the
To this end:
(1) A cell's VL server address list can be loaded directly via insmod or
echo to /proc/fs/afs/cells or dynamically from a DNS query for AFSDB
or SRV records.
(2) Anyone wanting to use a cell's VL server address must wait until the
cell record comes online and has tried to obtain some addresses.
(3) An FS server's address list, for the moment, has a single entry that
is the key to the server list. This will change in the future when a
server is instead keyed on its UUID and the VL.GetAddrsU operation is
used.
(4) An 'address cursor' concept is introduced to handle iteration through
the address list. This is passed to the afs_make_call() as, in the
future, stuff (such as abort code) that doesn't outlast the call will
be returned in it.
In the future, we might want to annotate the list with information about
how each address fares. We might then want to propagate such annotations
over address list replacement.
Whilst we're at it, we allow IPv6 addresses to be specified in
colon-delimited lists by enclosing them in square brackets.
Signed-off-by: David Howells <dhowells@redhat.com>
Overhaul the AFS callback handling by the following means:
(1) Don't give up callback promises on vnodes that we are no longer using,
rather let them just expire on the server or let the server break
them. This is actually more efficient for the server as the callback
lookup is expensive if there are lots of extant callbacks.
(2) Only give up the callback promises we have from a server when the
server record is destroyed. Then we can just give up *all* the
callback promises on it in one go.
(3) Servers can end up being shared between cells if cells are aliased, so
don't add all the vnodes being backed by a particular server into a
big FID-indexed tree on that server as there may be duplicates.
Instead have each volume instance (~= superblock) register an interest
in a server as it starts to make use of it and use this to allow the
processor for callbacks from the server to find the superblock and
thence the inode corresponding to the FID being broken by means of
ilookup_nowait().
(4) Rather than iterating over the entire callback list when a mass-break
comes in from the server, maintain a counter of mass-breaks in
afs_server (cb_seq) and make afs_validate() check it against the copy
in afs_vnode.
It would be nice not to have to take a read_lock whilst doing this,
but that's tricky without using RCU.
(5) Save a ref on the fileserver we're using for a call in the afs_call
struct so that we can access its cb_s_break during call decoding.
(6) Write-lock around callback and status storage in a vnode and read-lock
around getattr so that we don't see the status mid-update.
This has the following consequences:
(1) Data invalidation isn't seen until someone calls afs_validate() on a
vnode. Unfortunately, we need to use a key to query the server, but
getting one from a background thread is tricky without caching loads
of keys all over the place.
(2) Mass invalidation isn't seen until someone calls afs_validate().
(3) Callback breaking is going to hit the inode_hash_lock quite a bit.
Could this be replaced with rcu_read_lock() since inodes are destroyed
under RCU conditions.
Signed-off-by: David Howells <dhowells@redhat.com>
Allow VL server specifications to be given IPv6 addresses as well as IPv4
addresses, for example as:
echo add foo.org 1111:2222:3333:0:4444:5555:6666:7777 >/proc/fs/afs/cells
Note that ':' is the expected separator for separating IPv4 addresses, but
if a ',' is detected or no '.' is detected in the string, the delimiter is
switched to ','.
This also works with DNS AFSDB or SRV record strings fetched by upcall from
userspace.
Signed-off-by: David Howells <dhowells@redhat.com>
Keep and pass sockaddr_rxrpc addresses around rather than keeping and
passing in_addr addresses to allow for the use of IPv6 and non-standard
port numbers in future.
This also allows the port and service_id fields to be removed from the
afs_call struct.
Signed-off-by: David Howells <dhowells@redhat.com>
Push the network namespace pointer to more places in AFS, including the
afs_server structure (which doesn't hold a ref on the netns).
In particular, afs_put_cell() now takes requires a net ns parameter so that
it can safely alter the netns after decrementing the cell usage count - the
cell will be deallocated by a background thread after being cached for a
period, which means that it's not safe to access it after reducing its
usage count.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix server reaping and make sure it's all done before we start trying to
purge cells, given that servers currently pin cells.
Signed-off-by: David Howells <dhowells@redhat.com>
Lay the groundwork for supporting network namespaces (netns) to the AFS
filesystem by moving various global features to a network-namespace struct
(afs_net) and providing an instance of this as a temporary global variable
that everything uses via accessor functions for the moment.
The following changes have been made:
(1) Store the netns in the superblock info. This will be obtained from
the mounter's nsproxy on a manual mount and inherited from the parent
superblock on an automount.
(2) The cell list is made per-netns. It can be viewed through
/proc/net/afs/cells and also be modified by writing commands to that
file.
(3) The local workstation cell is set per-ns in /proc/net/afs/rootcell.
This is unset by default.
(4) The 'rootcell' module parameter, which sets a cell and VL server list
modifies the init net namespace, thereby allowing an AFS root fs to be
theoretically used.
(5) The volume location lists and the file lock manager are made
per-netns.
(6) The AF_RXRPC socket and associated I/O bits are made per-ns.
The various workqueues remain global for the moment.
Changes still to be made:
(1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced
from the old name.
(2) A per-netns subsys needs to be registered for AFS into which it can
store its per-netns data.
(3) Rather than the AF_RXRPC socket being opened on module init, it needs
to be opened on the creation of a superblock in that netns.
(4) The socket needs to be closed when the last superblock using it is
destroyed and all outstanding client calls on it have been completed.
This prevents a reference loop on the namespace.
(5) It is possible that several namespaces will want to use AFS, in which
case each one will need its own UDP port. These can either be set
through /proc/net/afs/cm_port or the kernel can pick one at random.
The init_ns gets 7001 by default.
Other issues that need resolving:
(1) The DNS keyring needs net-namespacing.
(2) Where do upcalls go (eg. DNS request-key upcall)?
(3) Need something like open_socket_in_file_ns() syscall so that AFS
command line tools attempting to operate on an AFS file/volume have
their RPC calls go to the right place.
Signed-off-by: David Howells <dhowells@redhat.com>
get_seconds() returns real wall-clock seconds. On 32-bit systems
this value will overflow in year 2038 and beyond. This patch changes
afs's vlocation record to use ktime_get_real_seconds() instead, for the
fields time_of_death and update_at.
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Provide a function so that kernel users, such as AFS, can ask for the peer
address of a call:
void rxrpc_kernel_get_peer(struct rxrpc_call *call,
struct sockaddr_rxrpc *_srx);
In the future the kernel service won't get sk_buffs to look inside.
Further, this allows us to hide any canonicalisation inside AF_RXRPC for
when IPv6 support is added.
Also propagate this through to afs_find_server() and issue a warning if we
can't handle the address family yet.
Signed-off-by: David Howells <dhowells@redhat.com>
Convert delayed_work users doing cancel_delayed_work() followed by
queue_delayed_work() to mod_delayed_work().
Most conversions are straight-forward. Ones worth mentioning are,
* drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
use mod_delayed_work() and cancel loop in
edac_mc_reset_delay_period() is dropped.
* drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
watchdog is active or not. @fan_watchdog_active and related code
dropped.
* drivers/power/charger-manager.c: Seemingly a lot of
delayed_work_pending() abuse going on here.
[delayed_]work_pending() are unsynchronized and racy when used like
this. I converted one instance in fullbatt_handler(). Please
conver the rest so that it invokes workqueue APIs for the intended
target state rather than trying to game work item pending state
transitions. e.g. if timer should be modified - call
mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().
* drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
simplified. Note that round_jiffies() calls in this function are
meaningless. round_jiffies() work on absolute jiffies not delta
delay used by delayed_work.
v2: Tomi pointed out that __cancel_delayed_work() users can't be
safely converted to mod_delayed_work(). They could be calling it
from irq context and if that happens while delayed_work_timer_fn()
is running, it could deadlock. __cancel_delayed_work() users are
dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Roland Dreier <roland@kernel.org>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
flush_scheduled_work() is going away. afs needs to make sure all the
works it has queued have finished before being unloaded and there can
be arbitrary number of pending works. Add afs_wq and use it as the
flush domain instead of the system workqueue.
Also, convert cancel_delayed_work() + flush_scheduled_work() to
cancel_delayed_work_sync() in afs_mntpt_kill_timer().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: linux-afs@lists.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a possible null pointer dereference in afs_alloc_server(): the server
pointer is NULL if there was an allocation failure, and under such a
condition, we can't dereference it in the _leave() statement.
Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch contains the following possible cleanups:
- make the following needlessly global functions static:
- rxrpc.c: afs_send_pages()
- vlocation.c: afs_vlocation_queue_for_updates()
- write.c: afs_writepages_region()
- make the following needlessly global variables static:
- mntpt.c: afs_mntpt_expiry_timeout
- proc.c: afs_vlocation_states[]
- server.c: afs_server_timeout
- vlocation.c: afs_vlocation_timeout
- vlocation.c: afs_vlocation_update_timeout
- #if 0 the following unused function:
- cell.c: afs_get_cell_maybe()
- #if 0 the following unused variables:
- callback.c: afs_vnode_update_timeout
- cmservice.c: struct afs_cm_workqueue
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make some miscellaneous changes to the AFS filesystem:
(1) Assert RCU barriers on module exit to make sure RCU has finished with
callbacks in this module.
(2) Correctly handle the AFS server returning a zero-length read.
(3) Split out data zapping calls into one function (afs_zap_data).
(4) Rename some afs_file_*() functions to afs_*() where they apply to
non-regular files too.
(5) Be consistent about the presentation of volume ID:vnode ID in debugging
output.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for the create, link, symlink, unlink, mkdir, rmdir and
rename VFS operations to the in-kernel AFS filesystem.
Also:
(1) Fix dentry and inode revalidation. d_revalidate should only look at
state of the dentry. Revalidation of the contents of an inode pointed to
by a dentry is now separate.
(2) Fix afs_lookup() to hash negative dentries as well as positive ones.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the in-kernel AFS filesystem use AF_RXRPC instead of the old RxRPC code.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up the AFS sources.
Also remove references to AFS keys. RxRPC keys are used instead.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts the combination of list_del(A) and list_add(A, B) to
list_move(A, B) under fs/.
Cc: Ian Kent <raven@themaw.net>
Acked-by: Joel Becker <joel.becker@oracle.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Hans Reiser <reiserfs-dev@namesys.com>
Cc: Urban Widmark <urban@teststation.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!