Pull block driver changes from Jens Axboe:
"Nothing out of the ordinary here, this pull request contains:
- A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
and Surbhi Palande. No new features, just a lot of fixes.
- The usual round of drbd updates from Andreas Gruenbacher, Lars
Ellenberg, and Philipp Reisner.
- virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
has taken it one step further and added support for actually using
more than one queue.
- Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
compliment the the default behavior of adding to the tail of the
queue. From Douglas Gilbert"
* 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
bcache: Drop unneeded blk_sync_queue() calls
bcache: add mutex lock for bch_is_open
bcache: Correct printing of btree_gc_max_duration_ms
bcache: try to set b->parent properly
bcache: fix memory corruption in init error path
bcache: fix crash with incomplete cache set
bcache: Fix more early shutdown bugs
bcache: fix use-after-free in btree_gc_coalesce()
bcache: Fix an infinite loop in journal replay
bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
bcache: bcache_write tracepoint was crashing
bcache: fix typo in bch_bkey_equal_header
bcache: Allocate bounce buffers with GFP_NOWAIT
bcache: Make sure to pass GFP_WAIT to mempool_alloc()
bcache: fix uninterruptible sleep in writeback thread
bcache: wait for buckets when allocating new btree root
bcache: fix crash on shutdown in passthrough mode
bcache: fix lockdep warnings on shutdown
bcache allocator: send discards with correct size
bcache: Fix to remove the rcu_sched stalls.
...
My static checker warns that "data_size" could be negative and underflow
the limit check. The code looks suspicious but I don't know if it is a
real bug.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Don't error out with misleading "out of memory"
if the cpu-mask has more bits set than there are CPUs.
Just truncate to nr_cpu_ids implicitly.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
size is always 4096,
page is always device->md_io.page.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
During resync, if we need to block some specific incoming write because
of active resync requests to that same range, we potentially caused
*all* new application writes (to "cold" activity log extents) to block
until this one request has been processed.
Improve the do_submit() logic to
* grab all incoming requests to some "incoming" list
* process this list
- move aside requests that are blocked by resync
- prepare activity log transactions,
- commit transactions and submit corresponding requests
- if there are remaining requests that only wait for
activity log extents to become free, stop the fast path
(mark activity log as "starving")
- iterate until no more requests are waiting for the activity log,
but all potentially remaining requests are only blocked by resync
* only then grab new incoming requests
That way, very busy IO on currently "hot" activity log extents cannot
starve scattered IO to "cold" extents. And blocked-by-resync requests
are processed once resync traffic on the affected region has ceased,
without blocking anything else.
The only blocking mode left is when we cannot start requests to "cold"
extents because all currently "hot" extents are actually used.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The data generation identifiers used to be exposed via sysfs
at /sys/block/drbdX/drbd/meta_data/data_gen_id (out-of-tree),
for advanced policy scripting.
Bring that information over to debugfs.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Information of former /sys/block/drbdX/drbd/oldest_requests
is already with higher detail in these files:
debugfs/drbd/resource/$name/in_flight_summary,
debugfs/drbd/resource/$name/volumes/$vnr/oldest_requests
This patch adds
debugfs/drbd/resource/$name/connections/peer/oldest_requests
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Make the first line of debugfs files a version number,
starting now with "v: 0".
If we change content of presentation, we will bump that.
Monitoring or diagnostic scritps that may parse these files
can then easily know when they need to be reviewed.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Show oldest requests
* pending master bio completion and,
* if different, local disk bio completion.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Add a per-connection worker thread callback_history
with timing details, call site and callback function.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* Add details about pending meta data operations to in_flight_summary.
* Report number of requests waiting for activity log transactions.
* timing details of peer_requests to in_flight_summary.
* FLUSH details
DRBD devides the incoming request stream into "epochs",
in which peers are allowed to re-order writes independendly.
These epochs are separated by P_BARRIER on the replication link.
Such barrier packets, depending on configuration, may cause
the receiving side to drain the lower level device request queues
and call blkdev_issue_flush().
This is known to be an other major source of latency in DRBD.
Track timing details of calls to blkdev_issue_flush(),
and add them to in_flight_summary.
* data socket stats
To be able to diagnose bottlenecks and root causes of "slow" IO on DRBD,
it is useful to see network buffer stats along with the timing details of
requests, peer requests, and meta data IO.
* pending bitmap IO timing details to in_flight_summary.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Try to close the race between open() and debugfs_remove_recursive()
from inside an object destructor.
Once open succeeds, the object should stay around.
Open should not succeed if the object has already reached its destructor.
This may be overkill, but to make that happen, we check for existence of
a parent directory, "stale-ness" of "this" dentry, and serialize
kref_get_unless_zero() on the outermost object relevant for this file
with d_delete() on this dentry (using the parent's i_mutex).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
To help diagnosing "high latency" or "hung" IO situations on DRBD,
present per drbd resource group a summary of operations currently in progress.
First item is a list of oldest drbd_request objects
waiting for various things:
* still being prepared
* waiting for activity log transaction
* waiting for local disk
* waiting to be sent
* waiting for peer acknowledgement ("receive ack", "write ack")
* waiting for peer epoch acknowledgement ("barrier ack")
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Add new debugfs hierarchy /sys/kernel/debug/
drbd/
resources/
$resource_name/connections/peer/$volume_number/
$resource_name/volumes/$volume_number/
minors/$minor_number -> ../resources/$resource_name/volumes/$volume_number/
Followup commits will populate this hierarchy with files containing
statistics, diagnostic information and some attribute data.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Track start and submit time of bitmap operations, and
add pending bitmap IO contexts to a new pending_bitmap_io list.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Initialize peer_request with timestamp and proper empty list head.
Add peer_request to list early, so debugfs can find this request and
report it as "preparing", even if we sleep before we actually submit it.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
To be able to present timing details in debugfs,
we need to track preparation/submit times of peer requests.
Track peer request flags early,
before they are put on the epoch_entry lists.
Waiting for activity log transactions may be a major latency factor.
We want to be able to present the peer_request state accurately in
debugfs, and what it is waiting for.
Consistently mark/unmark peer requests with EE_CALL_AL_COMPLETE_IO.
Set it only *after* calling drbd_al_begin_io(),
clear it as soon as we call drbd_al_complete_io().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Background resynchronisation does some "side-stepping", or throttles
itself, if it detects application IO activity, and the current resync
rate estimate is above the configured "cmin-rate".
What was not detected: if there is no application IO,
because it blocks on activity log transactions.
Introduce a new atomic_t ap_actlog_cnt, tracking such blocked requests,
and count non-zero as application IO activity.
This counter is exposed at proc_details level 2 and above.
Also make sure to release the currently locked resync extent
if we side-step due to such voluntary throttling.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
A request that is to be shipped to the peer goes through a few stages:
- queued
- sent, waiting for ack
- ack received, waiting for "barrier ack", which is re-order epoch being
closed on the peer by acknowledging a "cache flush" equivalent
on the lower level device.
In the later two stages, depending on protocol, we may have already
completed this request to the upper layers, so it won't be found anymore
on device->pending_master_completion[] lists.
Track the oldest request yet to be sent (req_next), the oldest not yet
acknowledged (req_ack_pending) and the oldest "still waiting for
something from the peer" (req_not_net_done), doing short list walks on
the transfer log to find the next pending one whenever such a request
makes progress.
Now we have a fast way to look up the oldest requests,
don't do a transfer log walk every time.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Adding requests to per-device fifo lists as soon as possible after
allocating them leaves a simple list_first_entry_or_null() to find the
oldest request, regardless what it is still waiting for.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Record (in jiffies) how much time a request spends in which stages.
Followup commits will use and present this additional timing information
so we can better locate and tackle the root causes of latency spikes,
or present the backlog for asynchronous replication.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
For diagnostic purposes, track intent, start time
and latest submit time of meta data IO.
Move separate members from struct drbd_device
into the embeded struct drbd_md_io.
s/md_io_(page|in_use)/md_io.\1/
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd_destroy_device means to give up reference counts
on the connection(s) reachable via the peer_device(s).
It must not do that by iterating via device->resource->connections,
resource and connections may have already been disassociated
by drbd_free_resource, and we'd leak connection refs.
Instead, iterate via device->peer_devices->connection.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Now that we have additional asynchronous kref_get/kref_put
via debugfs, make sure we catch access after free.
Poison struct drbd_device, drbd_connection and drbd_resource
before kfree() with 0xfd, 0xfc, and 0xf2, respectively.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
To be able to find and present such zero-out fallback peer_requests
in debugfs, we add those to "active_ee", once that list drained.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Keep the epoch entry lists (active_ee, read_ee, sync_ee, ...)
consistently "oldest first". That way finding the oldest not yet
successfully processed request is simply list_first_entry_or_null.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The only user of drbd_md_flush was bm_rw(),
and it is always followed by either a drbd_md_sync(),
or an al_write_transaction(), which, if so configured,
both end up submiting a FLUSH|FUA request anyways.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We sometimes do
if (list_empty(&w.list))
drbd_queue_work(&q, &w.list);
Removal (list_del_init) may happen outside all locks, after all
pending work entries have been moved to an on-stack local work list.
For not dynamically allocated, but embeded, work structs,
we must avoid to re-add until it really was removed.
Move that list_empty check inside the spin_lock(&q->q_lock)
within the helper function, and change to list_empty_careful().
This may have been the reason for a list_add corruption
inside drbd_queue_work().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We try to limit the number of "in-flight" resync requests.
One condition for that is the amount of requested data should not exceed
half of what can be covered by our "max-buffers" setting.
However we compared number of 4k pages with number of in-flight 512 Byte
sectors, and this extra throttle triggered much earlier than intended.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If we lost a disk during the first resync after primary crash,
we could have prematurely cleared the CRASHED_PRIMARY flag.
Testing on C_CONNECTED is not what we meant there,
but testing for both peers to become D_UP_TO_DATE.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If we throttle resync because the socket sendbuffer is filling up,
tell TCP about it, so it may expand the sendbuffer for us.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Just about all of these have been converted to __func__,
so convert the last uses.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)
The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.
And in the case of the NULL pointer, there is no structure to initialize.
So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL)
Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If we already "pulled ahead", we can short-circuit,
and avoid logging the same messages over and over again.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If "dirty" blocks are written to during resync,
that brings them in-sync.
By explicitly requesting write-acks during resync even in protocol != C,
we now can actually respect this.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
In setups involving a DRBD-proxy and connections that experience a lot of
buffer-bloat it might be necessary to set ping-timeout to an
unusual high value. By default DRBD uses the same value to wait if a newly
established TCP-connection is stable. Since the DRBD-proxy is usually located
in the same data center such a long wait time may hinder DRBD's connect process.
In such setups socket-check-timeout should be set to
at least to the round trip time between DRBD and DRBD-proxy. I.e. in most
cases to 1.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Before the patch
'drbd: Keep the listening socket open while trying to connect to the peer'
the newly created socket inherited the receive timeout from the listen
socket. The listen socket had a receive timeout of connect-intervall
+- 30% random jitter.
The real issue is that after the mentioned patch we had no timeout at all.
Now use 4 times the ping-timeout.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Checksum based resync trades CPU cycles for network bandwidth,
in situations where we expect much of the to-be-resynced blocks
to be actually identical on both sides already.
In a "network hickup" scenario, it won't help:
all to-be-resynced blocks will typically be different.
The use case is for the resync of *potentially* different blocks
after crash recovery -- the crash recovery had marked larger areas
(those covered by the activity log) as need-to-be-resynced,
just in case. Most of those blocks will be identical.
This option makes it possible to configure checksum based resync,
but only actually use it for the first resync after primary crash.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
During handshake, we compare backend sizes, and user set limits,
and agree on what device size we are going to expose.
We remember that last-agreed-size in our meta data.
But if we come up diskless, we have to accept what the peer
presents us with. We used to accept the peers maximum potential
capacity (backend size), which is wrong, and could lead to IO errors
due to access beyond end of device.
Instead, we need to accept the peer's current size.
Unless that is communicated as 0, in which case we
accept the backend size, or the user set limit, if set.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We intentionally do not serialize /proc/drbd access with
internal state changes or statistic updates.
Because of that, cat /proc/drbd may race with resync just being
finished, still see the sync state, and find information about
number of blocks still to go, but then find the total number
of blocks within this resync has just been reset to 0
when accessing it.
This now produces bogus numbers in the resync speed estimates.
Fix by accessing all relevant data only once,
and fixing it up if "still to go" happens to be more than "total".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Get rid of dump_stack() debug statements.
There is no point whatsoever in registering and unregistering a reboot
notifier that doesn't do anything.
The intention was to switch to an "emergency read-only" mode,
so we won't have to resync the full activity log just because
we had been Primary before the reboot.
Once we have that implemented, we may re-introduce the reboot notifier.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The textual representation of resync extents in /proc/drbd presented
with proc_details >= 3 was wrong, it used bitnumbers as bitmasks.
It was not particularly useful either, and I doubt anyone has even tried
to look at it in the last few years. Drop it.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The last user was al_write_transaction, if called with "delegate",
and the last user to call it with "delegate = true" was the receiver
thread, which has no need to delegate, but can call it himself.
Finally drop the delegate parameter, drop the extra
w_al_write_transaction callback, and drop drbd_queue_work_front.
Do not (yet) change dequeue_work_item to dequeue_work_batch, though.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
This replaces the md_sync_work member of struct drbd_device
by a new MD_SYNC "work bit" in device->flags.
This replaces the resync_start_work member of struct drbd_device
by a new RS_START "work bit" in device->flags.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The recent fix to put_ldev() (correct ordering of access to local_cnt
and state.disk; memory barrier in __drbd_set_state) guarantees
that the cleanup happens exactly once.
However it does not yet guarantee that the cleanup happens from worker
context, the last put_ldev() may still happen from atomic context,
which must not happen: blkdev_put() may sleep.
Fix this by scheduling the cleanup to the worker instead,
using a couple more bits in device->flags and a new helper,
drbd_device_post_work().
Generalized the "resync progress" work to cover these new work bits.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: bd_release+0x21/0x70
Process drbd_w_t7146
Call Trace:
close_bdev_exclusive
drbd_free_ldev [drbd]
drbd_ldev_destroy [drbd]
w_after_state_ch [drbd]
Race probably went like this:
state.disk = D_FAILED
... first one to hit zero during D_FAILED:
put_ldev() /* ----------------> 0 */
i = atomic_dec_return()
if (i == 0)
if (state.disk == D_FAILED)
schedule_work(go_diskless)
/* 1 <------ */ get_ldev_if_state()
go_diskless()
do_some_pre_cleanup() corresponding put_ldev():
force_state(D_DISKLESS) /* 0 <------ */ i = atomic_dec_return()
if (i == 0)
atomic_inc() /* ---------> 1 */
state.disk = D_DISKLESS
schedule_work(after_state_ch) /* execution pre-empted by IRQ ? */
after_state_ch()
put_ldev()
i = atomic_dec_return() /* 0 */
if (i == 0)
if (state.disk == D_DISKLESS) if (state.disk == D_DISKLESS)
drbd_ldev_destroy() drbd_ldev_destroy();
Trying to fix this by checking the disk state *before* the
atomic_dec_return(), which implies memory barriers, and by inserting
extra memory barriers around the state assignment in __drbd_set_state().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
For some reason we have assumed NOIDLE was implied
by one of the other flags we set. It is not (anymore?).
Explicitly set REQ_NOIDLE for synchronous meta data updates,
or we can seriously starve random writes when using CFQ.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
This probably does not have any real life impact,
but we should first persist any potentially new UUID
and other meta data flags, as well as our new role,
before we allow/disallow write access.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>