blkdev_issue_discard() will zero a given block range. This is done by
way of explicit writing, thus provisioning or allocating the blocks on
disk.
There are use cases where the desired behavior is to zero the blocks but
unprovision them if possible. The blocks must deterministically contain
zeroes when they are subsequently read back.
This patch adds a flag to blkdev_issue_zeroout() that provides this
variant. If the discard flag is set and a block device guarantees
discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
the device does not support discard_zeroes_data or if the discard
request fails we will fall back to first REQ_WRITE_SAME and then a
regular REQ_WRITE.
Also update the callers of blkdev_issue_zero() to reflect the new flag
and make sb_issue_zeroout() prefer the discard approach.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block layer driver updates from Jens Axboe:
- NVMe updates:
- The blk-mq conversion from Matias (and others)
- A stack of NVMe bug fixes from the nvme tree, mostly from Keith.
- Various bug fixes from me, fixing issues in both the blk-mq
conversion and generic bugs.
- Abort and CPU online fix from Sam.
- Hot add/remove fix from Indraneel.
- A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp)
- With the generic IO stat accounting from 3.19/core, converting md,
bcache, and rsxx to use those. From Gu Zheng.
- Boundary check for queue/irq mode for null_blk from Matias. Fixes
cases where invalid values could be given, causing the device to hang.
- The xen blkfront pull request, with two bug fixes from Vitaly.
* 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits)
NVMe: fix race condition in nvme_submit_sync_cmd()
NVMe: fix retry/error logic in nvme_queue_rq()
NVMe: Fix FS mount issue (hot-remove followed by hot-add)
NVMe: fix error return checking from blk_mq_alloc_request()
NVMe: fix freeing of wrong request in abort path
xen/blkfront: remove redundant flush_op
xen/blkfront: improve protection against issuing unsupported REQ_FUA
NVMe: Fix command setup on IO retry
null_blk: boundary check queue_mode and irqmode
block/rsxx: use generic io stats accounting functions to simplify io stat accounting
md: use generic io stats accounting functions to simplify io stat accounting
drbd: use generic io stats accounting functions to simplify io stat accounting
md/bcache: use generic io stats accounting functions to simplify io stat accounting
NVMe: Update module version major number
NVMe: fail pci initialization if the device doesn't have any BARs
NVMe: add ->exit_hctx() hook
NVMe: make setup work for devices that don't do INTx
NVMe: enable IO stats by default
NVMe: nvme_submit_async_admin_req() must use atomic rq allocation
NVMe: replace blk_put_request() with blk_mq_free_request()
...
Old backward-compat cruft
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A connection timeout affects all volumes of a resource!
Under the following conditions:
A resource with multiple volumes
AND
ko-count >=1
AND
a write request triggers the timeout (ko-count * timeout)
DRBD's internal state gets confused. That in turn may
lead to very miss leading follow up failures. E.g.
"BUG: scheduling while atomic"
CC: stable@kernel.org # v3.17
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This was not noticed for many years. Affects operation if
md raid is used a backing device for DRBD.
CC: stable@kernel.org # v3.2+
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If for some reason DRBD resync was the only activity on a backend
device, drbd_rs_c_min_rate_throttle() would mistakenly decide that it is
still initialization time, and keep throttling the resync.
This patch explicitly initializes ->rs_last_events to the current
backend event counters, and drops the rs_last_events == 0 from the
throttle condition.
Reported-by: Mikhail Sugakov <msugakov@amazon.de>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Symptoms:
If DRBD was "cleanly shut down" (all in sync, both Secondary before
disconnect, identical data generation uuids), and then one side was
promoted *during* the next connection handshake, the role change
could confuse the handshake.
The Primary would get stuck in WFBitmapS, the Secondary would log
unexpected cstate (Connected) in receive_bitmap
and get stuck in WFBitmapT.
Fix:
The test in is_valid_soft_transition wrong. It works because
the not allowed actions (promote/attach) do not touch the
cstate. The previous condition failed to demand a cstate change
in one clause.
In order to avoid deadlocks give up the state_mutex while waiting
for the transient state to go away.
Conflicts:
drbd/drbd_state.c
drbd/drbd_state.h
drbd/drbd_wrappers.h
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Avoid generic netlink calls in other parts of the code base.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The original code are the same as RB_DECLARE_CALLBACKS().
CC: Michel Lespinasse <walken@google.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 98683650 "Merge branch 'drbd-8.4_ed6' into
for-3.8-drivers-drbd-8.4_ed6" switches to the new augment API, but the
new API requires that the tree is augmented before rb_insert_augmented()
is called, which is missing.
So we add the augment-code to drbd_insert_interval() when it travels the
tree up to down before rb_insert_augmented(). See the example in
include/linux/interval_tree_generic.h or Documentation/rbtree.txt.
drbd_insert_interval() may cancel the insertion when traveling, in this
case, the just added augment-code does nothing before cancel since the
@this node is already in the subtrees in this case.
CC: Michel Lespinasse <walken@google.com>
CC: stable@kernel.org # v3.10+
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Was broken in 2010 with commit 4b0715f096
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The worker may now dequeue work items in batches.
This should reduce lock contention during busy periods.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Shorten receive path in the asender thread. Reduces CPU utilisation
of asender when receiving packets, and with that increases IOPs.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This macro doesn't add any value; just use test_bit() instead.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
These macros can easily be replaced with its definition.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
request_timer_fn() takes resource->req_lock via the device and releases it via
the connection. Avoid this as it is confusing static code checkers.
Reported-by: "Dan Carpenter" <dan.carpenter@oracle.com>
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
See "drbd/resources/*/volumes/*/resync_extents".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now they follow the _endio naming sheme.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Rename local variable 'ds' to 'disk_state' or 'data_size'.
'dgs' to 'digest_size'
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block driver changes from Jens Axboe:
"Nothing out of the ordinary here, this pull request contains:
- A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
and Surbhi Palande. No new features, just a lot of fixes.
- The usual round of drbd updates from Andreas Gruenbacher, Lars
Ellenberg, and Philipp Reisner.
- virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
has taken it one step further and added support for actually using
more than one queue.
- Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
compliment the the default behavior of adding to the tail of the
queue. From Douglas Gilbert"
* 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
bcache: Drop unneeded blk_sync_queue() calls
bcache: add mutex lock for bch_is_open
bcache: Correct printing of btree_gc_max_duration_ms
bcache: try to set b->parent properly
bcache: fix memory corruption in init error path
bcache: fix crash with incomplete cache set
bcache: Fix more early shutdown bugs
bcache: fix use-after-free in btree_gc_coalesce()
bcache: Fix an infinite loop in journal replay
bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
bcache: bcache_write tracepoint was crashing
bcache: fix typo in bch_bkey_equal_header
bcache: Allocate bounce buffers with GFP_NOWAIT
bcache: Make sure to pass GFP_WAIT to mempool_alloc()
bcache: fix uninterruptible sleep in writeback thread
bcache: wait for buckets when allocating new btree root
bcache: fix crash on shutdown in passthrough mode
bcache: fix lockdep warnings on shutdown
bcache allocator: send discards with correct size
bcache: Fix to remove the rcu_sched stalls.
...
My static checker warns that "data_size" could be negative and underflow
the limit check. The code looks suspicious but I don't know if it is a
real bug.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Don't error out with misleading "out of memory"
if the cpu-mask has more bits set than there are CPUs.
Just truncate to nr_cpu_ids implicitly.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
size is always 4096,
page is always device->md_io.page.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
During resync, if we need to block some specific incoming write because
of active resync requests to that same range, we potentially caused
*all* new application writes (to "cold" activity log extents) to block
until this one request has been processed.
Improve the do_submit() logic to
* grab all incoming requests to some "incoming" list
* process this list
- move aside requests that are blocked by resync
- prepare activity log transactions,
- commit transactions and submit corresponding requests
- if there are remaining requests that only wait for
activity log extents to become free, stop the fast path
(mark activity log as "starving")
- iterate until no more requests are waiting for the activity log,
but all potentially remaining requests are only blocked by resync
* only then grab new incoming requests
That way, very busy IO on currently "hot" activity log extents cannot
starve scattered IO to "cold" extents. And blocked-by-resync requests
are processed once resync traffic on the affected region has ceased,
without blocking anything else.
The only blocking mode left is when we cannot start requests to "cold"
extents because all currently "hot" extents are actually used.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The data generation identifiers used to be exposed via sysfs
at /sys/block/drbdX/drbd/meta_data/data_gen_id (out-of-tree),
for advanced policy scripting.
Bring that information over to debugfs.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Information of former /sys/block/drbdX/drbd/oldest_requests
is already with higher detail in these files:
debugfs/drbd/resource/$name/in_flight_summary,
debugfs/drbd/resource/$name/volumes/$vnr/oldest_requests
This patch adds
debugfs/drbd/resource/$name/connections/peer/oldest_requests
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Make the first line of debugfs files a version number,
starting now with "v: 0".
If we change content of presentation, we will bump that.
Monitoring or diagnostic scritps that may parse these files
can then easily know when they need to be reviewed.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Show oldest requests
* pending master bio completion and,
* if different, local disk bio completion.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Add a per-connection worker thread callback_history
with timing details, call site and callback function.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* Add details about pending meta data operations to in_flight_summary.
* Report number of requests waiting for activity log transactions.
* timing details of peer_requests to in_flight_summary.
* FLUSH details
DRBD devides the incoming request stream into "epochs",
in which peers are allowed to re-order writes independendly.
These epochs are separated by P_BARRIER on the replication link.
Such barrier packets, depending on configuration, may cause
the receiving side to drain the lower level device request queues
and call blkdev_issue_flush().
This is known to be an other major source of latency in DRBD.
Track timing details of calls to blkdev_issue_flush(),
and add them to in_flight_summary.
* data socket stats
To be able to diagnose bottlenecks and root causes of "slow" IO on DRBD,
it is useful to see network buffer stats along with the timing details of
requests, peer requests, and meta data IO.
* pending bitmap IO timing details to in_flight_summary.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Try to close the race between open() and debugfs_remove_recursive()
from inside an object destructor.
Once open succeeds, the object should stay around.
Open should not succeed if the object has already reached its destructor.
This may be overkill, but to make that happen, we check for existence of
a parent directory, "stale-ness" of "this" dentry, and serialize
kref_get_unless_zero() on the outermost object relevant for this file
with d_delete() on this dentry (using the parent's i_mutex).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
To help diagnosing "high latency" or "hung" IO situations on DRBD,
present per drbd resource group a summary of operations currently in progress.
First item is a list of oldest drbd_request objects
waiting for various things:
* still being prepared
* waiting for activity log transaction
* waiting for local disk
* waiting to be sent
* waiting for peer acknowledgement ("receive ack", "write ack")
* waiting for peer epoch acknowledgement ("barrier ack")
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Add new debugfs hierarchy /sys/kernel/debug/
drbd/
resources/
$resource_name/connections/peer/$volume_number/
$resource_name/volumes/$volume_number/
minors/$minor_number -> ../resources/$resource_name/volumes/$volume_number/
Followup commits will populate this hierarchy with files containing
statistics, diagnostic information and some attribute data.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Track start and submit time of bitmap operations, and
add pending bitmap IO contexts to a new pending_bitmap_io list.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Initialize peer_request with timestamp and proper empty list head.
Add peer_request to list early, so debugfs can find this request and
report it as "preparing", even if we sleep before we actually submit it.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
To be able to present timing details in debugfs,
we need to track preparation/submit times of peer requests.
Track peer request flags early,
before they are put on the epoch_entry lists.
Waiting for activity log transactions may be a major latency factor.
We want to be able to present the peer_request state accurately in
debugfs, and what it is waiting for.
Consistently mark/unmark peer requests with EE_CALL_AL_COMPLETE_IO.
Set it only *after* calling drbd_al_begin_io(),
clear it as soon as we call drbd_al_complete_io().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Background resynchronisation does some "side-stepping", or throttles
itself, if it detects application IO activity, and the current resync
rate estimate is above the configured "cmin-rate".
What was not detected: if there is no application IO,
because it blocks on activity log transactions.
Introduce a new atomic_t ap_actlog_cnt, tracking such blocked requests,
and count non-zero as application IO activity.
This counter is exposed at proc_details level 2 and above.
Also make sure to release the currently locked resync extent
if we side-step due to such voluntary throttling.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
A request that is to be shipped to the peer goes through a few stages:
- queued
- sent, waiting for ack
- ack received, waiting for "barrier ack", which is re-order epoch being
closed on the peer by acknowledging a "cache flush" equivalent
on the lower level device.
In the later two stages, depending on protocol, we may have already
completed this request to the upper layers, so it won't be found anymore
on device->pending_master_completion[] lists.
Track the oldest request yet to be sent (req_next), the oldest not yet
acknowledged (req_ack_pending) and the oldest "still waiting for
something from the peer" (req_not_net_done), doing short list walks on
the transfer log to find the next pending one whenever such a request
makes progress.
Now we have a fast way to look up the oldest requests,
don't do a transfer log walk every time.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Adding requests to per-device fifo lists as soon as possible after
allocating them leaves a simple list_first_entry_or_null() to find the
oldest request, regardless what it is still waiting for.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Record (in jiffies) how much time a request spends in which stages.
Followup commits will use and present this additional timing information
so we can better locate and tackle the root causes of latency spikes,
or present the backlog for asynchronous replication.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
For diagnostic purposes, track intent, start time
and latest submit time of meta data IO.
Move separate members from struct drbd_device
into the embeded struct drbd_md_io.
s/md_io_(page|in_use)/md_io.\1/
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd_destroy_device means to give up reference counts
on the connection(s) reachable via the peer_device(s).
It must not do that by iterating via device->resource->connections,
resource and connections may have already been disassociated
by drbd_free_resource, and we'd leak connection refs.
Instead, iterate via device->peer_devices->connection.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Now that we have additional asynchronous kref_get/kref_put
via debugfs, make sure we catch access after free.
Poison struct drbd_device, drbd_connection and drbd_resource
before kfree() with 0xfd, 0xfc, and 0xf2, respectively.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
To be able to find and present such zero-out fallback peer_requests
in debugfs, we add those to "active_ee", once that list drained.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Keep the epoch entry lists (active_ee, read_ee, sync_ee, ...)
consistently "oldest first". That way finding the oldest not yet
successfully processed request is simply list_first_entry_or_null.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The only user of drbd_md_flush was bm_rw(),
and it is always followed by either a drbd_md_sync(),
or an al_write_transaction(), which, if so configured,
both end up submiting a FLUSH|FUA request anyways.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We sometimes do
if (list_empty(&w.list))
drbd_queue_work(&q, &w.list);
Removal (list_del_init) may happen outside all locks, after all
pending work entries have been moved to an on-stack local work list.
For not dynamically allocated, but embeded, work structs,
we must avoid to re-add until it really was removed.
Move that list_empty check inside the spin_lock(&q->q_lock)
within the helper function, and change to list_empty_careful().
This may have been the reason for a list_add corruption
inside drbd_queue_work().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>