dm zoned: drive-managed zoned block device target

The dm-zoned device mapper target provides transparent write access to
zoned block devices (ZBC and ZAC compliant block devices). dm-zoned
hides from the device user (a file system or an application doing raw
block device accesses) any constraint imposed on write requests by the
device, equivalent to a drive-managed zoned block device model.

Write requests are processed using a combination of on-disk buffering
using the device conventional zones and direct in-place processing for
requests aligned to a zone sequential write pointer position. A
background reclaim process implemented using dm_kcopyd_copy ensures that
conventional zones are always available for executing unaligned write
requests. The reclaim process overhead is minimized by managing buffer
zones in a least-recently-written order and first targeting the oldest
buffer zones. Doing so, blocks under regular write access (such as
metadata blocks of a file system) remain stored in conventional zones,
resulting in no apparent overhead.

The dm-zoned implementation focuses on simplicity and on minimizing
overhead (CPU, memory and storage overhead). For a 14TB host-managed
disk with 256 MB zones, dm-zoned memory usage per disk instance is at
most about 3 MB and as little as 5 zones will be used internally for
storing metadata and performing buffer zone reclaim operations. This is
achieved using zone level indirection rather than a full block
indirection system for managing block movement between zones.

The primary target of dm-zoned is host-managed zoned block devices, but
it can also be used with host-aware device models to mitigate potential
device-side performance degradation due to excessive random writing.

Zoned block devices can be formatted and checked for use with the
dm-zoned target using the dmzadm utility available at:

https://github.com/hgst/dm-zoned-tools

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
[Mike Snitzer partly refactored Damien's original work to cleanup the code]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-06-07 15:55:39 +09:00 · 2017-06-07 15:55:39 +09:00 · 3b1a94c88b
--- a/Documentation/device-mapper/dm-zoned.txt
+++ b/Documentation/device-mapper/dm-zoned.txt
@ -0,0 +1,144 @@
 dm-zoned
 ========
 The dm-zoned device mapper target exposes a zoned block device (ZBC and
 ZAC compliant devices) as a regular block device without any write
 pattern constraints. In effect, it implements a drive-managed zoned
 block device which hides from the user (a file system or an application
 doing raw block device accesses) the sequential write constraints of
 host-managed zoned block devices and can mitigate the potential
 device-side performance degradation due to excessive random writes on
 host-aware zoned block devices.
 For a more detailed description of the zoned block device models and
 their constraints see (for SCSI devices):
 http://www.t10.org/drafts.htm#ZBC_Family
 and (for ATA devices):
 http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf
 The dm-zoned implementation is simple and minimizes system overhead (CPU
 and memory usage as well as storage capacity loss). For a 10TB
 host-managed disk with 256 MB zones, dm-zoned memory usage per disk
 instance is at most 4.5 MB and as little as 5 zones will be used
 internally for storing metadata and performing reclaim operations.
 dm-zoned target devices are formatted and checked using the dmzadm
 utility available at:
 https://github.com/hgst/dm-zoned-tools
 Algorithm
 =========
 dm-zoned implements an on-disk buffering scheme to handle non-sequential
 write accesses to the sequential zones of a zoned block device.
 Conventional zones are used for caching as well as for storing internal
 metadata.
 The zones of the device are separated into 2 types:
 1) Metadata zones: these are conventional zones used to store metadata.
 Metadata zones are not reported as useable capacity to the user.
 2) Data zones: all remaining zones, the vast majority of which will be
 sequential zones used exclusively to store user data. The conventional
 zones of the device may be used also for buffering user random writes.
 Data in these zones may be directly mapped to the conventional zone, but
 later moved to a sequential zone so that the conventional zone can be
 reused for buffering incoming random writes.
 dm-zoned exposes a logical device with a sector size of 4096 bytes,
 irrespective of the physical sector size of the backend zoned block
 device being used. This allows reducing the amount of metadata needed to
 manage valid blocks (blocks written).
 The on-disk metadata format is as follows:
 1) The first block of the first conventional zone found contains the
 super block which describes the on disk amount and position of metadata
 blocks.
 2) Following the super block, a set of blocks is used to describe the
 mapping of the logical device blocks. The mapping is done per chunk of
 blocks, with the chunk size equal to the zone size of the zoned block
 device. The mapping table is indexed by chunk number and each mapping
 entry indicates the zone number of the device storing the chunk of
 data. Each mapping entry may also indicate the zone number of a
 conventional zone used to buffer random modifications to the data zone.
 3) A set of blocks used to store bitmaps indicating the validity of
 blocks in the data zones follows the mapping table. A valid block is
 defined as a block that was written and not discarded. For a buffered
 data chunk, a block is valid either in the data zone mapping the
 chunk or in the buffer zone of the chunk.
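
The chunk-based mapping described above can be illustrated with a small
user-space C sketch. It shows how a logical block number could be split
into a chunk number (the index into the mapping table) and a block offset
within the chunk, and how large a one-bit-per-block validity bitmap would
be for one zone. The 256 MB zone size, the structure layouts and all names
(chunk_map_entry, ZONE_UNMAPPED, locate_block, zone_bitmap_bytes) are
assumptions made for illustration; this is not the on-disk format used by
dm-zoned.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE      4096u                     /* logical block size exposed by dm-zoned */
#define ZONE_SIZE       (256u * 1024u * 1024u)    /* assumed zone size: 256 MB */
#define ZONE_NR_BLOCKS  (ZONE_SIZE / BLOCK_SIZE)  /* 65536 blocks per zone, i.e. per chunk */
#define ZONE_UNMAPPED   UINT32_MAX                /* hypothetical "no zone assigned" marker */

/* Hypothetical in-memory mapping entry, one per chunk. */
struct chunk_map_entry {
	uint32_t dzone_id;  /* zone storing the chunk data */
	uint32_t bzone_id;  /* conventional zone buffering random writes, if any */
};

/* Position of a logical block: mapping table index and offset in the chunk. */
struct block_location {
	uint32_t chunk;
	uint32_t chunk_block;
};

/* Split a logical block number into (chunk, block offset within the chunk). */
static struct block_location locate_block(uint64_t block)
{
	struct block_location loc;

	loc.chunk = (uint32_t)(block / ZONE_NR_BLOCKS);
	loc.chunk_block = (uint32_t)(block % ZONE_NR_BLOCKS);
	return loc;
}

/* Bytes needed for a validity bitmap covering one zone (one bit per block). */
static uint32_t zone_bitmap_bytes(void)
{
	assert(ZONE_NR_BLOCKS % 8 == 0);
	return ZONE_NR_BLOCKS / 8;
}
```

Under these assumptions, a zone's validity bitmap occupies 8192 bytes, that
is, two 4096-byte metadata blocks per zone.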
 For a logical chunk mapped to a conventional zone, all write operations
 are processed by directly writing to the zone. If the mapping zone is a
 sequential zone, the write operation is processed directly only if the
 write offset within the logical chunk is equal to the write pointer
 offset within the sequential data zone (i.e. the write operation is
 aligned on the zone write pointer). Otherwise, write operations are
 processed indirectly using a buffer zone. In that case, an unused
 conventional zone is allocated and assigned to the chunk being
 accessed. Writing a block to the buffer zone of a chunk will
 automatically invalidate the same block in the sequential zone mapping
 the chunk. If all blocks of the sequential zone become invalid, the zone
 is freed and the chunk buffer zone becomes the primary zone mapping the
 chunk, resulting in native random write performance similar to a regular
 block device.
 Read operations are processed according to the block validity
 information provided by the bitmaps. Valid blocks are read either from
 the sequential zone mapping a chunk, or if the chunk is buffered, from
 the buffer zone assigned. If the accessed chunk has no mapping, or the
 accessed blocks are invalid, the read buffer is zeroed and the read
 operation terminated.
 After some time, the limited number of conventional zones available may
 be exhausted (all used to map chunks or buffer sequential zones) and
 unaligned writes to unbuffered chunks become impossible. To avoid this
 situation, a reclaim process regularly scans used conventional zones and
 tries to reclaim the least recently used zones by copying the valid
 blocks of the buffer zone to a free sequential zone. Once the copy
 completes, the chunk mapping is updated to point to the sequential zone
 and the buffer zone freed for reuse.
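
When reclaim actually runs is a policy decision. The following user-space
C sketch models one possible policy based on the percentage of free
(unmapped) conventional zones and on whether the target is idle. The 30%
and 50% thresholds and the throttle values mirror the constants used by
the reclaim code in this patch (DMZ_RECLAIM_LOW_UNMAP_RND and
DMZ_RECLAIM_HIGH_UNMAP_RND); the function names and the explicit "idle"
parameter are simplifications assumed for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define RECLAIM_LOW_UNMAP_RND   30  /* at or below this %, reclaim even when busy */
#define RECLAIM_HIGH_UNMAP_RND  50  /* at or above this %, do not reclaim when busy */

/* Decide whether reclaim should run now. */
static bool should_reclaim(unsigned int nr_rnd, unsigned int nr_unmap_rnd,
			   bool idle)
{
	unsigned int p_unmap_rnd;

	assert(nr_rnd > 0 && nr_unmap_rnd <= nr_rnd);
	p_unmap_rnd = nr_unmap_rnd * 100 / nr_rnd;

	/* When idle, reclaim as long as at least one random zone is in use. */
	if (idle && nr_unmap_rnd < nr_rnd)
		return true;

	/* Plenty of free random zones: leave the user workload alone. */
	if (p_unmap_rnd >= RECLAIM_HIGH_UNMAP_RND)
		return false;

	/* Few free random zones: reclaim even if the target is busy. */
	return p_unmap_rnd <= RECLAIM_LOW_UNMAP_RND;
}

/* Copy throttle in percent: 100 means copy at full speed. */
static unsigned int reclaim_throttle(unsigned int p_unmap_rnd, bool idle)
{
	unsigned int t;

	if (idle || p_unmap_rnd < RECLAIM_LOW_UNMAP_RND / 2)
		return 100;

	t = 100 - p_unmap_rnd / 2;
	return t < 75 ? t : 75;
}
```

With these thresholds, a busy target with 40% of its conventional zones
free is left alone, while the same target with 30% free is reclaimed with
a throttled copy.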
 Metadata Protection
 ===================
 To protect metadata against corruption in case of sudden power loss or
 system crash, 2 sets of metadata zones are used. One set, the primary
 set, is used as the main metadata region, while the secondary set is
 used as a staging area. Modified metadata is first written to the
 secondary set and validated by updating the super block in the secondary
 set; a generation counter is used to indicate that this set contains the
 newest metadata. Once this operation completes, in-place metadata
 block updates can be done in the primary metadata set. This ensures that
 one of the sets is always consistent (all modifications committed or none
 at all). Flush operations are used as a commit point. Upon reception of
 a flush request, metadata modification activity is temporarily blocked
 (for both incoming BIO processing and reclaim process) and all dirty
 metadata blocks are staged and updated. Normal operation is then
 resumed. Flushing metadata thus only temporarily delays write and
 discard requests. Read requests can be processed concurrently while
 metadata flush is being executed.
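
To show how a generation counter lets a restart choose between the two
metadata sets, here is a minimal user-space C sketch. The recovery policy
shown (trust the valid set with the highest generation) is an assumption
that follows from the description above rather than a statement of what
dm-zoned does at load time; the meta_set structure, its fields and
pick_metadata_set are hypothetical, and the integrity check that would set
the valid flag is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one metadata set, as read from its super block. */
struct meta_set {
	bool     valid;  /* super block passed its integrity check (not shown) */
	uint64_t gen;    /* generation counter stored in the super block */
};

/* Return the set to trust after a restart, or NULL if neither is usable. */
static const struct meta_set *pick_metadata_set(const struct meta_set *primary,
						const struct meta_set *secondary)
{
	assert(primary != NULL && secondary != NULL);

	/* Both usable: the higher generation holds the newest metadata. */
	if (primary->valid && secondary->valid)
		return secondary->gen > primary->gen ? secondary : primary;

	if (primary->valid)
		return primary;
	if (secondary->valid)
		return secondary;

	return NULL;
}
```

If a crash interrupts the in-place update of the primary set, the
secondary set still carries the newer generation and can be used to
restore a consistent state.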
 Usage
 =====
 A zoned block device must first be formatted using the dmzadm tool. This
 will analyze the device zone configuration, determine where to place the
 metadata sets on the device and initialize the metadata sets.
 Ex:
 dmzadm --format /dev/sdxx
 For a formatted device, the target can be created normally with the
 dmsetup utility. The only parameter that dm-zoned requires is the
 underlying zoned block device name. Ex:
 echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}`
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@ -521,6 +521,23 @@ config DM_INTEGRITY
 	  To compile this code as a module, choose M here: the module will
 	  be called dm-integrity.
 config DM_ZONED
 	tristate "Drive-managed zoned block device target support"
 	depends on BLK_DEV_DM
 	depends on BLK_DEV_ZONED
 	---help---
 	  This device-mapper target takes a host-managed or host-aware zoned
 	  block device and exposes most of its capacity as a regular block
 	  device (drive-managed zoned block device) without any write
 	  constraints. This is mainly intended for use with file systems that
 	  do not natively support zoned block devices but still want to
 	  benefit from the increased capacity offered by SMR disks. Other uses
 	  by applications using raw block devices (for example object stores)
 	  are also possible.
 	  To compile this code as a module, choose M here: the module will
 	  be called dm-zoned.
 	  If unsure, say N.
 endif # MD
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@ -20,6 +20,7 @@ dm-era-y	+= dm-era-target.o
 dm-verity-y	+= dm-verity-target.o
 md-mod-y	+= md.o bitmap.o
 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
 dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
 # Note: link order is important.  All raid personalities
 # and must come before md.o, as they each initialise 
@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CACHE_SMQ)	+= dm-cache-smq.o
 obj-$(CONFIG_DM_ERA)		+= dm-era.o
 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
 obj-$(CONFIG_DM_INTEGRITY)	+= dm-integrity.o
 obj-$(CONFIG_DM_ZONED)		+= dm-zoned.o
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
--- a/drivers/md/dm-zoned-metadata.c
+++ b/drivers/md/dm-zoned-metadata.c
--- a/drivers/md/dm-zoned-reclaim.c
+++ b/drivers/md/dm-zoned-reclaim.c
@ -0,0 +1,570 @@
 /*
 * Copyright (C) 2017 Western Digital Corporation or its affiliates.
 *
 * This file is released under the GPL.
 */
 #include "dm-zoned.h"
 #include <linux/module.h>
 #define	DM_MSG_PREFIX		"zoned reclaim"
 struct dmz_reclaim {
 	struct dmz_metadata     *metadata;
 	struct dmz_dev		*dev;
 	struct delayed_work	work;
 	struct workqueue_struct *wq;
 	struct dm_kcopyd_client	*kc;
 	struct dm_kcopyd_throttle kc_throttle;
 	int			kc_err;
 	unsigned long		flags;
 	/* Last target access time */
 	unsigned long		atime;
 };
 /*
 * Reclaim state flags.
 */
 enum {
 	DMZ_RECLAIM_KCOPY,
 };
 /*
 * Number of seconds of target BIO inactivity to consider the target idle.
 */
 #define DMZ_IDLE_PERIOD		(10UL * HZ)
 /*
 * Percentage of unmapped (free) random zones below which reclaim starts
 * even if the target is busy.
 */
 #define DMZ_RECLAIM_LOW_UNMAP_RND	30
 /*
 * Percentage of unmapped (free) random zones above which reclaim will
 * stop if the target is busy.
 */
 #define DMZ_RECLAIM_HIGH_UNMAP_RND	50
 /*
 * Align a sequential zone write pointer to chunk_block.
 */
 static int dmz_reclaim_align_wp(struct dmz_reclaim *zrc, struct dm_zone *zone,
 				sector_t block)
 {
 	struct dmz_metadata *zmd = zrc->metadata;
 	sector_t wp_block = zone->wp_block;
 	unsigned int nr_blocks;
 	int ret;
 	if (wp_block == block)
 		return 0;
 	if (wp_block > block)
 		return -EIO;
 	/*
 	 * Zeroout the space between the write
 	 * pointer and the requested position.
 	 */
 	nr_blocks = block - wp_block;
 	ret = blkdev_issue_zeroout(zrc->dev->bdev,
 				   dmz_start_sect(zmd, zone) + dmz_blk2sect(wp_block),
 				   dmz_blk2sect(nr_blocks), GFP_NOFS, false);
 	if (ret) {
 		dmz_dev_err(zrc->dev,
 			    "Align zone %u wp %llu to %llu (wp+%u) blocks failed %d",
 			    dmz_id(zmd, zone), (unsigned long long)wp_block,
 			    (unsigned long long)block, nr_blocks, ret);
 		return ret;
 	}
 	zone->wp_block = block;
 	return 0;
 }
 /*
 * dm_kcopyd_copy end notification.
 */
 static void dmz_reclaim_kcopy_end(int read_err, unsigned long write_err,
 				  void *context)
 {
 	struct dmz_reclaim *zrc = context;
 	if (read_err || write_err)
 		zrc->kc_err = -EIO;
 	else
 		zrc->kc_err = 0;
 	clear_bit_unlock(DMZ_RECLAIM_KCOPY, &zrc->flags);
 	smp_mb__after_atomic();
 	wake_up_bit(&zrc->flags, DMZ_RECLAIM_KCOPY);
 }
 /*
 * Copy valid blocks of src_zone into dst_zone.
 */
 static int dmz_reclaim_copy(struct dmz_reclaim *zrc,
 			    struct dm_zone *src_zone, struct dm_zone *dst_zone)
 {
 	struct dmz_metadata *zmd = zrc->metadata;
 	struct dmz_dev *dev = zrc->dev;
 	struct dm_io_region src, dst;
 	sector_t block = 0, end_block;
 	sector_t nr_blocks;
 	sector_t src_zone_block;
 	sector_t dst_zone_block;
 	unsigned long flags = 0;
 	int ret;
 	if (dmz_is_seq(src_zone))
 		end_block = src_zone->wp_block;
 	else
 		end_block = dev->zone_nr_blocks;
 	src_zone_block = dmz_start_block(zmd, src_zone);
 	dst_zone_block = dmz_start_block(zmd, dst_zone);
 	if (dmz_is_seq(dst_zone))
 		set_bit(DM_KCOPYD_WRITE_SEQ, &flags);
 	while (block < end_block) {
 		/* Get a valid region from the source zone */
 		ret = dmz_first_valid_block(zmd, src_zone, &block);
 		if (ret <= 0)
 			return ret;
 		nr_blocks = ret;
 		/*
 		 * If we are writing in a sequential zone, we must make sure
 		 * that writes are sequential, so zero out any hole
 		 * between writes.
 		 */
 		if (dmz_is_seq(dst_zone)) {
 			ret = dmz_reclaim_align_wp(zrc, dst_zone, block);
 			if (ret)
 				return ret;
 		}
 		src.bdev = dev->bdev;
 		src.sector = dmz_blk2sect(src_zone_block + block);
 		src.count = dmz_blk2sect(nr_blocks);
 		dst.bdev = dev->bdev;
 		dst.sector = dmz_blk2sect(dst_zone_block + block);
 		dst.count = src.count;
 		/* Copy the valid region */
 		set_bit(DMZ_RECLAIM_KCOPY, &zrc->flags);
 		ret = dm_kcopyd_copy(zrc->kc, &src, 1, &dst, flags,
 				     dmz_reclaim_kcopy_end, zrc);
 		if (ret)
 			return ret;
 		/* Wait for copy to complete */
 		wait_on_bit_io(&zrc->flags, DMZ_RECLAIM_KCOPY,
 			       TASK_UNINTERRUPTIBLE);
 		if (zrc->kc_err)
 			return zrc->kc_err;
 		block += nr_blocks;
 		if (dmz_is_seq(dst_zone))
 			dst_zone->wp_block = block;
 	}
 	return 0;
 }
 /*
 * Move valid blocks of dzone buffer zone into dzone (after its write pointer)
 * and free the buffer zone.
 */
 static int dmz_reclaim_buf(struct dmz_reclaim *zrc, struct dm_zone *dzone)
 {
 	struct dm_zone *bzone = dzone->bzone;
 	sector_t chunk_block = dzone->wp_block;
 	struct dmz_metadata *zmd = zrc->metadata;
 	int ret;
 	dmz_dev_debug(zrc->dev,
 		      "Chunk %u, move buf zone %u (weight %u) to data zone %u (weight %u)",
 		      dzone->chunk, dmz_id(zmd, bzone), dmz_weight(bzone),
 		      dmz_id(zmd, dzone), dmz_weight(dzone));
 	/* Flush the buffer zone into the data zone */
 	ret = dmz_reclaim_copy(zrc, bzone, dzone);
 	if (ret < 0)
 		return ret;
 	dmz_lock_flush(zmd);
 	/* Validate copied blocks */
 	ret = dmz_merge_valid_blocks(zmd, bzone, dzone, chunk_block);
 	if (ret == 0) {
 		/* Free the buffer zone */
 		dmz_invalidate_blocks(zmd, bzone, 0, zrc->dev->zone_nr_blocks);
 		dmz_lock_map(zmd);
 		dmz_unmap_zone(zmd, bzone);
 		dmz_unlock_zone_reclaim(dzone);
 		dmz_free_zone(zmd, bzone);
 		dmz_unlock_map(zmd);
 	}
 	dmz_unlock_flush(zmd);
 	return ret;
 }
 /*
 * Merge valid blocks of dzone into its buffer zone and free dzone.
 */
 static int dmz_reclaim_seq_data(struct dmz_reclaim *zrc, struct dm_zone *dzone)
 {
 	unsigned int chunk = dzone->chunk;
 	struct dm_zone *bzone = dzone->bzone;
 	struct dmz_metadata *zmd = zrc->metadata;
 	int ret = 0;
 	dmz_dev_debug(zrc->dev,
 		      "Chunk %u, move data zone %u (weight %u) to buf zone %u (weight %u)",
 		      chunk, dmz_id(zmd, dzone), dmz_weight(dzone),
 		      dmz_id(zmd, bzone), dmz_weight(bzone));
 	/* Flush data zone into the buffer zone */
 	ret = dmz_reclaim_copy(zrc, dzone, bzone);
 	if (ret < 0)
 		return ret;
 	dmz_lock_flush(zmd);
 	/* Validate copied blocks */
 	ret = dmz_merge_valid_blocks(zmd, dzone, bzone, 0);
 	if (ret == 0) {
 		/*
 		 * Free the data zone and remap the chunk to
 		 * the buffer zone.
 		 */
 		dmz_invalidate_blocks(zmd, dzone, 0, zrc->dev->zone_nr_blocks);
 		dmz_lock_map(zmd);
 		dmz_unmap_zone(zmd, bzone);
 		dmz_unmap_zone(zmd, dzone);
 		dmz_unlock_zone_reclaim(dzone);
 		dmz_free_zone(zmd, dzone);
 		dmz_map_zone(zmd, bzone, chunk);
 		dmz_unlock_map(zmd);
 	}
 	dmz_unlock_flush(zmd);
 	return ret;
 }
 /*
 * Move valid blocks of the random data zone dzone into a free sequential zone.
 * Once blocks are moved, remap the zone chunk to the sequential zone.
 */
 static int dmz_reclaim_rnd_data(struct dmz_reclaim *zrc, struct dm_zone *dzone)
 {
 	unsigned int chunk = dzone->chunk;
 	struct dm_zone *szone = NULL;
 	struct dmz_metadata *zmd = zrc->metadata;
 	int ret;
 	/* Get a free sequential zone */
 	dmz_lock_map(zmd);
 	szone = dmz_alloc_zone(zmd, DMZ_ALLOC_RECLAIM);
 	dmz_unlock_map(zmd);
 	if (!szone)
 		return -ENOSPC;
 	dmz_dev_debug(zrc->dev,
 		      "Chunk %u, move rnd zone %u (weight %u) to seq zone %u",
 		      chunk, dmz_id(zmd, dzone), dmz_weight(dzone),
 		      dmz_id(zmd, szone));
 	/* Flush the random data zone into the sequential zone */
 	ret = dmz_reclaim_copy(zrc, dzone, szone);
 	dmz_lock_flush(zmd);
 	if (ret == 0) {
 		/* Validate copied blocks */
 		ret = dmz_copy_valid_blocks(zmd, dzone, szone);
 	}
 	if (ret) {
 		/* Free the sequential zone */
 		dmz_lock_map(zmd);
 		dmz_free_zone(zmd, szone);
 		dmz_unlock_map(zmd);
 	} else {
 		/* Free the data zone and remap the chunk */
 		dmz_invalidate_blocks(zmd, dzone, 0, zrc->dev->zone_nr_blocks);
 		dmz_lock_map(zmd);
 		dmz_unmap_zone(zmd, dzone);
 		dmz_unlock_zone_reclaim(dzone);
 		dmz_free_zone(zmd, dzone);
 		dmz_map_zone(zmd, szone, chunk);
 		dmz_unlock_map(zmd);
 	}
 	dmz_unlock_flush(zmd);
 	return ret;
 }
 /*
 * Reclaim an empty zone.
 */
 static void dmz_reclaim_empty(struct dmz_reclaim *zrc, struct dm_zone *dzone)
 {
 	struct dmz_metadata *zmd = zrc->metadata;
 	dmz_lock_flush(zmd);
 	dmz_lock_map(zmd);
 	dmz_unmap_zone(zmd, dzone);
 	dmz_unlock_zone_reclaim(dzone);
 	dmz_free_zone(zmd, dzone);
 	dmz_unlock_map(zmd);
 	dmz_unlock_flush(zmd);
 }
 /*
 * Find a candidate zone for reclaim and process it.
 */
 static void dmz_reclaim(struct dmz_reclaim *zrc)
 {
 	struct dmz_metadata *zmd = zrc->metadata;
 	struct dm_zone *dzone;
 	struct dm_zone *rzone;
 	unsigned long start;
 	int ret;
 	/* Get a data zone */
 	dzone = dmz_get_zone_for_reclaim(zmd);
 	if (!dzone)
 		return;
 	start = jiffies;
 	if (dmz_is_rnd(dzone)) {
 		if (!dmz_weight(dzone)) {
 			/* Empty zone */
 			dmz_reclaim_empty(zrc, dzone);
 			ret = 0;
 		} else {
 			/*
 			 * Reclaim the random data zone by moving its
 			 * valid data blocks to a free sequential zone.
 			 */
 			ret = dmz_reclaim_rnd_data(zrc, dzone);
 		}
 		rzone = dzone;
 	} else {
 		struct dm_zone *bzone = dzone->bzone;
 		sector_t chunk_block = 0;
 		ret = dmz_first_valid_block(zmd, bzone, &chunk_block);
 		if (ret < 0)
 			goto out;
 		if (ret == 0 || chunk_block >= dzone->wp_block) {
 			/*
 			 * The buffer zone is empty or its valid blocks are
 			 * after the data zone write pointer.
 			 */
 			ret = dmz_reclaim_buf(zrc, dzone);
 			rzone = bzone;
 		} else {
 			/*
 			 * Reclaim the data zone by merging it into the
 			 * buffer zone so that the buffer zone itself can
 			 * be later reclaimed.
 			 */
 			ret = dmz_reclaim_seq_data(zrc, dzone);
 			rzone = dzone;
 		}
 	}
 out:
 	if (ret) {
 		dmz_unlock_zone_reclaim(dzone);
 		return;
 	}
 	(void) dmz_flush_metadata(zrc->metadata);
 	dmz_dev_debug(zrc->dev, "Reclaimed zone %u in %u ms",
 		      dmz_id(zmd, rzone), jiffies_to_msecs(jiffies - start));
 }
 /*
 * Test if the target device is idle.
 */
 static inline int dmz_target_idle(struct dmz_reclaim *zrc)
 {
 	return time_is_before_jiffies(zrc->atime + DMZ_IDLE_PERIOD);
 }
 /*
 * Test if reclaim is necessary.
 */
 static bool dmz_should_reclaim(struct dmz_reclaim *zrc)
 {
 	struct dmz_metadata *zmd = zrc->metadata;
 	unsigned int nr_rnd = dmz_nr_rnd_zones(zmd);
 	unsigned int nr_unmap_rnd = dmz_nr_unmap_rnd_zones(zmd);
 	unsigned int p_unmap_rnd = nr_unmap_rnd * 100 / nr_rnd;
 	/* Reclaim when idle */
 	if (dmz_target_idle(zrc) && nr_unmap_rnd < nr_rnd)
 		return true;
 	/* If there are still plenty of random zones, do not reclaim */
 	if (p_unmap_rnd >= DMZ_RECLAIM_HIGH_UNMAP_RND)
 		return false;
 	/*
 	 * If the percentage of unmapped random zones is low,
 	 * reclaim even if the target is busy.
 	 */
 	return p_unmap_rnd <= DMZ_RECLAIM_LOW_UNMAP_RND;
 }
 /*
 * Reclaim work function.
 */
 static void dmz_reclaim_work(struct work_struct *work)
 {
 	struct dmz_reclaim *zrc = container_of(work, struct dmz_reclaim, work.work);
 	struct dmz_metadata *zmd = zrc->metadata;
 	unsigned int nr_rnd, nr_unmap_rnd;
 	unsigned int p_unmap_rnd;
 	if (!dmz_should_reclaim(zrc)) {
 		mod_delayed_work(zrc->wq, &zrc->work, DMZ_IDLE_PERIOD);
 		return;
 	}
 	/*
 	 * We need to start reclaiming random zones: set up zone copy
 	 * throttling to go fast if we are very low on random zones, and
 	 * slower if there are still some free random zones, to limit the
 	 * impact on the user workload as much as possible.
 	 */
 	nr_rnd = dmz_nr_rnd_zones(zmd);
 	nr_unmap_rnd = dmz_nr_unmap_rnd_zones(zmd);
 	p_unmap_rnd = nr_unmap_rnd * 100 / nr_rnd;
 	if (dmz_target_idle(zrc) || p_unmap_rnd < DMZ_RECLAIM_LOW_UNMAP_RND / 2) {
 		/* Idle or very low percentage: go fast */
 		zrc->kc_throttle.throttle = 100;
 	} else {
 		/* Busy but we still have some random zones: throttle */
 		zrc->kc_throttle.throttle = min(75U, 100U - p_unmap_rnd / 2);
 	}
 	dmz_dev_debug(zrc->dev,
 		      "Reclaim (%u): %s, %u%% free rnd zones (%u/%u)",
 		      zrc->kc_throttle.throttle,
 		      (dmz_target_idle(zrc) ? "Idle" : "Busy"),
 		      p_unmap_rnd, nr_unmap_rnd, nr_rnd);
 	dmz_reclaim(zrc);
 	dmz_schedule_reclaim(zrc);
 }
 /*
 * Initialize reclaim.
 */
 int dmz_ctr_reclaim(struct dmz_dev *dev, struct dmz_metadata *zmd,
 		    struct dmz_reclaim **reclaim)
 {
 	struct dmz_reclaim *zrc;
 	int ret;
 	zrc = kzalloc(sizeof(struct dmz_reclaim), GFP_KERNEL);
 	if (!zrc)
 		return -ENOMEM;
 	zrc->dev = dev;
 	zrc->metadata = zmd;
 	zrc->atime = jiffies;
 	/* Reclaim kcopyd client */
 	zrc->kc = dm_kcopyd_client_create(&zrc->kc_throttle);
 	if (IS_ERR(zrc->kc)) {
 		ret = PTR_ERR(zrc->kc);
 		zrc->kc = NULL;
 		goto err;
 	}
 	/* Reclaim work */
 	INIT_DELAYED_WORK(&zrc->work, dmz_reclaim_work);
 	zrc->wq = alloc_ordered_workqueue("dmz_rwq_%s", WQ_MEM_RECLAIM,
 					  dev->name);
 	if (!zrc->wq) {
 		ret = -ENOMEM;
 		goto err;
 	}
 	*reclaim = zrc;
 	queue_delayed_work(zrc->wq, &zrc->work, 0);
 	return 0;
 err:
 	if (zrc->kc)
 		dm_kcopyd_client_destroy(zrc->kc);
 	kfree(zrc);
 	return ret;
 }
 /*
 * Terminate reclaim.
 */
 void dmz_dtr_reclaim(struct dmz_reclaim *zrc)
 {
 	cancel_delayed_work_sync(&zrc->work);
 	destroy_workqueue(zrc->wq);
 	dm_kcopyd_client_destroy(zrc->kc);
 	kfree(zrc);
 }
 /*
 * Suspend reclaim.
 */
 void dmz_suspend_reclaim(struct dmz_reclaim *zrc)
 {
 	cancel_delayed_work_sync(&zrc->work);
 }
 /*
 * Resume reclaim.
 */
 void dmz_resume_reclaim(struct dmz_reclaim *zrc)
 {
 	queue_delayed_work(zrc->wq, &zrc->work, DMZ_IDLE_PERIOD);
 }
 /*
 * BIO accounting.
 */
 void dmz_reclaim_bio_acc(struct dmz_reclaim *zrc)
 {
 	zrc->atime = jiffies;
 }
 /*
 * Start reclaim if necessary.
 */
 void dmz_schedule_reclaim(struct dmz_reclaim *zrc)
 {
 	if (dmz_should_reclaim(zrc))
 		mod_delayed_work(zrc->wq, &zrc->work, 0);
 }
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@ -0,0 +1,967 @@
 /*
 * Copyright (C) 2017 Western Digital Corporation or its affiliates.
 *
 * This file is released under the GPL.
 */
 #include "dm-zoned.h"
 #include <linux/module.h>
 #define	DM_MSG_PREFIX		"zoned"
 #define DMZ_MIN_BIOS		8192
 /*
 * Zone BIO context.
 */
 struct dmz_bioctx {
 	struct dmz_target	*target;
 	struct dm_zone		*zone;
 	struct bio		*bio;
 	atomic_t		ref;
 	blk_status_t		status;
 };
 /*
 * Chunk work descriptor.
 */
 struct dm_chunk_work {
 	struct work_struct	work;
 	atomic_t		refcount;
 	struct dmz_target	*target;
 	unsigned int		chunk;
 	struct bio_list		bio_list;
 };
 /*
 * Target descriptor.
 */
 struct dmz_target {
 	struct dm_dev		*ddev;
 	unsigned long		flags;
 	/* Zoned block device information */
 	struct dmz_dev		*dev;
 	/* For metadata handling */
 	struct dmz_metadata     *metadata;
 	/* For reclaim */
 	struct dmz_reclaim	*reclaim;
 	/* For chunk work */
 	struct mutex		chunk_lock;
 	struct radix_tree_root	chunk_rxtree;
 	struct workqueue_struct *chunk_wq;
 	/* For cloned BIOs to zones */
 	struct bio_set		*bio_set;
 	/* For flush */
 	spinlock_t		flush_lock;
 	struct bio_list		flush_list;
 	struct delayed_work	flush_work;
 	struct workqueue_struct *flush_wq;
 };
 /*
 * Flush intervals (seconds).
 */
 #define DMZ_FLUSH_PERIOD	(10 * HZ)
 /*
 * Target BIO completion.
 */
 static inline void dmz_bio_endio(struct bio *bio, blk_status_t status)
 {
 	struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
 	if (bioctx->status == BLK_STS_OK && status != BLK_STS_OK)
 		bioctx->status = status;
 	bio_endio(bio);
 }
 /*
 * Partial clone read BIO completion callback. This terminates the
 * target BIO when there are no more references to its context.
 */
 static void dmz_read_bio_end_io(struct bio *bio)
 {
 	struct dmz_bioctx *bioctx = bio->bi_private;
 	blk_status_t status = bio->bi_status;
 	bio_put(bio);
 	dmz_bio_endio(bioctx->bio, status);
 }
 /*
 * Issue a BIO to a zone. The BIO may only partially process the
 * original target BIO.
 */
 static int dmz_submit_read_bio(struct dmz_target *dmz, struct dm_zone *zone,
 			       struct bio *bio, sector_t chunk_block,
 			       unsigned int nr_blocks)
 {
 	struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
 	sector_t sector;
 	struct bio *clone;
 	/* BIO remap sector */
 	sector = dmz_start_sect(dmz->metadata, zone) + dmz_blk2sect(chunk_block);
 	/* If the read is not partial, there is no need to clone the BIO */
 	if (nr_blocks == dmz_bio_blocks(bio)) {
 		/* Setup and submit the BIO */
 		bio->bi_iter.bi_sector = sector;
 		atomic_inc(&bioctx->ref);
 		generic_make_request(bio);
 		return 0;
 	}
 	/* Partial BIO: we need to clone the BIO */
 	clone = bio_clone_fast(bio, GFP_NOIO, dmz->bio_set);
 	if (!clone)
 		return -ENOMEM;
 	/* Setup the clone */
 	clone->bi_iter.bi_sector = sector;
 	clone->bi_iter.bi_size = dmz_blk2sect(nr_blocks) << SECTOR_SHIFT;
 	clone->bi_end_io = dmz_read_bio_end_io;
 	clone->bi_private = bioctx;
 	bio_advance(bio, clone->bi_iter.bi_size);
 	/* Submit the clone */
 	atomic_inc(&bioctx->ref);
 	generic_make_request(clone);
 	return 0;
 }
 /*
 * Zero out pages of discarded blocks accessed by a read BIO.
 */
 static void dmz_handle_read_zero(struct dmz_target *dmz, struct bio *bio,
 				 sector_t chunk_block, unsigned int nr_blocks)
 {
 	unsigned int size = nr_blocks << DMZ_BLOCK_SHIFT;
 	/* Clear nr_blocks */
 	swap(bio->bi_iter.bi_size, size);
 	zero_fill_bio(bio);
 	swap(bio->bi_iter.bi_size, size);
 	bio_advance(bio, size);
 }
 /*
 * Process a read BIO.
 */
 static int dmz_handle_read(struct dmz_target *dmz, struct dm_zone *zone,
 			   struct bio *bio)
 {
 	sector_t chunk_block = dmz_chunk_block(dmz->dev, dmz_bio_block(bio));
 	unsigned int nr_blocks = dmz_bio_blocks(bio);
 	sector_t end_block = chunk_block + nr_blocks;
 	struct dm_zone *rzone, *bzone;
 	int ret;
 	/* Reads of unmapped chunks only need the BIO buffer zeroed */
 	if (!zone) {
 		zero_fill_bio(bio);
 		return 0;
 	}
 	dmz_dev_debug(dmz->dev, "READ chunk %llu -> %s zone %u, block %llu, %u blocks",
 		      (unsigned long long)dmz_bio_chunk(dmz->dev, bio),
 		      (dmz_is_rnd(zone) ? "RND" : "SEQ"),
 		      dmz_id(dmz->metadata, zone),
 		      (unsigned long long)chunk_block, nr_blocks);
 	/* Check block validity to determine the read location */
 	bzone = zone->bzone;
 	while (chunk_block < end_block) {
 		nr_blocks = 0;
 		if (dmz_is_rnd(zone) || chunk_block < zone->wp_block) {
 			/* Test block validity in the data zone */
 			ret = dmz_block_valid(dmz->metadata, zone, chunk_block);
 			if (ret < 0)
 				return ret;
 			if (ret > 0) {
 				/* Read data zone blocks */
 				nr_blocks = ret;
 				rzone = zone;
 			}
 		}
 		/*
 		 * No valid blocks found in the data zone.
 		 * Check the buffer zone, if there is one.
 		 */
 		if (!nr_blocks && bzone) {
 			ret = dmz_block_valid(dmz->metadata, bzone, chunk_block);
 			if (ret < 0)
 				return ret;
 			if (ret > 0) {
 				/* Read buffer zone blocks */
 				nr_blocks = ret;
 				rzone = bzone;
 			}
 		}
 		if (nr_blocks) {
 			/* Valid blocks found: read them */
 			nr_blocks = min_t(unsigned int, nr_blocks, end_block - chunk_block);
 			ret = dmz_submit_read_bio(dmz, rzone, bio, chunk_block, nr_blocks);
 			if (ret)
 				return ret;
 			chunk_block += nr_blocks;
 		} else {
 			/* No valid block: zeroout the current BIO block */
 			dmz_handle_read_zero(dmz, bio, chunk_block, 1);
 			chunk_block++;
 		}
 	}
 	return 0;
 }
 /*
 * Issue a write BIO to a zone.
 */
 static void dmz_submit_write_bio(struct dmz_target *dmz, struct dm_zone *zone,
 				 struct bio *bio, sector_t chunk_block,
 				 unsigned int nr_blocks)
 {
 	struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
 	/* Setup and submit the BIO */
 	bio->bi_bdev = dmz->dev->bdev;
 	bio->bi_iter.bi_sector = dmz_start_sect(dmz->metadata, zone) + dmz_blk2sect(chunk_block);
 	atomic_inc(&bioctx->ref);
 	generic_make_request(bio);
 	if (dmz_is_seq(zone))
 		zone->wp_block += nr_blocks;
 }
 /*
 * Write blocks directly in a data zone, at the write pointer.
 * If a buffer zone is assigned, invalidate the blocks written
 * in place.
 */
 static int dmz_handle_direct_write(struct dmz_target *dmz,
 				   struct dm_zone *zone, struct bio *bio,
 				   sector_t chunk_block,
 				   unsigned int nr_blocks)
 {
 	struct dmz_metadata *zmd = dmz->metadata;
 	struct dm_zone *bzone = zone->bzone;
 	int ret;
 	if (dmz_is_readonly(zone))
 		return -EROFS;
 	/* Submit write */
 	dmz_submit_write_bio(dmz, zone, bio, chunk_block, nr_blocks);
 	/*
 	 * Validate the blocks in the data zone and invalidate
 	 * in the buffer zone, if there is one.
 	 */
 	ret = dmz_validate_blocks(zmd, zone, chunk_block, nr_blocks);
 	if (ret == 0 && bzone)
 		ret = dmz_invalidate_blocks(zmd, bzone, chunk_block, nr_blocks);
 	return ret;
 }
 /*
 * Write blocks in the buffer zone of @zone.
 * If no buffer zone is assigned yet, get one.
 * Called with @zone write locked.
 */
 static int dmz_handle_buffered_write(struct dmz_target *dmz,
 				     struct dm_zone *zone, struct bio *bio,
 				     sector_t chunk_block,
 				     unsigned int nr_blocks)
 {
 	struct dmz_metadata *zmd = dmz->metadata;
 	struct dm_zone *bzone;
 	int ret;
 	/* Get the buffer zone. One will be allocated if needed */
 	bzone = dmz_get_chunk_buffer(zmd, zone);
 	if (!bzone)
 		return -ENOSPC;
 	if (dmz_is_readonly(bzone))
 		return -EROFS;
 	/* Submit write */
 	dmz_submit_write_bio(dmz, bzone, bio, chunk_block, nr_blocks);
 	/*
 	 * Validate the blocks in the buffer zone
 	 * and invalidate in the data zone.
 	 */
 	ret = dmz_validate_blocks(zmd, bzone, chunk_block, nr_blocks);
 	if (ret == 0 && chunk_block < zone->wp_block)
 		ret = dmz_invalidate_blocks(zmd, zone, chunk_block, nr_blocks);
 	return ret;
 }
 /*
 * Process a write BIO.
 */
 static int dmz_handle_write(struct dmz_target *dmz, struct dm_zone *zone,
 			    struct bio *bio)
 {
 	sector_t chunk_block = dmz_chunk_block(dmz->dev, dmz_bio_block(bio));
 	unsigned int nr_blocks = dmz_bio_blocks(bio);
 	if (!zone)
 		return -ENOSPC;
 	dmz_dev_debug(dmz->dev, "WRITE chunk %llu -> %s zone %u, block %llu, %u blocks",
 		      (unsigned long long)dmz_bio_chunk(dmz->dev, bio),
 		      (dmz_is_rnd(zone) ? "RND" : "SEQ"),
 		      dmz_id(dmz->metadata, zone),
 		      (unsigned long long)chunk_block, nr_blocks);
 	if (dmz_is_rnd(zone) || chunk_block == zone->wp_block) {
 		/*
 		 * The zone is a random zone, or it is a sequential zone
 		 * and the BIO is aligned to the zone write pointer:
 		 * write directly to the zone.
 		 */
 		return dmz_handle_direct_write(dmz, zone, bio, chunk_block, nr_blocks);
 	}
 	/*
 	 * This is an unaligned write in a sequential zone:
 	 * use buffered write.
 	 */
 	return dmz_handle_buffered_write(dmz, zone, bio, chunk_block, nr_blocks);
 }
 /*
 * Process a discard BIO.
 */
 static int dmz_handle_discard(struct dmz_target *dmz, struct dm_zone *zone,
 			      struct bio *bio)
 {
 	struct dmz_metadata *zmd = dmz->metadata;
 	sector_t block = dmz_bio_block(bio);
 	unsigned int nr_blocks = dmz_bio_blocks(bio);
 	sector_t chunk_block = dmz_chunk_block(dmz->dev, block);
 	int ret = 0;
 	/* For unmapped chunks, there is nothing to do */
 	if (!zone)
 		return 0;
 	if (dmz_is_readonly(zone))
 		return -EROFS;
 	dmz_dev_debug(dmz->dev, "DISCARD chunk %llu -> zone %u, block %llu, %u blocks",
 		      (unsigned long long)dmz_bio_chunk(dmz->dev, bio),
 		      dmz_id(zmd, zone),
 		      (unsigned long long)chunk_block, nr_blocks);
 	/*
 	 * Invalidate blocks in the data zone and its
 	 * buffer zone if one is mapped.
 	 */
 	if (dmz_is_rnd(zone) || chunk_block < zone->wp_block)
 		ret = dmz_invalidate_blocks(zmd, zone, chunk_block, nr_blocks);
 	if (ret == 0 && zone->bzone)
 		ret = dmz_invalidate_blocks(zmd, zone->bzone,
 					    chunk_block, nr_blocks);
 	return ret;
 }
 /*
 * Process a BIO.
 */
 static void dmz_handle_bio(struct dmz_target *dmz, struct dm_chunk_work *cw,
 			   struct bio *bio)
 {
 	struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
 	struct dmz_metadata *zmd = dmz->metadata;
 	struct dm_zone *zone;
 	int ret;
 	/*
 	 * Write may trigger a zone allocation. So make sure the
 	 * allocation can succeed.
 	 */
 	if (bio_op(bio) == REQ_OP_WRITE)
 		dmz_schedule_reclaim(dmz->reclaim);
 	dmz_lock_metadata(zmd);
 	/*
 	 * Get the data zone mapping the chunk. There may be no
 	 * mapping for read and discard. If a mapping is obtained,
 	 * the zone returned will be set to active state.
 	 */
 	zone = dmz_get_chunk_mapping(zmd, dmz_bio_chunk(dmz->dev, bio),
 				     bio_op(bio));
 	if (IS_ERR(zone)) {
 		ret = PTR_ERR(zone);
 		goto out;
 	}
 	/* Process the BIO */
 	if (zone) {
 		dmz_activate_zone(zone);
 		bioctx->zone = zone;
 	}
 	switch (bio_op(bio)) {
 	case REQ_OP_READ:
 		ret = dmz_handle_read(dmz, zone, bio);
 		break;
 	case REQ_OP_WRITE:
 		ret = dmz_handle_write(dmz, zone, bio);
 		break;
 	case REQ_OP_DISCARD:
 	case REQ_OP_WRITE_ZEROES:
 		ret = dmz_handle_discard(dmz, zone, bio);
 		break;
 	default:
 		dmz_dev_err(dmz->dev, "Unsupported BIO operation 0x%x",
 			    bio_op(bio));
 		ret = -EIO;
 	}
 	/*
 	 * Release the chunk mapping. This will check that the mapping
 	 * is still valid, that is, that the zone used still has valid blocks.
 	 */
 	if (zone)
 		dmz_put_chunk_mapping(zmd, zone);
 out:
 	dmz_bio_endio(bio, errno_to_blk_status(ret));
 	dmz_unlock_metadata(zmd);
 }
 /*
 * Increment a chunk work reference count.
 */
 static inline void dmz_get_chunk_work(struct dm_chunk_work *cw)
 {
 	atomic_inc(&cw->refcount);
 }
 /*
 * Decrement a chunk work reference count and
 * free it if it becomes 0.
 */
 static void dmz_put_chunk_work(struct dm_chunk_work *cw)
 {
 	if (atomic_dec_and_test(&cw->refcount)) {
 		WARN_ON(!bio_list_empty(&cw->bio_list));
 		radix_tree_delete(&cw->target->chunk_rxtree, cw->chunk);
 		kfree(cw);
 	}
 }
 /*
 * Chunk BIO work function.
 */
 static void dmz_chunk_work(struct work_struct *work)
 {
 	struct dm_chunk_work *cw = container_of(work, struct dm_chunk_work, work);
 	struct dmz_target *dmz = cw->target;
 	struct bio *bio;
 	mutex_lock(&dmz->chunk_lock);
 	/* Process the chunk BIOs */
 	while ((bio = bio_list_pop(&cw->bio_list))) {
 		mutex_unlock(&dmz->chunk_lock);
 		dmz_handle_bio(dmz, cw, bio);
 		mutex_lock(&dmz->chunk_lock);
 		dmz_put_chunk_work(cw);
 	}
 	/* Queueing the work incremented the work refcount */
 	dmz_put_chunk_work(cw);
 	mutex_unlock(&dmz->chunk_lock);
 }
 /*
 * Flush work.
 */
 static void dmz_flush_work(struct work_struct *work)
 {
 	struct dmz_target *dmz = container_of(work, struct dmz_target, flush_work.work);
 	struct bio *bio;
 	int ret;
 	/* Flush dirty metadata blocks */
 	ret = dmz_flush_metadata(dmz->metadata);
 	/* Process queued flush requests */
 	while (1) {
 		spin_lock(&dmz->flush_lock);
 		bio = bio_list_pop(&dmz->flush_list);
 		spin_unlock(&dmz->flush_lock);
 		if (!bio)
 			break;
 		dmz_bio_endio(bio, errno_to_blk_status(ret));
 	}
 	queue_delayed_work(dmz->flush_wq, &dmz->flush_work, DMZ_FLUSH_PERIOD);
 }
 /*
 * Get a chunk work and start it to process a new BIO.
 * If the BIO chunk has no work yet, create one.
 */
 static void dmz_queue_chunk_work(struct dmz_target *dmz, struct bio *bio)
 {
 	unsigned int chunk = dmz_bio_chunk(dmz->dev, bio);
 	struct dm_chunk_work *cw;
 	mutex_lock(&dmz->chunk_lock);
 	/* Get the BIO chunk work. If one is not active yet, create one */
 	cw = radix_tree_lookup(&dmz->chunk_rxtree, chunk);
 	if (!cw) {
 		int ret;
 		/* Create a new chunk work */
 		cw = kmalloc(sizeof(struct dm_chunk_work), GFP_NOFS);
 		if (!cw)
 			goto out;
 		INIT_WORK(&cw->work, dmz_chunk_work);
 		atomic_set(&cw->refcount, 0);
 		cw->target = dmz;
 		cw->chunk = chunk;
 		bio_list_init(&cw->bio_list);
 		ret = radix_tree_insert(&dmz->chunk_rxtree, chunk, cw);
 		if (unlikely(ret)) {
 			kfree(cw);
 			cw = NULL;
 			goto out;
 		}
 	}
 	bio_list_add(&cw->bio_list, bio);
 	dmz_get_chunk_work(cw);
 	if (queue_work(dmz->chunk_wq, &cw->work))
 		dmz_get_chunk_work(cw);
 out:
 	mutex_unlock(&dmz->chunk_lock);
 }
 /*
 * Process a new BIO.
 */
 static int dmz_map(struct dm_target *ti, struct bio *bio)
 {
 	struct dmz_target *dmz = ti->private;
 	struct dmz_dev *dev = dmz->dev;
 	struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
 	sector_t sector = bio->bi_iter.bi_sector;
 	unsigned int nr_sectors = bio_sectors(bio);
 	sector_t chunk_sector;
 	dmz_dev_debug(dev, "BIO op %d sector %llu + %u => chunk %llu, block %llu, %u blocks",
 		      bio_op(bio), (unsigned long long)sector, nr_sectors,
 		      (unsigned long long)dmz_bio_chunk(dmz->dev, bio),
 		      (unsigned long long)dmz_chunk_block(dmz->dev, dmz_bio_block(bio)),
 		      (unsigned int)dmz_bio_blocks(bio));
 	bio->bi_bdev = dev->bdev;
 	if (!nr_sectors && (bio_op(bio) != REQ_OP_FLUSH) && (bio_op(bio) != REQ_OP_WRITE))
 		return DM_MAPIO_REMAPPED;
 	/* The BIO should be block aligned */
 	if ((nr_sectors & DMZ_BLOCK_SECTORS_MASK) || (sector & DMZ_BLOCK_SECTORS_MASK))
 		return DM_MAPIO_KILL;
 	/* Initialize the BIO context */
 	bioctx->target = dmz;
 	bioctx->zone = NULL;
 	bioctx->bio = bio;
 	atomic_set(&bioctx->ref, 1);
 	bioctx->status = BLK_STS_OK;
 	/* Set the BIO pending in the flush list */
 	if (bio_op(bio) == REQ_OP_FLUSH || (!nr_sectors && bio_op(bio) == REQ_OP_WRITE)) {
 		spin_lock(&dmz->flush_lock);
 		bio_list_add(&dmz->flush_list, bio);
 		spin_unlock(&dmz->flush_lock);
 		mod_delayed_work(dmz->flush_wq, &dmz->flush_work, 0);
 		return DM_MAPIO_SUBMITTED;
 	}
 	/* Split zone BIOs to fit entirely into a zone */
 	chunk_sector = sector & (dev->zone_nr_sectors - 1);
 	if (chunk_sector + nr_sectors > dev->zone_nr_sectors)
 		dm_accept_partial_bio(bio, dev->zone_nr_sectors - chunk_sector);
 	/* Now ready to handle this BIO */
 	dmz_reclaim_bio_acc(dmz->reclaim);
 	dmz_queue_chunk_work(dmz, bio);
 	return DM_MAPIO_SUBMITTED;
 }
 /*
 * Completed target BIO processing.
 */
 static int dmz_end_io(struct dm_target *ti, struct bio *bio, blk_status_t *error)
 {
 	struct dmz_bioctx *bioctx = dm_per_bio_data(bio, sizeof(struct dmz_bioctx));
 	if (bioctx->status == BLK_STS_OK && *error)
 		bioctx->status = *error;
 	if (!atomic_dec_and_test(&bioctx->ref))
 		return DM_ENDIO_INCOMPLETE;
 	/* Done */
 	bio->bi_status = bioctx->status;
 	if (bioctx->zone) {
 		struct dm_zone *zone = bioctx->zone;
 		if (*error && bio_op(bio) == REQ_OP_WRITE) {
 			if (dmz_is_seq(zone))
 				set_bit(DMZ_SEQ_WRITE_ERR, &zone->flags);
 		}
 		dmz_deactivate_zone(zone);
 	}
 	return DM_ENDIO_DONE;
 }
 /*
 * Get zoned device information.
 */
 static int dmz_get_zoned_device(struct dm_target *ti, char *path)
 {
 	struct dmz_target *dmz = ti->private;
 	struct request_queue *q;
 	struct dmz_dev *dev;
 	int ret;
 	/* Get the target device */
 	ret = dm_get_device(ti, path, dm_table_get_mode(ti->table), &dmz->ddev);
 	if (ret) {
 		ti->error = "Get target device failed";
 		dmz->ddev = NULL;
 		return ret;
 	}
 	dev = kzalloc(sizeof(struct dmz_dev), GFP_KERNEL);
 	if (!dev) {
 		ret = -ENOMEM;
 		goto err;
 	}
 	dev->bdev = dmz->ddev->bdev;
 	(void)bdevname(dev->bdev, dev->name);
 	if (bdev_zoned_model(dev->bdev) == BLK_ZONED_NONE) {
 		ti->error = "Not a zoned block device";
 		ret = -EINVAL;
 		goto err;
 	}
 	dev->capacity = i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
 	if (ti->begin || (ti->len != dev->capacity)) {
 		ti->error = "Partial mapping not supported";
 		ret = -EINVAL;
 		goto err;
 	}
 	q = bdev_get_queue(dev->bdev);
 	dev->zone_nr_sectors = q->limits.chunk_sectors;
 	dev->zone_nr_sectors_shift = ilog2(dev->zone_nr_sectors);
 	dev->zone_nr_blocks = dmz_sect2blk(dev->zone_nr_sectors);
 	dev->zone_nr_blocks_shift = ilog2(dev->zone_nr_blocks);
 	dev->nr_zones = (dev->capacity + dev->zone_nr_sectors - 1)
 		>> dev->zone_nr_sectors_shift;
 	dmz->dev = dev;
 	return 0;
 err:
 	dm_put_device(ti, dmz->ddev);
 	kfree(dev);
 	return ret;
 }
 /*
 * Cleanup zoned device information.
 */
 static void dmz_put_zoned_device(struct dm_target *ti)
 {
 	struct dmz_target *dmz = ti->private;
 	dm_put_device(ti, dmz->ddev);
 	kfree(dmz->dev);
 	dmz->dev = NULL;
 }
 /*
 * Setup target.
 */
 static int dmz_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 {
 	struct dmz_target *dmz;
 	struct dmz_dev *dev;
 	int ret;
 	/* Check arguments */
 	if (argc != 1) {
 		ti->error = "Invalid argument count";
 		return -EINVAL;
 	}
 	/* Allocate and initialize the target descriptor */
 	dmz = kzalloc(sizeof(struct dmz_target), GFP_KERNEL);
 	if (!dmz) {
 		ti->error = "Unable to allocate the zoned target descriptor";
 		return -ENOMEM;
 	}
 	ti->private = dmz;
 	/* Get the target zoned block device */
 	ret = dmz_get_zoned_device(ti, argv[0]);
 	if (ret) {
 		dmz->ddev = NULL;
 		goto err;
 	}
 	/* Initialize metadata */
 	dev = dmz->dev;
 	ret = dmz_ctr_metadata(dev, &dmz->metadata);
 	if (ret) {
 		ti->error = "Metadata initialization failed";
 		goto err_dev;
 	}
 	/* Set target (no write same support) */
 	ti->max_io_len = dev->zone_nr_sectors << 9;
 	ti->num_flush_bios = 1;
 	ti->num_discard_bios = 1;
 	ti->num_write_zeroes_bios = 1;
 	ti->per_io_data_size = sizeof(struct dmz_bioctx);
 	ti->flush_supported = true;
 	ti->discards_supported = true;
 	ti->split_discard_bios = true;
 	/* The exposed capacity is the number of chunks that can be mapped */
 	ti->len = (sector_t)dmz_nr_chunks(dmz->metadata) << dev->zone_nr_sectors_shift;
 	/* Zone BIO */
 	dmz->bio_set = bioset_create(DMZ_MIN_BIOS, 0, 0);
 	if (!dmz->bio_set) {
 		ti->error = "Create BIO set failed";
 		ret = -ENOMEM;
 		goto err_meta;
 	}
 	/* Chunk BIO work */
 	mutex_init(&dmz->chunk_lock);
 	INIT_RADIX_TREE(&dmz->chunk_rxtree, GFP_NOFS);
 	dmz->chunk_wq = alloc_workqueue("dmz_cwq_%s", WQ_MEM_RECLAIM | WQ_UNBOUND,
 					0, dev->name);
 	if (!dmz->chunk_wq) {
 		ti->error = "Create chunk workqueue failed";
 		ret = -ENOMEM;
 		goto err_bio;
 	}
 	/* Flush work */
 	spin_lock_init(&dmz->flush_lock);
 	bio_list_init(&dmz->flush_list);
 	INIT_DELAYED_WORK(&dmz->flush_work, dmz_flush_work);
 	dmz->flush_wq = alloc_ordered_workqueue("dmz_fwq_%s", WQ_MEM_RECLAIM,
 						dev->name);
 	if (!dmz->flush_wq) {
 		ti->error = "Create flush workqueue failed";
 		ret = -ENOMEM;
 		goto err_cwq;
 	}
 	mod_delayed_work(dmz->flush_wq, &dmz->flush_work, DMZ_FLUSH_PERIOD);
 	/* Initialize reclaim */
 	ret = dmz_ctr_reclaim(dev, dmz->metadata, &dmz->reclaim);
 	if (ret) {
 		ti->error = "Zone reclaim initialization failed";
 		goto err_fwq;
 	}
 	dmz_dev_info(dev, "Target device: %llu 512-byte logical sectors (%llu blocks)",
 		     (unsigned long long)ti->len,
 		     (unsigned long long)dmz_sect2blk(ti->len));
 	return 0;
 err_fwq:
 	destroy_workqueue(dmz->flush_wq);
 err_cwq:
 	destroy_workqueue(dmz->chunk_wq);
 err_bio:
 	bioset_free(dmz->bio_set);
 err_meta:
 	dmz_dtr_metadata(dmz->metadata);
 err_dev:
 	dmz_put_zoned_device(ti);
 err:
 	kfree(dmz);
 	return ret;
 }
 /*
 * Cleanup target.
 */
 static void dmz_dtr(struct dm_target *ti)
 {
 	struct dmz_target *dmz = ti->private;
 	flush_workqueue(dmz->chunk_wq);
 	destroy_workqueue(dmz->chunk_wq);
 	dmz_dtr_reclaim(dmz->reclaim);
 	cancel_delayed_work_sync(&dmz->flush_work);
 	destroy_workqueue(dmz->flush_wq);
 	(void) dmz_flush_metadata(dmz->metadata);
 	dmz_dtr_metadata(dmz->metadata);
 	bioset_free(dmz->bio_set);
 	dmz_put_zoned_device(ti);
 	kfree(dmz);
 }
 /*
 * Setup target request queue limits.
 */
 static void dmz_io_hints(struct dm_target *ti, struct queue_limits *limits)
 {
 	struct dmz_target *dmz = ti->private;
 	unsigned int chunk_sectors = dmz->dev->zone_nr_sectors;
 	limits->logical_block_size = DMZ_BLOCK_SIZE;
 	limits->physical_block_size = DMZ_BLOCK_SIZE;
 	blk_limits_io_min(limits, DMZ_BLOCK_SIZE);
 	blk_limits_io_opt(limits, DMZ_BLOCK_SIZE);
 	limits->discard_alignment = DMZ_BLOCK_SIZE;
 	limits->discard_granularity = DMZ_BLOCK_SIZE;
 	limits->max_discard_sectors = chunk_sectors;
 	limits->max_hw_discard_sectors = chunk_sectors;
 	limits->max_write_zeroes_sectors = chunk_sectors;
 	/* FS hint to try to align to the device zone size */
 	limits->chunk_sectors = chunk_sectors;
 	limits->max_sectors = chunk_sectors;
 	/* We are exposing a drive-managed zoned block device */
 	limits->zoned = BLK_ZONED_NONE;
 }
 /*
 * Pass on ioctl to the backend device.
 */
 static int dmz_prepare_ioctl(struct dm_target *ti,
 			     struct block_device **bdev, fmode_t *mode)
 {
 	struct dmz_target *dmz = ti->private;
 	*bdev = dmz->dev->bdev;
 	return 0;
 }
 /*
 * Stop work processing on suspend.
 */
 static void dmz_suspend(struct dm_target *ti)
 {
 	struct dmz_target *dmz = ti->private;
 	flush_workqueue(dmz->chunk_wq);
 	dmz_suspend_reclaim(dmz->reclaim);
 	cancel_delayed_work_sync(&dmz->flush_work);
 }
 /*
 * Restart works on resume or if suspend failed.
 */
 static void dmz_resume(struct dm_target *ti)
 {
 	struct dmz_target *dmz = ti->private;
 	queue_delayed_work(dmz->flush_wq, &dmz->flush_work, DMZ_FLUSH_PERIOD);
 	dmz_resume_reclaim(dmz->reclaim);
 }
 static int dmz_iterate_devices(struct dm_target *ti,
 			       iterate_devices_callout_fn fn, void *data)
 {
 	struct dmz_target *dmz = ti->private;
 	return fn(ti, dmz->ddev, 0, dmz->dev->capacity, data);
 }
 static struct target_type dmz_type = {
 	.name		 = "zoned",
 	.version	 = {1, 0, 0},
 	.features	 = DM_TARGET_SINGLETON | DM_TARGET_ZONED_HM,
 	.module		 = THIS_MODULE,
 	.ctr		 = dmz_ctr,
 	.dtr		 = dmz_dtr,
 	.map		 = dmz_map,
 	.end_io		 = dmz_end_io,
 	.io_hints	 = dmz_io_hints,
 	.prepare_ioctl	 = dmz_prepare_ioctl,
 	.postsuspend	 = dmz_suspend,
 	.resume		 = dmz_resume,
 	.iterate_devices = dmz_iterate_devices,
 };
 static int __init dmz_init(void)
 {
 	return dm_register_target(&dmz_type);
 }
 static void __exit dmz_exit(void)
 {
 	dm_unregister_target(&dmz_type);
 }
 module_init(dmz_init);
 module_exit(dmz_exit);
 MODULE_DESCRIPTION(DM_NAME " target for zoned block devices");
 MODULE_AUTHOR("Damien Le Moal <damien.lemoal@wdc.com>");
 MODULE_LICENSE("GPL");
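
The write path selection in dmz_handle_write() and the zone boundary split in dmz_map() can be illustrated outside the kernel. The following is a minimal userspace sketch, not the code of this patch: struct sketch_zone, sketch_write_path() and sketch_accepted_sectors() are hypothetical names introduced only for illustration, and the zone size is assumed to be a power of two number of sectors, as the mask arithmetic in dmz_map() implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct dm_zone fields used by the write path. */
struct sketch_zone {
	bool is_rnd;           /* true for a random (conventional) zone */
	unsigned int wp_block; /* write pointer, in blocks from the zone start */
};

enum sketch_path { SKETCH_DIRECT, SKETCH_BUFFERED };

/*
 * Same test as in dmz_handle_write(): write in place if the zone is a
 * random zone or if the write starts exactly at the write pointer.
 * Any other write to a sequential zone goes through a buffer zone.
 */
static enum sketch_path sketch_write_path(const struct sketch_zone *zone,
					  unsigned int chunk_block)
{
	if (zone->is_rnd || chunk_block == zone->wp_block)
		return SKETCH_DIRECT;
	return SKETCH_BUFFERED;
}

/*
 * Same arithmetic as the split in dmz_map(): return the number of
 * sectors of a BIO that fit in the zone containing its first sector.
 * zone_nr_sectors is assumed to be a power of two.
 */
static uint64_t sketch_accepted_sectors(uint64_t sector, uint64_t nr_sectors,
					uint64_t zone_nr_sectors)
{
	uint64_t chunk_sector;

	assert(zone_nr_sectors && !(zone_nr_sectors & (zone_nr_sectors - 1)));

	/* Offset of the first sector within its zone */
	chunk_sector = sector & (zone_nr_sectors - 1);
	if (chunk_sector + nr_sectors > zone_nr_sectors)
		return zone_nr_sectors - chunk_sector;
	return nr_sectors;
}
```

With 256 MB zones (524288 sectors), a 16-sector BIO starting 8 sectors before a zone boundary would be accepted for 8 sectors only. A write to a sequential zone is direct only when it starts at the write pointer; a write behind or ahead of the write pointer takes the buffered path.
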
--- a/drivers/md/dm-zoned.h
+++ b/drivers/md/dm-zoned.h
@@ -0,0 +1,228 @@
 /*
 * Copyright (C) 2017 Western Digital Corporation or its affiliates.
 *
 * This file is released under the GPL.
 */
 #ifndef DM_ZONED_H
 #define DM_ZONED_H
 #include <linux/types.h>
 #include <linux/blkdev.h>
 #include <linux/device-mapper.h>
 #include <linux/dm-kcopyd.h>
 #include <linux/list.h>
 #include <linux/spinlock.h>
 #include <linux/mutex.h>
 #include <linux/workqueue.h>
 #include <linux/rwsem.h>
 #include <linux/rbtree.h>
 #include <linux/radix-tree.h>
 #include <linux/shrinker.h>
 /*
 * dm-zoned creates block devices with 4KB blocks, always.
 */
 #define DMZ_BLOCK_SHIFT		12
 #define DMZ_BLOCK_SIZE		(1 << DMZ_BLOCK_SHIFT)
 #define DMZ_BLOCK_MASK		(DMZ_BLOCK_SIZE - 1)
 #define DMZ_BLOCK_SHIFT_BITS	(DMZ_BLOCK_SHIFT + 3)
 #define DMZ_BLOCK_SIZE_BITS	(1 << DMZ_BLOCK_SHIFT_BITS)
 #define DMZ_BLOCK_MASK_BITS	(DMZ_BLOCK_SIZE_BITS - 1)
 #define DMZ_BLOCK_SECTORS_SHIFT	(DMZ_BLOCK_SHIFT - SECTOR_SHIFT)
 #define DMZ_BLOCK_SECTORS	(DMZ_BLOCK_SIZE >> SECTOR_SHIFT)
 #define DMZ_BLOCK_SECTORS_MASK	(DMZ_BLOCK_SECTORS - 1)
 /*
 * 4KB block <-> 512B sector conversion.
 */
 #define dmz_blk2sect(b)		((sector_t)(b) << DMZ_BLOCK_SECTORS_SHIFT)
 #define dmz_sect2blk(s)		((sector_t)(s) >> DMZ_BLOCK_SECTORS_SHIFT)
 #define dmz_bio_block(bio)	dmz_sect2blk((bio)->bi_iter.bi_sector)
 #define dmz_bio_blocks(bio)	dmz_sect2blk(bio_sectors(bio))
 /*
 * Zoned block device information.
 */
 struct dmz_dev {
 	struct block_device	*bdev;
 	char			name[BDEVNAME_SIZE];
 	sector_t		capacity;
 	unsigned int		nr_zones;
 	sector_t		zone_nr_sectors;
 	unsigned int		zone_nr_sectors_shift;
 	sector_t		zone_nr_blocks;
 	sector_t		zone_nr_blocks_shift;
 };
 #define dmz_bio_chunk(dev, bio)	((bio)->bi_iter.bi_sector >> \
 				 (dev)->zone_nr_sectors_shift)
 #define dmz_chunk_block(dev, b)	((b) & ((dev)->zone_nr_blocks - 1))
 /*
 * Zone descriptor.
 */
 struct dm_zone {
 	/* For listing the zone depending on its state */
 	struct list_head	link;
 	/* Zone type and state */
 	unsigned long		flags;
 	/* Zone activation reference count */
 	atomic_t		refcount;
 	/* Zone write pointer block (relative to the zone start block) */
 	unsigned int		wp_block;
 	/* Zone weight (number of valid blocks in the zone) */
 	unsigned int		weight;
 	/* The chunk that the zone maps */
 	unsigned int		chunk;
 	/*
 	 * For a sequential data zone, pointer to the random zone
 	 * used as a buffer for processing unaligned writes.
 	 * For a buffer zone, this points back to the data zone.
 	 */
 	struct dm_zone		*bzone;
 };
 /*
 * Zone flags.
 */
 enum {
 	/* Zone write type */
 	DMZ_RND,
 	DMZ_SEQ,
 	/* Zone critical condition */
 	DMZ_OFFLINE,
 	DMZ_READ_ONLY,
 	/* How the zone is being used */
 	DMZ_META,
 	DMZ_DATA,
 	DMZ_BUF,
 	/* Zone internal state */
 	DMZ_ACTIVE,
 	DMZ_RECLAIM,
 	DMZ_SEQ_WRITE_ERR,
 };
 /*
 * Zone data accessors.
 */
 #define dmz_is_rnd(z)		test_bit(DMZ_RND, &(z)->flags)
 #define dmz_is_seq(z)		test_bit(DMZ_SEQ, &(z)->flags)
 #define dmz_is_empty(z)		((z)->wp_block == 0)
 #define dmz_is_offline(z)	test_bit(DMZ_OFFLINE, &(z)->flags)
 #define dmz_is_readonly(z)	test_bit(DMZ_READ_ONLY, &(z)->flags)
 #define dmz_is_active(z)	test_bit(DMZ_ACTIVE, &(z)->flags)
 #define dmz_in_reclaim(z)	test_bit(DMZ_RECLAIM, &(z)->flags)
 #define dmz_seq_write_err(z)	test_bit(DMZ_SEQ_WRITE_ERR, &(z)->flags)
 #define dmz_is_meta(z)		test_bit(DMZ_META, &(z)->flags)
 #define dmz_is_buf(z)		test_bit(DMZ_BUF, &(z)->flags)
 #define dmz_is_data(z)		test_bit(DMZ_DATA, &(z)->flags)
 #define dmz_weight(z)		((z)->weight)
 /*
 * Message functions.
 */
 #define dmz_dev_info(dev, format, args...)	\
 	DMINFO("(%s): " format, (dev)->name, ## args)
 #define dmz_dev_err(dev, format, args...)	\
 	DMERR("(%s): " format, (dev)->name, ## args)
 #define dmz_dev_warn(dev, format, args...)	\
 	DMWARN("(%s): " format, (dev)->name, ## args)
 #define dmz_dev_debug(dev, format, args...)	\
 	DMDEBUG("(%s): " format, (dev)->name, ## args)
 struct dmz_metadata;
 struct dmz_reclaim;
 /*
 * Functions defined in dm-zoned-metadata.c
 */
 int dmz_ctr_metadata(struct dmz_dev *dev, struct dmz_metadata **zmd);
 void dmz_dtr_metadata(struct dmz_metadata *zmd);
 int dmz_resume_metadata(struct dmz_metadata *zmd);
 void dmz_lock_map(struct dmz_metadata *zmd);
 void dmz_unlock_map(struct dmz_metadata *zmd);
 void dmz_lock_metadata(struct dmz_metadata *zmd);
 void dmz_unlock_metadata(struct dmz_metadata *zmd);
 void dmz_lock_flush(struct dmz_metadata *zmd);
 void dmz_unlock_flush(struct dmz_metadata *zmd);
 int dmz_flush_metadata(struct dmz_metadata *zmd);
 unsigned int dmz_id(struct dmz_metadata *zmd, struct dm_zone *zone);
 sector_t dmz_start_sect(struct dmz_metadata *zmd, struct dm_zone *zone);
 sector_t dmz_start_block(struct dmz_metadata *zmd, struct dm_zone *zone);
 unsigned int dmz_nr_chunks(struct dmz_metadata *zmd);
 #define DMZ_ALLOC_RND		0x01
 #define DMZ_ALLOC_RECLAIM	0x02
 struct dm_zone *dmz_alloc_zone(struct dmz_metadata *zmd, unsigned long flags);
 void dmz_free_zone(struct dmz_metadata *zmd, struct dm_zone *zone);
 void dmz_map_zone(struct dmz_metadata *zmd, struct dm_zone *zone,
 		  unsigned int chunk);
 void dmz_unmap_zone(struct dmz_metadata *zmd, struct dm_zone *zone);
 unsigned int dmz_nr_rnd_zones(struct dmz_metadata *zmd);
 unsigned int dmz_nr_unmap_rnd_zones(struct dmz_metadata *zmd);
 void dmz_activate_zone(struct dm_zone *zone);
 void dmz_deactivate_zone(struct dm_zone *zone);
 int dmz_lock_zone_reclaim(struct dm_zone *zone);
 void dmz_unlock_zone_reclaim(struct dm_zone *zone);
 struct dm_zone *dmz_get_zone_for_reclaim(struct dmz_metadata *zmd);
 struct dm_zone *dmz_get_chunk_mapping(struct dmz_metadata *zmd,
 				      unsigned int chunk, int op);
 void dmz_put_chunk_mapping(struct dmz_metadata *zmd, struct dm_zone *zone);
 struct dm_zone *dmz_get_chunk_buffer(struct dmz_metadata *zmd,
 				     struct dm_zone *dzone);
 int dmz_validate_blocks(struct dmz_metadata *zmd, struct dm_zone *zone,
 			sector_t chunk_block, unsigned int nr_blocks);
 int dmz_invalidate_blocks(struct dmz_metadata *zmd, struct dm_zone *zone,
 			  sector_t chunk_block, unsigned int nr_blocks);
 int dmz_block_valid(struct dmz_metadata *zmd, struct dm_zone *zone,
 		    sector_t chunk_block);
 int dmz_first_valid_block(struct dmz_metadata *zmd, struct dm_zone *zone,
 			  sector_t *chunk_block);
 int dmz_copy_valid_blocks(struct dmz_metadata *zmd, struct dm_zone *from_zone,
 			  struct dm_zone *to_zone);
 int dmz_merge_valid_blocks(struct dmz_metadata *zmd, struct dm_zone *from_zone,
 			   struct dm_zone *to_zone, sector_t chunk_block);
 /*
 * Functions defined in dm-zoned-reclaim.c
 */
 int dmz_ctr_reclaim(struct dmz_dev *dev, struct dmz_metadata *zmd,
 		    struct dmz_reclaim **zrc);
 void dmz_dtr_reclaim(struct dmz_reclaim *zrc);
 void dmz_suspend_reclaim(struct dmz_reclaim *zrc);
 void dmz_resume_reclaim(struct dmz_reclaim *zrc);
 void dmz_reclaim_bio_acc(struct dmz_reclaim *zrc);
 void dmz_schedule_reclaim(struct dmz_reclaim *zrc);
 #endif /* DM_ZONED_H */
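
The sector, block and chunk conversions defined in dm-zoned.h can be checked with a small userspace sketch. This is an illustration under stated assumptions, not part of the patch: the sketch_* names and SKETCH_* constants are hypothetical equivalents of dmz_blk2sect(), dmz_sect2blk(), dmz_bio_chunk() and dmz_chunk_block(), using 512-byte sectors, 4KB blocks and a power of two zone size.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT		9	/* 512-byte sectors */
#define SKETCH_BLOCK_SHIFT		12	/* 4KB blocks, as DMZ_BLOCK_SHIFT */
#define SKETCH_BLOCK_SECTORS_SHIFT	(SKETCH_BLOCK_SHIFT - SKETCH_SECTOR_SHIFT)

/* 4KB block to 512B sector, as dmz_blk2sect() */
static uint64_t sketch_blk2sect(uint64_t block)
{
	return block << SKETCH_BLOCK_SECTORS_SHIFT;
}

/* 512B sector to 4KB block, as dmz_sect2blk() */
static uint64_t sketch_sect2blk(uint64_t sector)
{
	return sector >> SKETCH_BLOCK_SECTORS_SHIFT;
}

/* Chunk number containing a sector, as dmz_bio_chunk() */
static uint64_t sketch_chunk(uint64_t sector, unsigned int zone_nr_sectors_shift)
{
	return sector >> zone_nr_sectors_shift;
}

/* Block offset within its chunk, as dmz_chunk_block() */
static uint64_t sketch_chunk_block(uint64_t block, uint64_t zone_nr_blocks)
{
	/* The mask only works for a power of two number of blocks per zone */
	assert(zone_nr_blocks && !(zone_nr_blocks & (zone_nr_blocks - 1)));
	return block & (zone_nr_blocks - 1);
}
```

For 256 MB zones, a zone holds 524288 sectors (shift of 19) or 65536 blocks. Sector 1572944 then falls in chunk 3, at block 10 of that chunk, which is 80 sectors past the chunk start.
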