Граф коммитов

513 Коммитов

Автор SHA1 Сообщение Дата
Dan Williams 991d9020f3 libnvdimm, namespace: lift single pmem limit in scan_labels()
Now that the rest of the infrastructure has been converted to handle
multi-pmem configurations, lift the artificial barrier at scan time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:22:53 -07:00
Dan Williams c969e24c1b libnvdimm, namespace: filter out of range labels in scan_labels()
Short-circuit doomed-to-fail label validation attempts by skipping
labels that are outside the given region.  For example a DIMM that has
multiple PMEM regions will waste time attempting to create namespaces
only to find that the interleave-set-cookie does not validate, e.g.:

    nd_region region6: invalid cookie in label: 73e608dc-47b9-4b2a-b5c7-2d55a32e0c2

Similar to how we skip BLK labels when performing PMEM validation we can
skip out-of-range labels early.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:22:53 -07:00
Dan Williams 762d067dba libnvdimm, namespace: enable allocation of multiple pmem namespaces
Now that we have nd_region_available_dpa() able to handle the presence
of multiple PMEM allocations in aliased PMEM regions, reuse that same
infrastructure to track allocations from free space.  In particular
handle allocating from an aliased PMEM region in the case where there
are dis-contiguous holes.  The allocation for BLK and PMEM are
documented in the space_valid() helper:

    BLK-space is valid as long as it does not precede a PMEM
    allocation in a given region. PMEM-space must be contiguous
    and adjacent to an existing existing allocation (if one
    exists).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:22:53 -07:00
Dan Williams 16660eaea0 libnvdimm, namespace: update label implementation for multi-pmem
Instead of assuming that there will only ever be one allocated range at
the start of the region, account for additional namespaces that might
start at an offset from the region base.

After this change pmem namespaces now have a reason to carry an array of
resources similar to blk.  Unifying the resource tracking infrastructure
in nd_namespace_common is a future cleanup candidate.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:22:53 -07:00
Dan Williams 012207334a libnvdimm, namespace: expand pmem device naming scheme for multi-pmem
pmem devices are currently named /dev/pmem<region-index>. Preserve the
naming of the 0th device, but add a ".<namespace-index>" for other
devices.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:22:53 -07:00
Dan Williams a1f3e4d6a0 libnvdimm, region: update nd_region_available_dpa() for multi-pmem support
The free dpa (dimm-physical-address) space calculation reports how much
free space is available with consideration for aliased BLK + PMEM
regions.  Recall that BLK capacity is allocated from high addresses and
PMEM is allocated from low addresses in their respective regions.

nd_region_available_dpa() accounts for the fact that the largest
encroachment (lowest starting address) into PMEM capacity by a BLK
allocation limits the available capacity to that point, regardless if
there is BLK allocation hole at a higher address.  Similarly, for the
multi-pmem case we need to track the largest encroachment (highest
 ending address) of a PMEM allocation in BLK capacity regardless of
whether there is an allocation hole that a BLK allocation could fill at
a lower address.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:20:53 -07:00
Dan Williams 6ff3e912d3 libnvdimm, namespace: sort namespaces by dpa at init
Add more determinism to initial namespace device-name assignments by
sorting the namespaces by starting dpa.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:20:53 -07:00
Dan Williams 0e3b0d123c libnvdimm, namespace: allow multiple pmem-namespaces per region at scan time
If label scanning finds multiple valid pmem namespaces allow them to be
surfaced rather than fail namespace scanning. Support for creating
multiple namespaces per region is saved for a later patch.

Note that this adds some new error messages to clarify which of the pmem
namespaces in the set are potentially impacted by invalid labels.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07 09:20:53 -07:00
Dan Williams 8a5f50d3b7 libnvdimm, namespace: unify blk and pmem label scanning
In preparation for allowing multiple namespace per pmem region, unify
blk and pmem label scanning.  Given that blk regions already support
multiple namespaces, teaching that path how to do pmem namespace
scanning is an incremental step towards multiple pmem namespace support.
This should be functionally equivalent to the previous state in that
stops after finding the first valid pmem label set.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-05 20:24:18 -07:00
Dan Williams f95b4bca9e libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() helper
The ability to translate a generic struct device pointer into a
namespace uuid is a useful utility as we go to unify the blk and pmem
label scanning paths.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-05 20:24:18 -07:00
Dan Williams ae8219f186 libnvdimm, label: convert label tracking to a linked list
In preparation for enabling multiple namespaces per pmem region, convert
the label tracking to use a linked list.  In particular this will allow
select_pmem_id() to move labels from the unvalidated state to the
validated state.  Currently we only track one validated set per-region.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-30 19:13:42 -07:00
Dan Williams 44c462eb9e libnvdimm, region: move region-mapping input-paramters to nd_mapping_desc
Before we add more libnvdimm-private fields to nd_mapping make it clear
which parameters are input vs libnvdimm internals. Use struct
nd_mapping_desc instead of struct nd_mapping in nd_region_desc and make
struct nd_mapping private to libnvdimm.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-30 19:13:42 -07:00
Dave Jiang db58028ee4 nvdimm: reduce duplicated wpq flushes
Existing implemenetation writes to all the flush hint addresses for a
given ND region. This is not necessary as the flushes are per imc and
not per DIMM. Search the mappings and clear out the duplicates at init
to avoid multiple flush to the same imc.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-30 19:12:52 -07:00
Vishal Verma e046114af5 libnvdimm: clear the internal poison_list when clearing badblocks
nvdimm_clear_poison cleared the user-visible badblocks, and sent
commands to the NVDIMM to clear the areas marked as 'poison', but it
neglected to clear the same areas from the internal poison_list which is
used to marshal ARS results before sorting them by namespace. As a
result, once on-demand ARS functionality was added:

37b137f nfit, libnvdimm: allow an ARS scrub to be triggered on demand

A scrub triggered from either sysfs or an MCE was found to be adding
stale entries that had been cleared from gendisk->badblocks, but were
still present in nvdimm_bus->poison_list. Additionally, the stale entries
could be triggered into producing stale disk->badblocks by simply disabling
and re-enabling the namespace or region.

This adds the missing step of clearing poison_list entries when clearing
poison, so that it is always in sync with badblocks.

Fixes: 37b137f ("nfit, libnvdimm: allow an ARS scrub to be triggered on demand")
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-30 17:03:45 -07:00
Vishal Verma bd697a80c3 pmem: reduce kmap_atomic sections to the memcpys only
pmem_do_bvec used to kmap_atomic at the begin, and only unmap at the
end. Things like nvdimm_clear_poison may want to do nvdimm subsystem
bookkeeping operations that may involve taking locks or doing memory
allocations, and we can't do that from the atomic context. Reduce the
atomic context to just what needs it - the memcpy to/from pmem.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-30 17:03:45 -07:00
Dan Williams 595c73071e libnvdimm, region: fix flush hint table thinko
The definition of the flush hint table as:

	void __iomem *flush_wpq[0][0];

...passed the unit test, but is broken as flush_wpq[0][1] and
flush_wpq[1][0] refer to the same entry.  Fix this to use a helper that
calculates a slot in the table based on the geometry of flush hints in
the region.  This is important to get right since virtualization
solutions use this mechanism to trigger hypervisor flushes to platform
persistence.

Reported-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-24 11:45:38 -07:00
Dave Jiang a0056afe21 nvdimm: remove duplicate nd_mapping declaration
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-21 15:28:29 -07:00
Dan Williams 4765218db7 libnvdimm, namespace: debug invalid interleave-set-cookie values
If platform firmware fails to populate unique / non-zero serial number
data for each nvdimm in an interleave-set it may cause pmem region
initialization to fail.  Add a debug message for this case.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-21 09:36:36 -07:00
Dan Williams ecfb6d8a04 libnvdimm: fix devm_nvdimm_memremap() error path
The internal alloc_nvdimm_map() helper might fail, particularly if the
memory region is already busy.  Report request_mem_region() failures and
check for the failure.

Reported-by: Ryan Chen <ryan.chan105@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-21 09:35:15 -07:00
Oliver O'Halloran 480b6837aa nvdimm: fix PHYS_PFN/PFN_PHYS mixup
nd_activate_region() iomaps any hint addresses required when activating
a region. To prevent duplicate mappings it checks the PFN of the hint to
be mapped against the PFNs of the already mapped hints. Unfortunately it
doesn't convert the PFN back into a physical address before passing it
to devm_nvdimm_ioremap(). Instead it applies PHYS_PFN a second time
which ends about as well as you would imagine.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-19 08:54:27 -07:00
Dave Jiang 1e8b8d9619 libnvdimm: allow legacy (e820) pmem region to clear bad blocks
Bad blocks can be injected via /sys/block/pmemN/badblocks. In a situation
where legacy pmem is being used or a pmem region created by using memmap
kernel parameter, the injected bad blocks are not cleared due to
nvdimm_clear_poison() failing from lack of ndctl function pointer. In
this case we need to just return as handled and allow the bad blocks to
be cleared rather than fail.

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-09 17:34:46 -07:00
Toshi Kani aee6598748 libnvdimm: Fix nvdimm_probe error on NVDIMM-N
'ndctl list --buses --dimms' does not list any NVDIMM-Ns since
they are considered as idle.  ndctl checks if any driver is
attached to nmem device.  nvdimm_probe() always fails in
nvdimm_init_nsarea() since NVDIMM-Ns do not implement optinal
ND_CMD_GET_CONFIG_DATA command.

Change nvdimm_probe() to accept the case that the CONFIG_DATA
command is not implemented for NVDIMM-Ns.  The driver attaches
without ndd, which keeps it no-op to the device.

Reported-by: Brian Boylston <brian.boylston@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-01 18:20:39 -07:00
Geert Uytterhoeven ae551e9ca2 nvdimm: Spelling s/unacknoweldged/unacknowledged/
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-01 18:20:39 -07:00
Dan Williams ba9c8dd3c2 acpi, nfit: add dimm device notification support
Per "ACPI 6.1 Section 9.20.3" NVDIMM devices, children of the ACPI0012
NVDIMM Root device, can receive health event notifications.

Given that these devices are precluded from registering a notification
handler via acpi_driver.acpi_device_ops (due to no _HID), we use
acpi_install_notify_handler() directly.  The registered handler,
acpi_nvdimm_notify(), triggers a poll(2) event on the nmemX/nfit/flags
sysfs attribute when a health event notification is received.

Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-08-29 14:55:17 -07:00
Vishal Verma abe8b4e3ce nvdimm, btt: add a size attribute for BTTs
To be consistent with other namespaces, expose a 'size' attribute for
BTT devices also.

Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: Linda Knippers <linda.knippers@hpe.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-08-08 09:26:14 -07:00
Jens Axboe 1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Jens Axboe c11f0c0b5b block/mm: make bdev_ops->rw_page() take a bool for read/write
Commit abf545484d changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.

Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Mike Christie abf545484d mm/block: convert rw_page users to bio op use
The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a8 ("block, fs, drivers: remove REQ_OP compat defs and related code")

Modified by me to:

1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:25:33 -06:00
Linus Torvalds f0c98ebc57 libnvdimm for 4.8
1/ Replace pcommit with ADR / directed-flushing:
    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement either
    ADR, or provide one or more flush addresses per nvdimm. ADR
    (Asynchronous DRAM Refresh) flushes data in posted write buffers to the
    memory controller on a power-fail event. Flush addresses are defined in
    ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
    "Flush Hint Address Structure". A flush hint is an mmio address that
    when written and fenced assures that all previous posted writes
    targeting a given dimm have been flushed to media.
 
 2/ On-demand ARS (address range scrub):
    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices.  When latent errors are detected we re-scrub the media
    to refresh the bad block list, userspace can also request a re-scrub at
    any time.
 
 3/ Support for the Microsoft DSM (device specific method) command format.
 
 4/ Support for EDK2/OVMF virtual disk device memory ranges.
 
 5/ Various fixes and cleanups across the subsystem.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ
 NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY
 ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h
 trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG
 KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu
 qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP
 WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7
 41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA
 DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl
 b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC
 6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV
 cN3edFVIdxvZeMgM5Ubq
 =xCBG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:

 - Replace pcommit with ADR / directed-flushing.

   The pcommit instruction, which has not shipped on any product, is
   deprecated.  Instead, the requirement is that platforms implement
   either ADR, or provide one or more flush addresses per nvdimm.

   ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
   to the memory controller on a power-fail event.

   Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
   Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
   A flush hint is an mmio address that when written and fenced assures
   that all previous posted writes targeting a given dimm have been
   flushed to media.

 - On-demand ARS (address range scrub).

   Linux uses the results of the ACPI ARS commands to track bad blocks
   in pmem devices.  When latent errors are detected we re-scrub the
   media to refresh the bad block list, userspace can also request a
   re-scrub at any time.

 - Support for the Microsoft DSM (device specific method) command
   format.

 - Support for EDK2/OVMF virtual disk device memory ranges.

 - Various fixes and cleanups across the subsystem.

* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
  libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
  nfit: do an ARS scrub on hitting a latent media error
  nfit: move to nfit/ sub-directory
  nfit, libnvdimm: allow an ARS scrub to be triggered on demand
  libnvdimm: register nvdimm_bus devices with an nd_bus driver
  pmem: clarify a debug print in pmem_clear_poison
  x86/insn: remove pcommit
  Revert "KVM: x86: add pcommit support"
  nfit, tools/testing/nvdimm/: unify shutdown paths
  libnvdimm: move ->module to struct nvdimm_bus_descriptor
  nfit: cleanup acpi_nfit_init calling convention
  nfit: fix _FIT evaluation memory leak + use after free
  tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
  tools/testing/nvdimm: add virtual ramdisk range
  acpi, nfit: treat virtual ramdisk SPA as pmem region
  pmem: kill __pmem address space
  pmem: kill wmb_pmem()
  libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
  fs/dax: remove wmb_pmem()
  libnvdimm, pmem: flush posted-write queues on shutdown
  ...
2016-07-28 17:38:16 -07:00
Linus Torvalds 3fc9d69093 Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This branch also contains core changes.  I've come to the conclusion
  that from 4.9 and forward, I'll be doing just a single branch.  We
  often have dependencies between core and drivers, and it's hard to
  always split them up appropriately without pulling core into drivers
  when that happens.

  That said, this contains:

   - separate secure erase type for the core block layer, from
     Christoph.

   - set of discard fixes, from Christoph.

   - bio shrinking fixes from Christoph, as a followup up to the
     op/flags change in the core branch.

   - map and append request fixes from Christoph.

   - NVMeF (NVMe over Fabrics) code from Christoph.  This is pretty
     exciting!

   - nvme-loop fixes from Arnd.

   - removal of ->driverfs_dev from Dan, after providing a
     device_add_disk() helper.

   - bcache fixes from Bhaktipriya and Yijing.

   - cdrom subchannel read fix from Vchannaiah.

   - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

   - set of drbd updates and fixes from Fabian, Lars, and Philipp.

   - mg_disk error path fix from Bart.

   - user notification for failed device add for loop, from Minfei.

   - NVMe in general:
        + NVMe delay quirk from Guilherme.
        + SR-IOV support and command retry limits from Keith.
        + fix for memory-less NUMA node from Masayoshi.
        + use UINT_MAX for discard sectors, from Minfei.
        + cancel IO fixes from Ming.
        + don't allocate unused major, from Neil.
        + error code fixup from Dan.
        + use constants for PSDT/FUSE from James.
        + variable init fix from Jay.
        + fabrics fixes from Ming, Sagi, and Wei.
        + various fixes"

* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
  nvme/pci: Provide SR-IOV support
  nvme: initialize variable before logical OR'ing it
  block: unexport various bio mapping helpers
  scsi/osd: open code blk_make_request
  target: stop using blk_make_request
  block: simplify and export blk_rq_append_bio
  block: ensure bios return from blk_get_request are properly initialized
  virtio_blk: use blk_rq_map_kern
  memstick: don't allow REQ_TYPE_BLOCK_PC requests
  block: shrink bio size again
  block: simplify and cleanup bvec pool handling
  block: get rid of bio_rw and READA
  block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
  block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
  NVMe: don't allocate unused nvme_major
  nvme: avoid crashes when node 0 is memoryless node.
  nvme: Limit command retries
  loop: Make user notify for adding loop device failed
  nvme-loop: fix nvme-loop Kconfig dependencies
  nvmet: fix return value check in nvmet_subsys_alloc()
  ...
2016-07-26 15:37:51 -07:00
Linus Torvalds d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Dan Williams 0606263f24 Merge branch 'for-4.8/libnvdimm' into libnvdimm-for-next 2016-07-24 08:05:44 -07:00
Markus Elfring d4c5725d57 libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
The __nd_device_register() function tests whether its argument is NULL
and then returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-24 08:04:55 -07:00
Vishal Verma 37b137ff8c nfit, libnvdimm: allow an ARS scrub to be triggered on demand
Normally, an ARS (Address Range Scrub) only happens at
boot/initialization time. There can however arise situations where a
bus-wide rescan is needed - notably, in the case of discovering a latent
media error, we should do a full rescan to figure out what other sectors
are bad, and thus potentially avoid triggering an mce on them in the
future. Also provide a sysfs trigger to start a bus-wide scrub.

Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-23 21:51:42 -07:00
Dan Williams 18515942d6 libnvdimm: register nvdimm_bus devices with an nd_bus driver
A recent effort to add a new nvdimm bus provider attribute highlighted a
race between interrogating nvdimm_bus->nd_desc and nvdimm_bus tear down.
The typical way to handle these races is to take the device_lock() in
the attribute method and validate that the device is still active.  In
order for a device to be 'active' it needs to be associated with a
driver.  So, we create the small boilerplate for a driver and register
nvdimm_bus devices on the 'nvdimm_bus_type' bus.

A result of this change is that ndbusX devices now appear under
/sys/bus/nd/devices.  In fact this makes /sys/class/nd somewhat
redundant, but removing that will need to take a long deprecation period
given its use by ndctl binaries in the field.

This change naturally pulls code from drivers/nvdimm/core.c to
drivers/nvdimm/bus.c, so it is a nice code organization clean-up as
well.

Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-23 11:06:33 -07:00
Vishal Verma 5bf0b6e1af pmem: clarify a debug print in pmem_clear_poison
Prefix the sector number being cleared with a '0x' to make it clear that
this is a hex value.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-23 11:06:33 -07:00
Dan Williams bc9775d869 libnvdimm: move ->module to struct nvdimm_bus_descriptor
Let the provider module be explicitly passed in rather than implicitly
assumed by the module that calls nvdimm_bus_register().  This is in
preparation for unifying the nfit and nfit_test driver teardown paths.

Reviewed-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-21 20:03:19 -07:00
Toshi Kani 163d4baaeb block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Currently, presence of direct_access() in block_device_operations
indicates support of DAX on its block device.  Because
block_device_operations is instantiated with 'const', this DAX
capablity may not be enabled conditinally.

In preparation for supporting DAX to device-mapper devices, add
QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
support.  This will allow to set the DAX capability based on how
mapped device is composed.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 21:01:01 -06:00
Dan Williams 7a9eb20666 pmem: kill __pmem address space
The __pmem address space was meant to annotate codepaths that touch
persistent memory and need to coordinate a call to wmb_pmem().  Now that
wmb_pmem() is gone, there is little need to keep this annotation.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-12 19:25:38 -07:00
Dan Williams 91131dbd1d libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
nsio_rw_bytes() is used to write info block metadata to the namespace,
so it should trigger a flush after every write.  Replace wmb_pmem() with
nvdimm_flush() in this path.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-12 15:13:48 -07:00
Dan Williams 476f848aae libnvdimm, pmem: flush posted-write queues on shutdown
Commit writes to media on system shutdown or pmem driver unload.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-12 15:13:48 -07:00
Dan Williams 7e267a8c79 libnvdimm, pmem: use REQ_FUA, REQ_FLUSH for nvdimm_flush()
Given that nvdimm_flush() has higher overhead than wmb_pmem() (pointer
chasing through nd_region), and that we otherwise assume a platform has
ADR capability when flush hints are not present, move nvdimm_flush() to
REQ_FLUSH context.

Note that we still arrange for nvdimm_flush() to be called even in the
ADR case. We need at least once wmb() fence to push buffered writes in
the cpu out to the ADR protected domain.

Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-11 16:16:03 -07:00
Dan Williams 0c27af60d1 libnvdimm: cycle flush hints
When the NFIT provides multiple flush hint addresses per-dimm it is
expressing that the platform is capable of processing multiple flush
requests in parallel.  There is some fixed cost per flush request, let
the cost be shared in parallel on multiple cpus.

Since there may not be enough flush hint addresses for each cpu to have
one, keep a per-cpu index of the last used hint, hash it with current
pid, and assume that access pattern and scheduler randomness will keep
the flush-hint usage somewhat staggered across cpus.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-11 16:16:03 -07:00
Dan Williams f284a4f237 libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()
nvdimm_flush() is a replacement for the x86 'pcommit' instruction.  It is
an optional write flushing mechanism that an nvdimm bus can provide for
the pmem driver to consume.  In the case of the NFIT nvdimm-bus-provider
nvdimm_flush() is implemented as a series of flush-hint-address [1]
writes to each dimm in the interleave set (region) that backs the
namespace.

The nvdimm_has_flush() routine relies on platform firmware to describe
the flushing capabilities of a platform.  It uses the heuristic of
whether an nvdimm bus provider provides flush address data to return a
ternary result:

      1: flush addresses defined
      0: dimm topology described without flush addresses (assume ADR)
 -errno: no topology information, unable to determine flush mechanism

The pmem driver is expected to take the following actions on this ternary
result:

      1: nvdimm_flush() in response to REQ_FUA / REQ_FLUSH and shutdown
      0: do not set, WC or FUA on the queue, take no further action
 -errno: warn and then operate as if nvdimm_has_flush() returned '0'

The caveat of this heuristic is that it can not distinguish the "dimm
does not have flush address" case from the "platform firmware is broken
and failed to describe a flush address".  Given we are already
explicitly trusting the NFIT there's not much more we can do beyond
blacklisting broken firmwares if they are ever encountered.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-11 16:13:42 -07:00
Dan Williams a8f720224e libnvdimm: keep region data alive over namespace removal
nd_region device driver data will be used in the namespace i/o path.
Re-order nd_region_remove() to ensure this data stays live across
namespace device removal

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-11 16:13:41 -07:00
Dan Williams e5ae3b252c libnvdimm, nfit: move flush hint mapping to region-device driver-data
In preparation for triggering flushes of a DIMM's writes-posted-queue
(WPQ) via the pmem driver move mapping of flush hint addresses to the
region driver.  Since this uses devm_nvdimm_memremap() the flush
addresses will remain mapped while any region to which the dimm belongs
is active.

We need to communicate more information to the nvdimm core to facilitate
this mapping, namely each dimm object now carries an array of flush hint
address resources.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-11 15:09:26 -07:00
Dan Williams a8a6d2e04c libnvdimm, nfit: remove nfit_spa_map() infrastructure
Now that all shared mappings are handled by devm_nvdimm_memremap() we no
longer need nfit_spa_map() nor do we need to trigger a callback to the
bus provider at region disable time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-11 15:09:26 -07:00
Dan Williams 29b9aa0aa3 libnvdimm: introduce devm_nvdimm_memremap(), convert nfit_spa_map() users
In preparation for generically mapping flush hint addresses for both the
BLK and PMEM use case, provide a generic / reference counted mapping
api.  Given the fact that a dimm may belong to multiple regions (PMEM
and BLK), the flush hint addresses need to be held valid as long as any
region associated with the dimm is active.  This is similar to the
existing BLK-region case where multiple BLK-regions may share an
aperture mapping.  Up-level this shared / reference-counted mapping
capability from the nfit driver to a core nvdimm capability.

This eliminates the need for the nd_blk_region.disable() callback.  Note
that the removal of nfit_spa_map() and related infrastructure is
deferred to a later patch.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-07 17:11:09 -07:00
Johannes Thumshirn 8729bdea82 libnvdimm: initialize struct blk_integrity with 0
Initialize struct blk_integrity with 0 as blk_integrity_register() takes the
then unitialized struct blk_integrity::flags and ORs it to the resulting block
integrity structure.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-06 10:34:42 -07:00
Dan Williams 52c44d93c2 block: remove ->driverfs_dev
Now that all drivers that specify a ->driverfs_dev have been converted
to device_add_disk(), the pointer can be removed from struct gendisk.

Cc: Jens Axboe <axboe@fb.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-27 12:26:08 -07:00
Dan Williams 0d52c756a6 block: convert to device_add_disk()
For block drivers that specify a parent device, convert them to use
device_add_disk().

This conversion was done with the following semantic patch:

    @@
    struct gendisk *disk;
    expression E;
    @@

    - disk->driverfs_dev = E;
    ...
    - add_disk(disk);
    + device_add_disk(E, disk);

    @@
    struct gendisk *disk;
    expression E1, E2;
    @@

    - disk->driverfs_dev = E1;
    ...
    E2 = disk;
    ...
    - add_disk(E2);
    + device_add_disk(E1, E2);

...plus some manual fixups for a few missed conversions.

Cc: Jens Axboe <axboe@fb.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-27 12:26:08 -07:00
Dan Williams f295e53b60 libnvdimm, pmem: allow nfit_test to override pmem_direct_access()
Currently phys_to_pfn_t() is an exported symbol to allow nfit_test to
override it and indicate that nfit_test-pmem is not device-mapped.  Now,
we want to enable nfit_test to operate without DMA_CMA and the pmem it
provides will no longer be physically contiguous, i.e. won't be capable
of supporting direct_access requests larger than a page.  Make
pmem_direct_access() a weak symbol so that it can be replaced by the
tools/testing/nvdimm/ version, and move phys_to_pfn_t() to a static
inline now that it no longer needs to be overridden.

Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-24 11:39:29 -07:00
Dan Williams 1ee6667cd8 libnvdimm, pfn, dax: fix initialization vs autodetect for mode + alignment
The updated ndctl unit tests discovered that if a pfn configuration with
a 4K alignment is read from the namespace, that alignment will be
ignored in favor of the default 2M alignment.  The result is that the
configuration will fail initialization with a message like:

    dax6.1: bad offset: 0x22000 dax disabled align: 0x200000

Fix this by allowing the alignment read from the info block to override
the default which is 2M not 0 in the autodetect path.  This also fixes a
similar problem with the mode and alignment settings silently being
overwritten by the kernel when userspace has changed it.  We now will
either overwrite the info block if userspace changes the uuid or fail
and warn if a live setting disagrees with the info block.

Cc: <stable@vger.kernel.org>
Cc: Micah Parrish <micah.parrish@hpe.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-23 17:50:39 -07:00
Dan Williams 4258895814 libnvdimm: IS_ERR() usage cleanup
Prompted by commit 287980e49f "remove lots of IS_ERR_VALUE abuses", I
ran make coccicheck against drivers/nvdimm/ and found that:

	if (IS_ERR(x))
		return PTR_ERR(x);
	return 0;

...can be replaced with PTR_ERR_OR_ZERO().

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-17 16:23:23 -07:00
Dan Williams f02716db95 libnvdimm: use devm_add_action_or_reset()
Clean up needless calls to the action routine by letting
devm_add_action_or_reset() call it automatically.  This does cause the
disk to registered and immediately unregistered when a memory allocation
fails, but the block layer should be prepared for such an event.

Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-15 14:59:17 -07:00
Linus Torvalds 315227f6da DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on
   any device. This enables the use of DAX in the presence of these
   errors by making all sector-aligned zeroing go through the driver.
 - The driver (already) has the ability to clear errors on writes that
   are sent through the block layer using 'DSMs' defined in ACPI 6.1.
 
 Other misc changes:
 
 - When mounting DAX filesystems, check to make sure the partition
   is page aligned. This is a requirement for DAX, and previously, we
   allowed such unaligned mounts to succeed, but subsequent reads/writes
   would fail.
 
 - Misc/cleanup fixes from Jan that remove unused code from DAX related to
   zeroing, writeback, and some size checks.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8
 15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1
 +yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr
 5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2
 dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j
 GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5
 cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L
 lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z
 aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25
 pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS
 vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo
 9R6JdrAj0Sc+FBa+cGzH
 =v6Of
 -----END PGP SIGNATURE-----

Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull misc DAX updates from Vishal Verma:
 "DAX error handling for 4.7

   - Until now, dax has been disabled if media errors were found on any
     device.  This enables the use of DAX in the presence of these
     errors by making all sector-aligned zeroing go through the driver.

   - The driver (already) has the ability to clear errors on writes that
     are sent through the block layer using 'DSMs' defined in ACPI 6.1.

  Other misc changes:

   - When mounting DAX filesystems, check to make sure the partition is
     page aligned.  This is a requirement for DAX, and previously, we
     allowed such unaligned mounts to succeed, but subsequent
     reads/writes would fail.

   - Misc/cleanup fixes from Jan that remove unused code from DAX
     related to zeroing, writeback, and some size checks"

* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: fix a comment in dax_zero_page_range and dax_truncate_page
  dax: for truncate/hole-punch, do zeroing through the driver if possible
  dax: export a low-level __dax_zero_page_range helper
  dax: use sb_issue_zerout instead of calling dax_clear_sectors
  dax: enable dax in the presence of known media errors (badblocks)
  dax: fallback from pmd to pte on error
  block: Update blkdev_dax_capable() for consistency
  xfs: Add alignment check for DAX mount
  ext2: Add alignment check for DAX mount
  ext4: Add alignment check for DAX mount
  block: Add bdev_dax_supported() for dax mount checks
  block: Add vfs_msg() interface
  dax: Remove redundant inode size checks
  dax: Remove pointless writeback from dax_do_io()
  dax: Remove zeroing from dax_io()
  dax: Remove dead zeroing code from fault handlers
  ext2: Avoid DAX zeroing to corrupt data
  ext2: Fix block zeroing in ext2_get_blocks() for DAX
  dax: Remove complete_unwritten argument
  DAX: move RADIX_DAX_ definitions to dax.c
2016-05-26 19:34:26 -07:00
Dan Williams 36092ee8ba Merge branch 'for-4.7/dax' into libnvdimm-for-next 2016-05-21 12:33:04 -07:00
Dan Williams 03dca343af libnvdimm, dax: fix deletion
The ndctl unit tests discovered that the dax enabling omitted updates to
nd_detach_and_reset().  This routine clears device the configuration
when the namespace is detached.  Without this clearing userspace may
assume that the device is in the process of being configured by another
agent in the system.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-21 12:22:41 -07:00
Dan Williams 5e24c9fd36 libnvdimm, dax: fix alignment validation
Testing the dax-device autodetect support revealed a probe failure with
the following result:

    dax0.1: bad offset: 0x8200000 dax disabled

The original pfn-device implementation inferred the alignment from
ilog2(offset), now that the alignment is explicit the is_power_of_2()
needs replacing with a real sanity check against the recorded alignment.
Otherwise the alignment check is useless in the implicit case and only
the minimum size of the offset matters.

This self-consistency check is further validated by the probe path that
will re-check that the offset is large enough to contain all the
metadata required to enable the device.

Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-21 11:01:41 -07:00
Dan Williams c5ed926864 libnvdimm, dax: autodetect support
For autodetecting a previously established dax configuration we need the
info block to indicate block-device vs device-dax mode, and we need to
have the default namespace probe hand-off the configuration to the
dax_pmem driver.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-20 22:02:57 -07:00
Dan Williams b354aba016 libnvdimm: release ida resources
ida instances allocate some internal memory for ->free_bitmap in
addition to the base 'struct ida'.  Use ida_destroy() to release that
memory at module_exit().

Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-20 22:02:56 -07:00
Dan Williams 0a70bd4305 dax: enable dax in the presence of known media errors (badblocks)
1/ If a mapping overlaps a bad sector fail the request.

2/ Do not opportunistically report more dax-capable capacity than is
   requested when errors present.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18 12:16:56 -06:00
Dan Williams 1f716d05f8 Merge branch 'for-4.7/dsm' into libnvdimm-for-next 2016-05-18 10:06:59 -07:00
Dan Williams 2159669f58 Merge branch 'for-4.7/libnvdimm' into libnvdimm-for-next 2016-05-18 10:06:48 -07:00
Dan Williams 594d6d96ea Merge branch 'for-4.7/dax' into libnvdimm-for-next 2016-05-18 09:59:34 -07:00
Dan Williams 6cf9c5babd libnvdimm: stop requiring a driver ->remove() method
The dax_pmem driver was implementing an empty ->remove() method to
satisfy the nvdimm bus driver that unconditionally calls ->remove().
Teach the core bus driver to check if ->remove() is NULL to remove that
requirement.

Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-18 09:13:13 -07:00
Dan Williams 45a0dac045 libnvdimm, dax: record the specified alignment of a dax-device instance
We want to use the alignment as the allocation and mapping unit.
Previously this information was only useful for establishing the data
offset, but now it is important to remember the granularity for the
later use.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-09 15:35:42 -07:00
Dan Williams 52ac23b25e libnvdimm, dax: reserve space to store labels for device-dax
We may want to subdivide a device-dax range into multiple devices so
that each can have separate permissions or naming.  Reserve 128K of
label space by default so we have the capability of making allocation
decisions persistent.  This reservation is not something we can add
later since it would result in the default size of a device-dax range
changing between kernel versions.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-09 15:35:42 -07:00
Dan Williams cd03412a51 libnvdimm, dax: introduce device-dax infrastructure
Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX).  It allows persistent memory ranges to be allocated and
mapped without need of an intervening file system.  This initial
infrastructure arranges for a libnvdimm pfn-device to be represented as
a different device-type so that it can be attached to a driver other
than the pmem driver.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-09 15:35:42 -07:00
Dan Williams 1b8d2afde5 libnvdimm, pfn: fix ARCH=alpha allmodconfig build failure
I had relied on the kbuild robot for cross build coverage, however it
only builds alpha_defconfig.  Switch from HPAGE_SIZE to PMD_SIZE, which
is more widely defined.

Fixes: 658922e57b ("libnvdimm, pfn: fix memmap reservation sizing")
Cc: <stable@vger.kernel.org>
Reported-by: Guenter Roeck <guenter@roeck-us.net>
Tested-by: Guenter Roeck <guenter@roeck-us.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-06 10:20:10 -07:00
Dan Williams 658922e57b libnvdimm, pfn: fix memmap reservation sizing
When configuring a pfn-device instance to allocate the memmap array it
needs to account for the fact that vmemmap_populate_hugepages()
allocates struct page blocks in HPAGE_SIZE chunks.  We need to align the
reserved area size to 2MB otherwise arch_add_memory() runs out of memory
while establishing the memmap:

 WARNING: CPU: 0 PID: 496 at arch/x86/mm/init_64.c:704 arch_add_memory+0xe7/0xf0
 [..]
 Call Trace:
  [<ffffffff8148bdb3>] dump_stack+0x85/0xc2
  [<ffffffff810a749b>] __warn+0xcb/0xf0
  [<ffffffff810a75cd>] warn_slowpath_null+0x1d/0x20
  [<ffffffff8106a497>] arch_add_memory+0xe7/0xf0
  [<ffffffff811d2097>] devm_memremap_pages+0x287/0x450
  [<ffffffff811d1ffa>] ? devm_memremap_pages+0x1ea/0x450
  [<ffffffffa0000298>] __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
  [<ffffffffa0047a58>] pmem_attach_disk+0x318/0x420 [nd_pmem]
  [<ffffffffa0047bcf>] nd_pmem_probe+0x6f/0x90 [nd_pmem]
  [<ffffffffa0009469>] nvdimm_bus_probe+0x69/0x110 [libnvdimm]
 [..]
  ndbus0: nd_pmem.probe(pfn3.0) = -12
 nd_pmem: probe of pfn3.0 failed with error -12
libndctl: ndctl_pfn_enable: pfn3.0: failed to enable

Reported-by: Namratha Kothapalli <namratha.n.kothapalli@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-30 13:07:06 -07:00
Dan Williams 31eca76ba2 nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism
There are currently 4 known similar but incompatible definitions of the
command sets that can be sent to an NVDIMM through ACPI.  It is also
clear that future platform generations (ACPI or not) will continue to
revise and extend the DIMM command set as new devices and use cases
arrive.

It is obviously untenable to continue to proliferate divergence
of these command definitions, and to that end a standardization process
has begun to provide for a unified specification.  However, that leaves a
problem about what to do with this first generation where vendors are
already shipping divergence.

The Linux kernel can support these initial diverged platforms without
giving platform-firmware free reign to continue to diverge and compound
kernel maintenance overhead.  The kernel implementation can encourage
standardization in two ways:

1/ Require that any function code that userspace wants to send be
   explicitly white-listed in the implementation.  For ACPI this means
   function codes marked as supported by acpi_check_dsm() may
   only be invoked if they appear in the white-list.  A function must be
   publicly documented before it is added to the white-list.

2/ The above restrictions can be trivially bypassed by using the
   "vendor-specific" payload command.  However, since vendor-specific
   commands are by definition not publicly documented and have the
   potential to corrupt the kernel's view of the dimm state, we provide a
   toggle to disable vendor-specific operations.  Enabling undefined
   behavior is a policy decision that can be made by the platform owner
   and encourages firmware implementations to choose public over
   private command implementations.

Based on an initial patch from Jerry Hoemann
Cc: Jerry Hoemann <jerry.hoemann@hpe.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-28 16:59:06 -07:00
Dan Williams e3654eca70 nfit, libnvdimm: clarify "commands" vs "_DSMs"
Clarify the distinction between "commands", the ioctls userspace calls
to request the kernel take some action on a given dimm device, and
"_DSMs", the actual function numbers used in the firmware interface to
the DIMM.  _DSMs are ACPI specific whereas commands are Linux kernel
generic.

This is in preparation for breaking the 1:1 implicit relationship
between the kernel ioctl number space and the firmware specific function
numbers.

Cc: Jerry Hoemann <jerry.hoemann@hpe.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-28 16:23:16 -07:00
Dan Williams 0bfb8dd3ed libnvdimm: cleanup nvdimm_namespace_common_probe(), kill 'host'
The 'host' variable can be killed as it is always the same as the passed
in device.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:24 -07:00
Dan Williams 5a92289f41 libnvdimm, pmem: kill ->pmem_queue and ->pmem_disk
The devm conversion obviates the need to continue to remember the queue
and disk locally in the driver.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:24 -07:00
Dan Williams ac515c084b libnvdimm, pmem, pfn: move pfn setup to the core
Now that pmem internals have been disentangled from pfn setup, that code
can move to the core.  This is in preparation for adding another user of
the pfn-device capabilities.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams 200c79da82 libnvdimm, pmem, pfn: make pmem_rw_bytes generic and refactor pfn setup
In preparation for providing an alternative (to block device) access
mechanism to persistent memory, convert pmem_rw_bytes() to
nsio_rw_bytes().  This allows ->rw_bytes() functionality without
requiring a 'struct pmem_device' to be instantiated.

In other words, when ->rw_bytes() is in use i/o is driven through
'struct nd_namespace_io', otherwise it is driven through 'struct
pmem_device' and the block layer.  This consolidates the disjoint calls
to devm_exit_badblocks() and devm_memunmap() into a common
devm_nsio_disable() and cleans up the init path to use a unified
pmem_attach_disk() implementation.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams 947df02d25 libnvdimm, pmem: clean up resource print / request
The leading '0x' in front of %pa is redundant, also we can just use %pR
to simplify the print statement.  The request parameters can be directly
taken from the resource as well.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams 030b99e39c libnvdimm, pmem: use devm_add_action to release bdev resources
Register a callback to clean up the request_queue and put the gendisk at
driver disable time.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams 9d90725ddc libnvdimm, blk: move i/o infrastructure to nd_namespace_blk
Consolidate the information for issuing i/o to a blk-namespace, and
eliminate some pointer chasing.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams 8378af17a4 libnvdimm, blk: quiet i/o error reporting
I/O errors events have the potential to be a high frequency and a log
message for each event can swamp the system.  This message is also
redundant with upper layer error reporting.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams bd842b8ca7 libnvdimm, pmem: use ->queuedata for driver private data
Save a pointer chase by storing the driver private data in the
request_queue rather than the gendisk.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams d44077a7cd libnvdimm, blk: use ->queuedata for driver private data
Save a pointer chase by storing the driver private data in the
request_queue rather than the gendisk.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:22 -07:00
Dan Williams d29cee120e libnvdimm, blk: use devm_add_action to release bdev resources
Register a callback to clean up the request_queue and put the gendisk at
driver disable time.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:22 -07:00
Dan Williams 9dec4892ca libnvdimm, btt: add btt startup debug
Report the reason for btt probe failures when debug is enabled.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:05 -07:00
Dan Williams e32bc729a3 libnvdimm, btt, convert nd_btt_probe() to devm
Pass the device performing the probe so we can use a devm allocation for
the btt superblock.

Cc: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 10:59:54 -07:00
Dan Williams bd032943b5 libnvdimm, pfn, convert nd_pfn_probe() to devm
Pass the device performing the probe so we can use a devm allocation for
the pfn superblock.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 10:59:54 -07:00
Dan Williams 298f2bc5db libnvdimm, pmem: kill pmem->ndns
We can derive the common namespace from other information.  We also do
not need to cache it because all the usages are in slow paths.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 10:59:54 -07:00
Dan Williams 0a370d261c libnvdimm, pmem: clarify the write+clear_poison+write flow
The ACPI specification does not specify the state of data after a clear
poison operation.  Potential future libnvdimm bus implementations for
other architectures also might not specify or disagree on the state of
data after clear poison.  Clarify why we write twice.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
2016-04-15 14:59:41 -06:00
Dan Williams baa51277cf libnvdimm, test: add mock SMART data payload
Provide simulated SMART data to enable the ndctl implementation of SMART
data retrieval and parsing.

The payload is defined here, "Section 4.1 SMART and Health Info
(Function Index 1)":

    http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-11 11:11:14 -07:00
Linus Torvalds 239467e852 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "Three fixes, the first two are tagged for -stable:

   - The ndctl utility/library gained expanded unit tests illuminating a
     long standing bug in the libnvdimm SMART data retrieval
     implementation.

     It has been broken since its initial implementation, now fixed.

   - Another one line fix for the detection of stale info blocks.

     Without this change userspace can get into a situation where it is
     unable to reconfigure a namespace.

   - Fix the badblock initialization path in the presence of the new (in
     v4.6-rc1) section alignment workarounds.

     Without this change badblocks will be reported at the wrong offset.

  These have received a build success report from the kbuild robot and
  have appeared in -next with no reported issues"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm, pfn: fix nvdimm_namespace_add_poison() vs section alignment
  libnvdimm, pfn: fix uuid validation
  libnvdimm: fix smart data retrieval
2016-04-09 14:05:45 -07:00
Dan Williams a390180291 libnvdimm, pfn: fix nvdimm_namespace_add_poison() vs section alignment
When section alignment padding is in effect we need to shift / truncate
the range that is queried for poison by the 'start_pad' or 'end_trunc'
reservations.

It's easiest if we just pass in an adjusted resource range rather than
deriving it from the passed in namespace.  With the resource range
resolution pushed out to the caller we can also push the
namespace-to-region lookup to the caller and drop the implicit pmem-type
assumption about the passed in namespace object.

Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-07 20:02:06 -07:00
Dan Williams e5670563f5 libnvdimm, pfn: fix uuid validation
If we detect a namespace has a stale info block in the init path, we
should overwrite with the latest configuration.  In fact, we already
return -ENODEV when the parent uuid is invalid, the same should be done
for the 'self' uuid.  Otherwise we can get into a condition where
userspace is unable to reconfigure the pfn-device without directly /
manually invalidating the info block.

Cc: <stable@vger.kernel.org>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-07 19:59:27 -07:00
Dan Williams 2112911266 libnvdimm: fix smart data retrieval
It appears that smart data retrieval has been broken the since the
initial implementation.  Fix the payload size to be 128-bytes per the
specification.

Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-07 19:58:44 -07:00
Kirill A. Shutemov 09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Dan Williams fc0c202813 x86, pmem: use memcpy_mcsafe() for memcpy_from_pmem()
Update the definition of memcpy_from_pmem() to return 0 or a negative
error code.  Implement x86/arch_memcpy_from_pmem() with memcpy_mcsafe().

Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-28 17:19:31 -07:00
Linus Torvalds de06dbfa78 Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm
Pull ARM updates from Russell King:
 "Another mixture of changes this time around:

   - Split XIP linker file from main linker file to make it more
     maintainable, and various XIP fixes, and clean up a resulting
     macro.

   - Decompressor cleanups from Masahiro Yamada

   - Avoid printing an error for a missing L2 cache

   - Remove some duplicated symbols in System.map, and move
     vectors/stubs back into kernel VMA

   - Various low priority fixes from Arnd

   - Updates to allow bus match functions to return negative errno
     values, touching some drivers and the driver core.  Greg has acked
     these changes.

   - Virtualisation platform udpates form Jean-Philippe Brucker.

   - Security enhancements from Kees Cook

   - Rework some Kconfig dependencies and move PSCI idle management code
     out of arch/arm into drivers/firmware/psci.c

   - ARM DMA mapping updates, touching media, acked by Mauro.

   - Fix places in ARM code which should be using virt_to_idmap() so
     that Keystone2 can work.

   - Fix Marvell Tauros2 to work again with non-DT boots.

   - Provide a delay timer for ARM Orion platforms"

* 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (45 commits)
  ARM: 8546/1: dma-mapping: refactor to fix coherent+cma+gfp=0
  ARM: 8547/1: dma-mapping: store buffer information
  ARM: 8543/1: decompressor: rename suffix_y to compress-y
  ARM: 8542/1: decompressor: merge piggy.*.S and simplify Makefile
  ARM: 8541/1: decompressor: drop redundant FORCE in Makefile
  ARM: 8540/1: decompressor: use clean-files instead of extra-y to clean files
  ARM: 8539/1: decompressor: drop more unneeded assignments to "targets"
  ARM: 8538/1: decompressor: drop unneeded assignments to "targets"
  ARM: 8532/1: uncompress: mark putc as inline
  ARM: 8531/1: turn init_new_context into an inline function
  ARM: 8530/1: remove VIRT_TO_BUS
  ARM: 8537/1: drop unused DEBUG_RODATA from XIP_KERNEL
  ARM: 8536/1: mm: hide __start_rodata_section_aligned for non-debug builds
  ARM: 8535/1: mm: DEBUG_RODATA makes no sense with XIP_KERNEL
  ARM: 8534/1: virt: fix hyp-stub build for pre-ARMv7 CPUs
  ARM: make the physical-relative calculation more obvious
  ARM: 8512/1: proc-v7.S: Adjust stack address when XIP_KERNEL
  ARM: 8411/1: Add default SPARSEMEM settings
  ARM: 8503/1: clk_register_clkdev: remove format string interface
  ARM: 8529/1: remove 'i' and 'zi' targets
  ...
2016-03-19 16:31:54 -07:00
Dan Williams 489011652a Merge branch 'for-4.6/pfn' into libnvdimm-for-next 2016-03-09 17:15:43 -08:00
Dan Williams 59e6473980 libnvdimm, pmem: clear poison on write
If a write is directed at a known bad block perform the following:

1/ write the data

2/ send a clear poison command

3/ invalidate the poison out of the cache hierarchy

Cc: <x86@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 15:15:32 -08:00
Dan Williams b5ebc8ec69 libnvdimm, pmem: fix kmap_atomic() leak in error path
When we enounter a bad block we need to kunmap_atomic() before
returning.

Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 15:12:41 -08:00
NeilBrown ff8e92d5d9 nvdimm/btt: don't allocate unused major device number
alloc_disk(0) does not require or use a ->major number,
all devices are allocated with a major of BLOCK_EXT_MAJOR.

So don't allocate btt_major.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 15:00:24 -08:00
NeilBrown ec56151d38 nvdimm/blk: don't allocate unused major device number
When alloc_disk(0) is used ->major is completely ignored, all devices
are allocated with a "major" of BLOCK_EXT_MAJOR.

So don't allocate nd_blk_major

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 14:59:41 -08:00
NeilBrown 55155291b3 pmem: don't allocate unused major device number
When alloc_disk(0) or alloc_disk-node(0, XX) is used, the ->major
number is completely ignored:  all devices are allocated with a
major of BLOCK_EXT_MAJOR.

So there is no point allocating pmem_major.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:31:34 -08:00
Dan Williams 45f68802f2 libnvdimm, pmem: fix ia64 build, use PHYS_PFN
drivers/nvdimm/pmem.c: In function 'nvdimm_namespace_attach_pfn':
   drivers/nvdimm/pmem.c:367:3: error: implicit declaration of function
   	'__phys_to_pfn' [-Werror=implicit-function-declaration]
   .base_pfn = __phys_to_pfn(nsio->res.start),

ia64 does not provide __phys_to_pfn(), just use the PHYS_PFN() alias.

Cc: Guenter Roeck <linux@roeck-us.net>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-06 08:04:12 -08:00
Dan Williams d4f323672a nfit, libnvdimm: clear poison command support
Add the boiler-plate for a 'clear error' command based on section
9.20.7.6 "Function Index 4 - Clear Uncorrectable Error" from the ACPI
6.1 specification, and add a reference implementation in nfit_test.

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 18:06:14 -08:00
Dan Williams f6ed58c70d libnvdimm, pfn: 'resource'-address and 'size' attributes for pfn devices
Currenty with a raw mode pmem namespace the physical memory address range for
the device can be obtained via /sys/block/pmemX/device/{resource|size}.  Add
similar attributes for pfn instances that takes the struct page memmap and
section padding into account.

Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:25:45 -08:00
Dan Williams cfe30b8720 libnvdimm, pmem: adjust for section collisions with 'System RAM'
On a platform where 'Persistent Memory' and 'System RAM' are mixed
within a given sparsemem section, trim the namespace and notify about the
sub-optimal alignment.

Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:25:45 -08:00
Dan Williams d9cbe09d39 libnvdimm, pmem: fix 'pfn' support for section-misaligned namespaces
The altmap for a section-misaligned namespace needs to arrange for the
base_pfn to be section-aligned.  As a result the 'reserve' region (pfns
from base that do not have a struct page) must be increased.  Otherwise
we trip the altmap validation check in __add_pages:

	if (altmap->base_pfn != phys_start_pfn
			|| vmem_altmap_offset(altmap) > nr_pages) {
		pr_warn_once("memory add fail, invalid altmap\n");
		return -EINVAL;
	}

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:25:44 -08:00
Jerry Hoemann 07accfa9d1 libnvdimm: Fix security issue with DSM IOCTL.
Code attempts to prevent certain IOCTL DSM from being called
when device is opened read only.  This security feature can
be trivially overcome by changing the size portion of the
ioctl_command which isn't used.

Check only the _IOC_NR (i.e. the command).

Cc: <stable@vger.kernel.org>
Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:24:06 -08:00
Jerry Hoemann 4dc0e7be88 libnvdimm: Clean-up access mode check.
Change nd_ioctl and nvdimm_ioctl access mode check to use O_RDONLY.

Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:24:06 -08:00
Dan Williams 87bf572e19 nfit: disable userspace initiated ars during scrub
While the nfit driver is issuing address range scrub commands and
reaping the results do not permit an ars_start command issued from
userspace.  The scrub thread assumes that all ars completions are for
scrubs initiated by platform firmware at boot, or by the nfit driver.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:24:06 -08:00
Dan Williams 7ae0fa439f nfit, libnvdimm: async region scrub workqueue
Introduce a workqueue that will be used to run address range scrub
asynchronously with the rest of nvdimm device probing.

Userspace still wants notification when probing operations complete, so
introduce a new callback to flush this workqueue when userspace is
awaiting probe completion.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:24:06 -08:00
Dan Williams 719994660c libnvdimm: async notification support
In preparation for asynchronous address range scrub support add an
ability for the pmem driver to dynamically consume address range scrub
results.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:24:06 -08:00
Dan Williams 5faecf4eb0 libnvdimm: protect nvdimm_{bus|namespace}_add_poison() with nvdimm_bus_lock()
In preparation for making poison list retrieval asynchronus to region
registration, add protection for walking and mutating the bus-level
poison list.

Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:24:06 -08:00
Dan Williams aef2533822 libnvdimm, nfit: centralize command status translation
The return value from an 'ndctl_fn' reports the command execution
status, i.e. was the command properly formatted and was it successfully
submitted to the bus provider.  The new 'cmd_rc' parameter allows the bus
provider to communicate command specific results, translated into
common error codes.

Convert the ARS commands to this scheme to:

1/ Consolidate status reporting

2/ Prepare for for expanding ars unit test cases

3/ Make the implementation more generic

Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:24:06 -08:00
Russell King 1b3bf84797 Merge branches 'amba', 'fixes', 'misc' and 'tauros2' into for-next 2016-03-04 23:36:02 +00:00
Ingo Molnar bc94b99636 Linux 4.5-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJW0yM6AAoJEHm+PkMAQRiGeUwIAJRTHFPJTFpJcJjeZEV4/EL1
 7Pl0WSHs/CWBkXIevAg2HgkECSQ9NI9FAUFvoGxCldDpFAnL1U2QV8+Ur2qhiXMG
 5v0jILJuiw57qT/NfhEudZolerlRoHILmB3JRTb+DUV4GHZuWpTkJfUSI9j5aTEl
 w83XUgtK4bKeIyFbHdWQk6xqfzfFBSuEITuSXreOMwkFfMmeScE0WXOPLBZWyhPa
 v0rARJLYgM+vmRAnJjnG8unH+SgnqiNcn2oOFpevKwmpVcOjcEmeuxh/HdeZf7HM
 /R8F86OwdmXsO+z8dQxfcucLg+I9YmKfFr8b6hopu1sRztss2+Uk6H1j2J7IFIg=
 =tvkh
 -----END PGP SIGNATURE-----

Merge tag 'v4.5-rc6' into core/resources, to resolve conflict

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-04 12:12:08 +01:00
Arnd Bergmann c45442055d nvdimm: use 'u64' for pfn flags
A recent bugfix changed pfn_t to always be 64-bit wide, but did not
change the code in pmem.c, which is now broken on 32-bit architectures
as reported by gcc:

In file included from ../drivers/nvdimm/pmem.c:28:0:
drivers/nvdimm/pmem.c: In function 'pmem_alloc':
include/linux/pfn_t.h:15:17: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
 #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))

This changes the intermediate pfn_flags in struct pmem_device to
be 64 bit wide as well, so they can store the flags correctly.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: db78c22230 ("mm: fix pfn_t vs highmem")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-02-23 17:17:20 -08:00
Dan Williams 4577b06655 nfit: update address range scrub commands to the acpi 6.1 format
The original format of these commands from the "NVDIMM DSM Interface
Example" [1] are superseded by the ACPI 6.1 definition of the "NVDIMM Root
Device _DSMs" [2].

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
[2]: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
     "9.20.7 NVDIMM Root Device _DSMs"

Changes include:
1/ New 'restart' fields in ars_status, unfortunately these are
   implemented in the middle of the existing definition so this change
   is not backwards compatible.  The expectation is that shipping
   platforms will only ever support the ACPI 6.1 definition.

2/ New status values for ars_start ('busy') and ars_status ('overflow').

Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Linda Knippers <linda.knippers@hpe.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-02-23 17:17:20 -08:00
Dan Williams 747ffe11b4 libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing
Use the output length specified in the command to size the receive
buffer rather than the arbitrary 4K limit.

This bug was hiding the fact that the ndctl implementation of
ndctl_bus_cmd_new_ars_status() was not specifying an output buffer size.

Cc: <stable@vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-02-19 15:21:52 -08:00
Dan Williams 82ec2ba2b1 ARM: 8522/1: drivers: nvdimm: ensure no negative value gets returned on positive match
This patch ensures that existing bus match callbacks don't return
negative values (which might be interpreted as potential errors in the
future) in case of positive match.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2016-02-16 16:28:50 +00:00
Toshi Kani f0f4711aa1 x86, kexec, nvdimm: Use walk_iomem_res_desc() for iomem search
Change the callers of walk_iomem_res() scanning for the
following resources by name to use walk_iomem_res_desc()
instead.

 "ACPI Tables"
 "ACPI Non-volatile Storage"
 "Persistent Memory (legacy)"
 "Crash kernel"

Note, the caller of walk_iomem_res() with "GART" will be removed
in a later patch.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chun-Yi <joeyli.kernel@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Lee, Chun-Yi <joeyli.kernel@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Minfei Huang <mnfhuang@gmail.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Takao Indoh <indou.takao@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: kexec@lists.infradead.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/1453841853-11383-15-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 09:49:59 +01:00
Dan Williams 45eb570a0d libnvdimm, pfn: fix restoring memmap location
This path was missed when turning on the memmap in pmem support.  Permit
'pmem' as a valid location for the map.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-29 17:43:16 -08:00
Dan Williams 9c41242817 libnvdimm: fix mode determination for e820 devices
Correctly display "safe" mode when a btt is established on a e820/memmap
defined pmem namespace.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-26 09:40:32 -08:00
Dan Williams 5c2c2587b1 mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
get_dev_page() enables paths like get_user_pages() to pin a dynamically
mapped pfn-range (devm_memremap_pages()) while the resulting struct page
objects are in use.  Unlike get_page() it may fail if the device is, or
is in the process of being, disabled.  While the initial lookup of the
range may be an expensive list walk, the result is cached to speed up
subsequent lookups which are likely to be in the same mapped range.

devm_memremap_pages() now requires a reference counter to be specified
at init time.  For pmem this means moving request_queue allocation into
pmem_alloc() so the existing queue usage counter can track "device
pages".

ZONE_DEVICE pages always have an elevated count and will never be on an
lru reclaim list.  That space in 'struct page' can be redirected for
other uses, but for safety introduce a poison value that will always
trip __list_add() to assert.  This allows half of the struct list_head
storage to be reclaimed with some assurance to back up the assumption
that the page count never goes to zero and a list_add() is never
attempted.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 468ded03c0 libnvdimm, pmem: move request_queue allocation earlier in probe
Before the dynamically allocated struct pages from devm_memremap_pages()
can be put to use outside the driver, we need a mechanism to track
whether they are still in use at teardown.  Towards that goal reorder
the initialization sequence to allow the 'q_usage_counter' from the
request_queue to be used by the devm_memremap_pages() implementation (in
subsequent patches).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams d2c0f041e1 libnvdimm, pfn, pmem: allocate memmap array in persistent memory
Use the new vmem_altmap capability to enable the pmem driver to arrange
for a struct page memmap to be established in persistent memory.

[linux@roeck-us.net: mn10300: declare __pfn_to_phys() to fix build error]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 4b94ffdc41 x86, mm: introduce vmem_altmap to augment vmemmap_populate()
In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array.  The default vmemmap_populate()
allocates page table storage area from the page allocator.  Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'.  Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 9476df7d80 mm: introduce find_dev_pagemap()
There are several scenarios where we need to retrieve and update
metadata associated with a given devm_memremap_pages() mapping, and the
only lookup key available is a pfn in the range:

1/ We want to augment vmemmap_populate() (called via arch_add_memory())
   to allocate memmap storage from pre-allocated pages reserved by the
   device driver.  At vmemmap_alloc_block_buf() time it grabs device pages
   rather than page allocator pages.  This is in support of
   devm_memremap_pages() mappings where the memmap is too large to fit in
   main memory (i.e. large persistent memory devices).

2/ Taking a reference against the mapping when inserting device pages
   into the address_space radix of a given inode.  This facilitates
   unmap_mapping_range() and truncate_inode_pages() operations when the
   driver is tearing down the mapping.

3/ get_user_pages() operations on ZONE_DEVICE memory require taking a
   reference against the mapping so that the driver teardown path can
   revoke and drain usage of device pages.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 34c0fd540e mm, dax, pmem: introduce pfn_t
For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags.  These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory".  Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.

The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams 8b63b6bfc1 Merge branch 'for-4.5/block-dax' into for-4.5/libnvdimm 2016-01-10 07:53:55 -08:00
Dan Williams 710d69cc99 libnvdimm, pmem: nvdimm_read_bytes() badblocks support
Support badblock checking in all the pmem read paths that do not go
through the block layer.  This protects info block reads (btt or pfn) as
well as data reads to a pmem namespace via a btt instance.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 22:42:31 -08:00
Dan Williams 57f7f317ab pmem, dax: disable dax in the presence of bad blocks
Longer term teach dax to punch "error" holes in mapping requests and
deliver SIGBUS to applications that consume a bad pmem page.  For now,
simply disable the dax performance optimization in the presence of known
errors.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 22:42:31 -08:00
Dan Williams e10624f8c0 pmem: fail io-requests to known bad blocks
Check the sectors specified in a read bio to see if they hit a known bad
block, and return an error code pmem_do_bvec().

Note that the ->rw_page() is not in a position to return errors.  For
now, copy the same layering violation present in zram_rw_page() to avoid
crashes of the form:

 kernel BUG at mm/filemap.c:822!
 [..]
 Call Trace:
  [<ffffffff811c540e>] page_endio+0x1e/0x60
  [<ffffffff81290d29>] mpage_end_io+0x39/0x60
  [<ffffffff8141c4ef>] bio_endio+0x3f/0x60
  [<ffffffffa005c491>] pmem_make_request+0x111/0x230 [nd_pmem]

...i.e. unlock a page that was already unlocked via pmem_rw_page() =>
page_endio().

Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:04 -08:00
Dan Williams b95f5f4391 libnvdimm: convert to statically allocated badblocks
If a device will ever have badblocks it should always have a badblocks
instance available.  So, similar to md, embed a badblocks instance in
pmem_device.  This reduces pointer chasing in the i/o fast path, and
simplifies the init path.

Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:04 -08:00
Dan Williams 87ba05dff3 libnvdimm: don't fail init for full badblocks list
If the badblocks list runs out of space it simply means that software is
unable to intercept all errors.  This is no different than the latent
discovery of new badblocks case and should not be an initialization
failure condition.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:04 -08:00
Dan Williams ad9a8bde2c libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h
nd-core.h is private to the libnvdimm core internals and should not be
used by drivers.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:03 -08:00
Vishal Verma 0caeef63e6 libnvdimm: Add a poison list and export badblocks
During region creation, perform Address Range Scrubs (ARS) for the SPA
(System Physical Address) ranges to retrieve known poison locations from
firmware. Add a new data structure 'nd_poison' which is used as a list
in nvdimm_bus to store these poison locations.

When creating a pmem namespace, if there is any known poison associated
with its physical address space, convert the poison ranges to bad sectors
that are exposed using the badblocks interface.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:03 -08:00
Dan Williams e07ecd76d4 libnvdimm: fix namespace object confusion in is_uuid_busy()
When btt devices were re-worked to be child devices of regions this
routine was overlooked.  It mistakenly attempts to_nd_namespace_pmem()
or to_nd_namespace_blk() conversions on btt and pfn devices.  By luck to
date we have happened to be hitting valid memory leading to a uuid
miscompare, but a recent change to struct nd_namespace_common causes:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
 IP: [<ffffffff814610dc>] memcmp+0xc/0x40
 [..]
 Call Trace:
  [<ffffffffa0028631>] is_uuid_busy+0xc1/0x2a0 [libnvdimm]
  [<ffffffffa0028570>] ? to_nd_blk_region+0x50/0x50 [libnvdimm]
  [<ffffffff8158c9c0>] device_for_each_child+0x50/0x90

Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-05 18:37:23 -08:00
Dan Williams 0731de0dd9 libnvdimm, pfn: move 'memory mode' indication to sysfs
'Memory mode' is defined as the capability of a DAX mapping to be the
source/target of DMA and other "direct I/O" scenarios.  While it
currently requires allocating 'struct page' for each page frame of
persistent memory in the namespace it will not always be the case.  Work
continues on reducing the kernel's dependency on 'struct page'.

Let's not maintain a suffix that is expected to lose meaning over time.
In other words a future 'raw mode' pmem namespace may be as capable as
today's 'memory mode' namespace.  Undo the encoding of the mode in the
device name and leave it to other tooling to determine the mode of the
namespace from its attributes.

Reported-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-24 12:20:20 -08:00
Dan Williams 3fa9626865 libnvdimm, pfn: fix nd_pfn_validate() return value handling
The -ENODEV case indicates that the info-block needs to established.
All other return codes cause nd_pfn_init() to abort.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-24 12:20:19 -08:00
Dan Williams 2dc43331e3 libnvdimm, pfn: fix pfn seed creation
Similar to btt, plant a new pfn seed when the existing one is activated.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-13 11:41:36 -08:00
Dan Williams a34d5e8a6a libnvdimm, pfn: add parent uuid validation
Track and check the uuid of the namespace hosting a pfn instance.  This
forces the pfn info block to be invalidated if the namespace is
re-configured with a different uuid.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-13 11:40:20 -08:00
Dan Williams 315c562536 libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE
When setting aside capacity for struct page it must be aligned to the
largest mapping size that is to be made available via DAX.  Make the
alignment configurable to enable support for 1GiB page-size mappings.

The offset for PFN_MODE_RAM may now be larger than SZ_8K, so fixup the
offset check in nvdimm_namespace_attach_pfn().

Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-12 15:04:26 -08:00
Dan Williams f7c6ab80fa libnvdimm, pfn: clean up pfn create parameters
In all cases __nd_pfn_create is called with default parameters which are
then overridden by values in the info block.  Clean up pfn creation by
dropping the parameters and setting default values internal to
__nd_pfn_create.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-10 16:11:59 -08:00
Dan Williams 9f1e8cee77 libnvdimm, pfn: kill ND_PFN_ALIGN
The alignment constraint isn't necessary now that devm_memremap_pages()
allows for unaligned mappings.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-10 15:14:20 -08:00
Dmitry Krivenok 6bb691ac08 nvdimm: do not show pfn_seed for non pmem regions
This simple change hides pfn_seed attribute for non pmem
regions because they don't support pfn anyway.

Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-08 16:27:30 -08:00
Dmitry Krivenok bd26d0d0ce nvdimm: improve diagnosibility of namespaces
In order to bind namespace to the driver user must first
set all mandatory attributes in the following order:
- uuid
- size
- sector_size (for blk namespace only)

If the order is wrong, then user either won't be able to set
the attribute or bind the namespace.

This simple patch improves diagnosibility of common operations
with namespaces by printing some details about the error
instead of failing silently.

Below are examples of error messages (assuming dyndbg is
enabled for nvdimms):

[/]# echo 4194304 > /sys/bus/nd/devices/region5/namespace5.0/size
[  288.372612] nd namespace5.0: __size_store: uuid not set
[  288.374839] nd namespace5.0: size_store: 400000 fail (-6)
sh: write error: No such device or address
[/]#

[/]# echo namespace5.0 > /sys/bus/nd/drivers/nd_blk/bind
[  554.671648] nd_blk namespace5.0: nvdimm_namespace_common_probe: sector size not set
[  554.674688]  ndbus1: nd_blk.probe(namespace5.0) = -19
sh: write error: No such device
[/]#

Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-08 16:27:30 -08:00
Dan Williams 589e75d157 libnvdimm, pmem: fix size trim in pmem_direct_access()
This masking prevents access to the end of the device via dax_do_io(),
and is unnecessary as arch_add_memory() would have rejected an unaligned
allocation.

Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-11-12 09:55:23 -08:00
Dan Williams f7256dc0cd libnvdimm, e820: fix numa node for e820-type-12 pmem ranges
Rather than punt on the numa node for these e820 ranges try to find a
better answer with memory_add_physaddr_to_nid() when it is available.

Cc: <stable@vger.kernel.org>
Reported-by: Boaz Harrosh <boaz@plexistor.com>
Tested-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-11-12 09:21:18 -08:00
Linus Torvalds 3419b45039 Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block
Pull block IO poll support from Jens Axboe:
 "Various groups have been doing experimentation around IO polling for
  (really) fast devices.  The code has been reviewed and has been
  sitting on the side for a few releases, but this is now good enough
  for coordinated benchmarking and further experimentation.

  Currently O_DIRECT sync read/write are supported.  A framework is in
  the works that allows scalable stats tracking so we can auto-tune
  this.  And we'll add libaio support as well soon.  Fow now, it's an
  opt-in feature for test purposes"

* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
  direct-io: be sure to assign dio->bio_bdev for both paths
  directio: add block polling support
  NVMe: add blk polling support
  block: add block polling support
  blk-mq: return tag/queue combo in the make_request_fn handlers
  block: change ->make_request_fn() and users to return a queue cookie
2015-11-10 17:23:49 -08:00
Linus Torvalds 264015f8a8 libnvdimm for 4.4:
1/ Add support for the ACPI 6.0 NFIT hot add mechanism to process
    updates of the NFIT at runtime.
 
 2/ Teach the coredump implementation how to filter out DAX mappings.
 
 3/ Introduce NUMA hints for allocations made by the pmem driver, and as
    a side effect all devm allocations now hint their NUMA node by
    default.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWQX2sAAoJEB7SkWpmfYgCWsEQAK7w/xM9zClVY/DDlFJxFtYq
 DZJ4faPj+E3FMTiJIEDzjtRgQvOFE+wmJtntYsCqKH/QZmpnyk9jeT/CbJzEEL2k
 WsAk+qHGLcVUlSb36blwN1RFzYqC+IDYThewJqUvxDbOwL1AbiibbX7gplzZHLhW
 +rj3ScVlSNOPRDgGGpkAeLNNsttuKvsGo7nB/JZopm0tV6g14rSK09wQbVhv6S6T
 Lu7VGYqnJlkteL9YlzRiROf9hW2ZFCMGJz1YZydPTy3aX3hGTBX4w2qvmsPwBIKP
 kW/gCNisVJGk1cZCk4joSJ8i/b3x3fE0zdZ5waivJ5jDvYbUUfyk0KtJkfw207Rl
 14yWitUC6aeVuCeOqXHgsjRi+1QVN9Pg7i49xgGiUN1igQiUYRTgQPWZxDv6Zo/s
 USrLFQBaRd+hJw+dl7A47lJ3mUF96tPCoQb4LCQ7DVsg5U4J2TvqXLH9Gek/CCZ4
 QsMkZDTQlZw4+JEDlzBgg/L7xVty8DadplTADMdjaRhFU3y8zKNJ85Ileokt7KVt
 IsBT4+S5HeZLvinZY95932DwAmFp1DtsyENd1BUXL06ddyvlQrFJ6NQaXji4fuDc
 EVQmMoTAqDujZFupMAux9vkUBDFj/hmaVD5F7j3+MWP87OCritw/IZn+2LgTaKoX
 EmttaYrDr2jJwIaGyw+H
 =a2/L
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "Outside of the new ACPI-NFIT hot-add support this pull request is more
  notable for what it does not contain, than what it does.  There were a
  handful of development topics this cycle, dax get_user_pages, dax
  fsync, and raw block dax, that need more more iteration and will wait
  for 4.5.

  The patches to make devm and the pmem driver NUMA aware have been in
  -next for several weeks.  The hot-add support has not, but is
  contained to the NFIT driver and is passing unit tests.  The coredump
  support is straightforward and was looked over by Jeff.  All of it has
  received a 0day build success notification across 107 configs.

  Summary:

   - Add support for the ACPI 6.0 NFIT hot add mechanism to process
     updates of the NFIT at runtime.

   - Teach the coredump implementation how to filter out DAX mappings.

   - Introduce NUMA hints for allocations made by the pmem driver, and
     as a side effect all devm allocations now hint their NUMA node by
     default"

* tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  coredump: add DAX filtering for FDPIC ELF coredumps
  coredump: add DAX filtering for ELF coredumps
  acpi: nfit: Add support for hot-add
  nfit: in acpi_nfit_init, break on a 0-length table
  pmem, memremap: convert to numa aware allocations
  devm_memremap_pages: use numa_mem_id
  devm: make allocations numa aware by default
  devm_memremap: convert to return ERR_PTR
  devm_memunmap: use devres_release()
  pmem: kill memremap_pmem()
  x86, mm: quiet arch_add_memory()
2015-11-10 12:07:22 -08:00
Jens Axboe dece16353e block: change ->make_request_fn() and users to return a queue cookie
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:46 -07:00
Dan Williams 4125a09b0a block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
The libnvidmm-btt and nvme drivers use blk_integrity to reserve space
for per-sector metadata, but sometimes without protection checksums.
This property is generically useful, so teach the block core to
internally specify a nop profile if one is not provided at registration
time.

Cc: Keith Busch <keith.busch@intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
[hch: kill the local nvme nop profile as well]
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:43:45 -06:00
Dan Williams 9609b9942b md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
Now that the integrity profile is statically allocated there is no work
to do when shutting down an integrity enabled block device.

Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <JBottomley@Odin.com>
Acked-by: NeilBrown <neilb@suse.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:43:37 -06:00
Martin K. Petersen 25520d55cd block: Inline blk_integrity in struct gendisk
Up until now the_integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.

This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.

Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:

 - Moving the blk_integrity definition to genhd.h.

 - Inlining blk_integrity in struct gendisk.

 - Removing the dynamic allocation code.

 - Adding helper functions which allow gendisk to set up and tear down
   the integrity sysfs dir when a disk is added/deleted.

 - Adding a blk_integrity_revalidate() callback for updating the stable
   pages bdi setting.

 - The calls that depend on whether a device has an integrity profile or
   not now key off of the bi->profile pointer.

 - Simplifying the integrity support routines in DM (Mike Snitzer).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:42:42 -06:00
Martin K. Petersen 0f8087ecde block: Consolidate static integrity profile properties
We previously made a complete copy of a device's data integrity profile
even though several of the fields inside the blk_integrity struct are
pointers to fixed template entries in t10-pi.c.

Split the static and per-device portions so that we can reference the
template directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:42:38 -06:00
Dan Williams 538ea4aa44 pmem, memremap: convert to numa aware allocations
Given that pmem ranges come with numa-locality hints, arrange for the
resulting driver objects to be obtained from node-local memory.

Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-10-09 17:00:33 -04:00
Dan Williams b36f47617f devm_memremap: convert to return ERR_PTR
Make devm_memremap consistent with the error return scheme of
devm_memremap_pages to remove special casing in the pmem driver.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-10-09 17:00:33 -04:00
Dan Williams a639315d6c pmem: kill memremap_pmem()
Now that the pmem-api is defined as "a set of apis that enables access
to WB mapped pmem",  the mapping type is implied.  Remove the wrapper
and push the functionality down into the pmem driver in preparation for
adding support for direct-mapped pmem.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-10-09 17:00:32 -04:00
Ross Zwisler ba8fe0f85e pmem: add proper fencing to pmem_rw_page()
pmem_rw_page() needs to call wmb_pmem() on writes to make sure that the
newly written data is durable.  This flow was added to pmem_rw_bytes()
and pmem_make_request() with this commit:

commit 61031952f4 ("arch, x86: pmem api for ensuring durability of
	persistent memory updates")

...the pmem_rw_page() path was missed.

Cc: <stable@vger.kernel.org>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-09-17 11:49:28 -04:00
Axel Lin 4ca8b57a0a libnvdimm: pfn_devs: Fix locking in namespace_store
Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.

Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-09-17 11:47:50 -04:00
Axel Lin 4be9c1fc3d libnvdimm: btt_devs: Fix locking in namespace_store
Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.

Cc: <stable@vger.kernel.org>
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-09-17 11:37:16 -04:00
Linus Torvalds 12f03ee606 libnvdimm for 4.3:
1/ Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.  This facility is used by the pmem driver to
    enable pfn_to_page() operations on the page frames returned by DAX
    ('direct_access' in 'struct block_device_operations'). For now, the
    'memmap' allocation for these "device" pages comes from "System
    RAM".  Support for allocating the memmap from device memory will
    arrive in a later kernel.
 
 2/ Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt().  memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects.  The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.  Completion of
    the conversion is targeted for v4.4.
 
 3/ Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.
 
 4/ Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.
 
 5/ Miscellaneous updates and fixes to libnvdimm including support
    for issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV6Nx7AAoJEB7SkWpmfYgCWyYQAI5ju6Gvw27RNFtPovHcZUf5
 JGnxXejI6/AqeTQ+IulgprxtEUCrXOHjCDA5dkjr1qvsoqK1qxug+vJHOZLgeW0R
 OwDtmdW4Qrgeqm+CPoxETkorJ8wDOc8mol81kTiMgeV3UqbYeeHIiTAmwe7VzZ0C
 nNdCRDm5g8dHCjTKcvK3rvozgyoNoWeBiHkPe76EbnxDICxCB5dak7XsVKNMIVFQ
 NuYlnw6IYN7+rMHgpgpRux38NtIW8VlYPWTmHExejc2mlioWMNBG/bmtwLyJ6M3e
 zliz4/cnonTMUaizZaVozyinTa65m7wcnpjK+vlyGV2deDZPJpDRvSOtB0lH30bR
 1gy+qrKzuGKpaN6thOISxFLLjmEeYwzYd7SvC9n118r32qShz+opN9XX0WmWSFlA
 sajE1ehm4M7s5pkMoa/dRnAyR8RUPu4RNINdQ/Z9jFfAOx+Q26rLdQXwf9+uqbEb
 bIeSQwOteK5vYYCstvpAcHSMlJAglzIX5UfZBvtEIJN7rlb0VhmGWfxAnTu+ktG1
 o9cqAt+J4146xHaFwj5duTsyKhWb8BL9+xqbKPNpXEp+PbLsrnE/+WkDLFD67jxz
 dgIoK60mGnVXp+16I2uMqYYDgAyO5zUdmM4OygOMnZNa1mxesjbDJC6Wat1Wsndn
 slsw6DkrWT60CRE42nbK
 =o57/
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This update has successfully completed a 0day-kbuild run and has
  appeared in a linux-next release.  The changes outside of the typical
  drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
  removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
  the introduction of ZONE_DEVICE + devm_memremap_pages().

  Summary:

   - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
     mechanism for adding device-driver-discovered memory regions to the
     kernel's direct map.

     This facility is used by the pmem driver to enable pfn_to_page()
     operations on the page frames returned by DAX ('direct_access' in
     'struct block_device_operations').

     For now, the 'memmap' allocation for these "device" pages comes
     from "System RAM".  Support for allocating the memmap from device
     memory will arrive in a later kernel.

   - Introduce memremap() to replace usages of ioremap_cache() and
     ioremap_wt().  memremap() drops the __iomem annotation for these
     mappings to memory that do not have i/o side effects.  The
     replacement of ioremap_cache() with memremap() is limited to the
     pmem driver to ease merging the api change in v4.3.

     Completion of the conversion is targeted for v4.4.

   - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
     driver, update the VFS DAX implementation and PMEM api to provide
     persistence guarantees for kernel operations on a DAX mapping.

   - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
     cacheable to improve performance.

   - Miscellaneous updates and fixes to libnvdimm including support for
     issuing "address range scrub" commands, clarifying the optimal
     'sector size' of pmem devices, a clarification of the usage of the
     ACPI '_STA' (status) property for DIMM devices, and other minor
     fixes"

* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
  libnvdimm, pmem: direct map legacy pmem by default
  libnvdimm, pmem: 'struct page' for pmem
  libnvdimm, pfn: 'struct page' provider infrastructure
  x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
  add devm_memremap_pages
  mm: ZONE_DEVICE for "device memory"
  mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
  dax: drop size parameter to ->direct_access()
  nd_blk: change aperture mapping from WC to WB
  nvdimm: change to use generic kvfree()
  pmem, dax: have direct_access use __pmem annotation
  dax: update I/O path to do proper PMEM flushing
  pmem: add copy_from_iter_pmem() and clear_pmem()
  pmem, x86: clean up conditional pmem includes
  pmem: remove layer when calling arch_has_wmb_pmem()
  pmem, x86: move x86 PMEM API to new pmem.h header
  libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
  pmem: switch to devm_ allocations
  devres: add devm_memremap
  libnvdimm, btt: write and validate parent_uuid
  ...
2015-09-08 14:35:59 -07:00
Linus Torvalds 1081230b74 Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This first core part of the block IO changes contains:

   - Cleanup of the bio IO error signaling from Christoph.  We used to
     rely on the uptodate bit and passing around of an error, now we
     store the error in the bio itself.

   - Improvement of the above from myself, by shrinking the bio size
     down again to fit in two cachelines on x86-64.

   - Revert of the max_hw_sectors cap removal from a revision again,
     from Jeff Moyer.  This caused performance regressions in various
     tests.  Reinstate the limit, bump it to a more reasonable size
     instead.

   - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
     Most devices have huge trim limits, which can cause nasty latencies
     when deleting files.  Enable the admin to configure the size down.
     We will look into having a more sane default instead of UINT_MAX
     sectors.

   - Improvement of the SGP gaps logic from Keith Busch.

   - Enable the block core to handle arbitrarily sized bios, which
     enables a nice simplification of bio_add_page() (which is an IO hot
     path).  From Kent.

   - Improvements to the partition io stats accounting, making it
     faster.  From Ming Lei.

   - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
     file in blk-mq, as well as a fix for a blk-mq timeout race
     condition.

   - Ming Lin has been carrying Kents above mentioned patches forward
     for a while, and testing them.  Ming also did a few fixes around
     that.

   - Sasha Levin found and fixed a use-after-free problem introduced by
     the bio->bi_error changes from Christoph.

   - Small blk cgroup cleanup from Viresh Kumar"

* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
  blk: Fix bio_io_vec index when checking bvec gaps
  block: Replace SG_GAPS with new queue limits mask
  block: bump BLK_DEF_MAX_SECTORS to 2560
  Revert "block: remove artifical max_hw_sectors cap"
  blk-mq: fix race between timeout and freeing request
  blk-mq: fix buffer overflow when reading sysfs file of 'pending'
  Documentation: update notes in biovecs about arbitrarily sized bios
  block: remove bio_get_nr_vecs()
  fs: use helper bio_add_page() instead of open coding on bi_io_vec
  block: kill merge_bvec_fn() completely
  md/raid5: get rid of bio_fits_rdev()
  md/raid5: split bio for chunk_aligned_read
  block: remove split code in blkdev_issue_{discard,write_same}
  btrfs: remove bio splitting and merge_bvec_fn() calls
  bcache: remove driver private bio splitting code
  block: simplify bio_add_page()
  block: make generic_make_request handle arbitrarily sized bios
  blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
  block: don't access bio->bi_error after bio_put()
  block: shrink struct bio down to 2 cache lines again
  ...
2015-09-02 13:10:25 -07:00
Dan Williams 004f1afbe1 libnvdimm, pmem: direct map legacy pmem by default
The expectation is that the legacy / non-standard pmem discovery method
(e820 type-12) will only ever be used to describe small quantities of
persistent memory.  Larger capacities will be described via the ACPI
NFIT.  When "allocate struct page from pmem" support is added this default
policy can be overridden by assigning a legacy pmem namespace to a pfn
device, however this would be only be necessary if a platform used the
legacy mechanism to define a very large range.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-28 23:40:05 -04:00
Dan Williams 32ab0a3f51 libnvdimm, pmem: 'struct page' for pmem
Enable the pmem driver to handle PFN device instances.  Attaching a pmem
namespace to a pfn device triggers the driver to allocate and initialize
struct page entries for pmem.  Memory capacity for this allocation comes
exclusively from RAM for now which is suitable for low PMEM to RAM
ratios.  This mechanism will be expanded later for setting an "allocate
from PMEM" policy.

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-28 23:40:04 -04:00
Dan Williams e1455744b2 libnvdimm, pfn: 'struct page' provider infrastructure
Implement the base infrastructure for libnvdimm PFN devices. Similar to
BTT devices they take a namespace as a backing device and layer
functionality on top. In this case the functionality is reserving space
for an array of 'struct page' entries to be handed out through
pfn_to_page(). For now this is just the basic libnvdimm-device-model for
configuring the base PFN device.

As the namespace claiming mechanism for PFN devices is mostly identical
to BTT devices drivers/nvdimm/claim.c is created to house the common
bits.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-28 23:39:36 -04:00
Dan Williams 96601adb74 x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
Given that a write-back (WB) mapping plus non-temporal stores is
expected to be the most efficient way to access PMEM, update the
definition of ARCH_HAS_PMEM_API to imply arch support for
WB-mapped-PMEM.  This is needed as a pre-requisite for adding PMEM to
the direct map and mapping it with struct page.

The above clarification for X86_64 means that memcpy_to_pmem() is
permitted to use the non-temporal arch_memcpy_to_pmem() rather than
needlessly fall back to default_memcpy_to_pmem() when the pcommit
instruction is not available.  When arch_memcpy_to_pmem() is not
guaranteed to flush writes out of cache, i.e. on older X86_32
implementations where non-temporal stores may just dirty cache,
ARCH_HAS_PMEM_API is simply disabled.

The default fall back for persistent memory handling remains.  Namely,
map it with the WT (write-through) cache-type and hope for the best.

arch_has_pmem_api() is updated to only indicate whether the arch
provides the proper helpers to meet the minimum "writes are visible
outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()".  Code
that cares whether wmb_pmem() actually flushes writes to pmem must now
call arch_has_wmb_pmem() directly.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[hch: set ARCH_HAS_PMEM_API=n on x86_32]
Reviewed-by: Christoph Hellwig <hch@lst.de>
[toshi: x86_32 compile fixes]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-27 19:40:59 -04:00
Dan Williams cb389b9c0e dax: drop size parameter to ->direct_access()
None of the implementations currently use it.  The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-27 19:40:58 -04:00
Dan Williams 4a9bf88a5c Merge branch 'pmem-api' into libnvdimm-for-next 2015-08-27 19:40:26 -04:00
yalin wang a06a757652 nvdimm: change to use generic kvfree()
Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-27 19:35:48 -04:00
Ross Zwisler e2e05394e4 pmem, dax: have direct_access use __pmem annotation
Update the annotation for the kaddr pointer returned by direct_access()
so that it is a __pmem pointer.  This is consistent with the PMEM driver
and with how this direct_access() pointer is used in the DAX code.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-20 14:07:24 -04:00
Dan Williams 7a67832c7e libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
We currently register a platform device for e820 type-12 memory and
register a nvdimm bus beneath it.  Registering the platform device
triggers the device-core machinery to probe for a driver, but that
search currently comes up empty.  Building the nvdimm-bus registration
into the e820_pmem platform device registration in this way forces
libnvdimm to be built-in.  Instead, convert the built-in portion of
CONFIG_X86_PMEM_LEGACY to simply register a platform device and move the
rest of the logic to the driver for e820_pmem, for the following
reasons:

1/ Letting e820_pmem support be a module allows building and testing
   libnvdimm.ko changes without rebooting

2/ All the normal policy around modules can be applied to e820_pmem
   (unbind to disable and/or blacklisting the module from loading by
   default)

3/ Moving the driver to a generic location and converting it to scan
   "iomem_resource" rather than "e820.map" means any other architecture can
   take advantage of this simple nvdimm resource discovery mechanism by
   registering a resource named "Persistent Memory (legacy)"

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-19 00:34:34 -04:00
Christoph Hellwig 708ab62bef pmem: switch to devm_ allocations
Signed-off-by: Christoph Hellwig <hch@lst.de>
[djbw: tools/testing/nvdimm/ and memunmap_pmem support]
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-14 16:01:21 -04:00
Vishal Verma 6ec689542b libnvdimm, btt: write and validate parent_uuid
When a BTT is instantiated on a namespace it must validate the namespace
uuid matches the 'parent_uuid' stored in the btt superblock. This
property enforces that changing the namespace UUID invalidates all
former BTT instances on that storage. For "IO namespaces" that don't
have a label or UUID, the parent_uuid is set to zero, and this
validation is skipped. For such cases, old BTTs have to be invalidated
by forcing the namespace to raw mode, and overwriting the BTT info
blocks.

Based on a patch by Dan Williams <dan.j.williams@intel.com>

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-14 13:43:04 -04:00
Vishal Verma ab45e76327 libnvdimm, btt: consolidate arena validation
Use arena_is_valid as a common routine for checking the validity of an
info block from both discover_arenas, and nd_btt_probe.

As a result, don't check for validity of the BTT's UUID, and lbasize.
The checksum in the BTT info block guarantees self-consistency, and when
we're called from nd_btt_probe, we don't have a valid uuid or lbasize
available to check against.

Also cleanup to return a bool instead of an int.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-14 13:43:04 -04:00
Vishal Verma fbde1414ac libnvdimm, btt: clean up internal interfaces
Consolidate the parameters passed to arena_is_valid into just nd_btt,
and an info block to increase re-usability.

Similarly, btt_arena_write_layout doesn't need to be passed a uuid, as
it can be obtained from arena->nd_btt.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-14 13:43:04 -04:00
Randy Dunlap f6ef5a2a50 nvdimm: fix inline function return type warning
Fix multiple build warnings when CONFIG_BTT is not enabled:

In file included from ../drivers/nvdimm/bus.c:29:0:
../drivers/nvdimm/nd.h:169:15: warning: return type defaults to 'int' [-Wreturn-type]
 static inline nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
               ^

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-07-31 18:17:09 -04:00
Christoph Hellwig 4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Vishal Verma 6b47496a6f libnvdimm, pmem: Change pmem physical sector size to PAGE_SIZE
Based on a patch: c8fa317 brd: Request from fdisk 4k alignment by Boaz
Harrosh, allow fdisk to create properly aligned partitions for DAX. This
will also cause mkfs.ext4 to emit a warning if using a file system block
size of less than PAGE_SIZE.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Elliott, Robert <Elliott@hp.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Boaz Harrosh <boaz@plexistor.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-07-27 22:53:19 -04:00
Dan Williams 5e32940621 libnvdimm, btt: sparse fix
Fix:
drivers/nvdimm/btt.c:635:29: warning: restricted __le64 degrades to integer

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-07-27 22:53:19 -04:00
Dan Williams 8ca243536d libnvdimm: fix namespace seed creation
A new BLK namespace "seed" device is created whenever the current seed
is successfully probed.  However, if that namespace is assigned to a BTT
it may never directly experience a successful probe as it is a
subordinate device to a BTT configuration.

The effect of the current code is that no new namespaces can be
instantiated, after the seed namespace, to consume available BLK DPA
capacity.  Fix this by treating a successful BTT probe event as a
successful probe event for the backing namespace.

Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-07-25 09:57:56 -07:00
Axel Lin daa1dee405 nvdimm: Fix return value of nvdimm_bus_init() if class_create() fails
Return proper error if class_create() fails.

Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-30 14:30:34 -04:00
Dan Williams af834d457d libnvdimm: smatch cleanups in __nd_ioctl
Drop use of access_ok() since we are already using copy_{to|from}_user()
which do their own access_ok().

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-30 14:10:09 -04:00
Ross Zwisler 61031952f4 arch, x86: pmem api for ensuring durability of persistent memory updates
Based on an original patch by Ross Zwisler [1].

Writes to persistent memory have the potential to be posted to cpu
cache, cpu write buffers, and platform write buffers (memory controller)
before being committed to persistent media.  Provide apis,
memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
pmem and assert that it is durable in PMEM (a persistent linear address
range).  A '__pmem' attribute is added so sparse can track proper usage
of pointers to pmem.

This continues the status quo of pmem being x86 only for 4.2, but
reworks to ioremap, and wider implementation of memremap() will enable
other archs in 4.3.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[djbw: various reworks]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Toshi Kani 74ae66c3b1 libnvdimm: Add sysfs numa_node to NVDIMM devices
Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.

An example of numa_node values on a 2-socket system with a single
NVDIMM range on each socket is shown below.
  /sys/bus/nd/devices
  |-- btt0.0/numa_node:0
  |-- btt1.0/numa_node:1
  |-- btt1.1/numa_node:1
  |-- namespace0.0/numa_node:0
  |-- namespace1.0/numa_node:1
  |-- region0/numa_node:0
  |-- region1/numa_node:1

These numa_node files are then linked under the block class of
their device names.
  /sys/class/block/pmem0/device/numa_node:0
  /sys/class/block/pmem1s/device/numa_node:1

This enables numactl(8) to accept 'block:' and 'file:' paths of
pmem and btt devices as shown in the examples below.
  numactl --preferred block:pmem0 --show
  numactl --preferred file:/dev/pmem1s --show

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Toshi Kani 41d7a6d637 libnvdimm: Set numa_node to NVDIMM devices
ACPI NFIT table has System Physical Address Range Structure entries that
describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
set in the flags.

Change acpi_nfit_register_region() to map a proximity ID to its node ID,
and set it to a new numa_node field of nd_region_desc, which is then
conveyed to the nd_region device.

The device core arranges for btt and namespace devices to inherit their
node from their parent region.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
[djbw: move set_dev_node() from region.c to bus.c]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Dan Williams 5813882094 libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only
Upon detection of an unarmed dimm in a region, arrange for descendant
BTT, PMEM, or BLK instances to be read-only.  A dimm is primarily marked
"unarmed" via flags passed by platform firmware (NFIT).

The flags in the NFIT memory device sub-structure indicate the state of
the data on the nvdimm relative to its energy source or last "flush to
persistence".  For the most part there is nothing the driver can do but
advertise the state of these flags in sysfs and emit a message if
firmware indicates that the contents of the device may be corrupted.
However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
the block devices incorporating that nvdimm to be marked read-only.
This is a safe default as the data is still available and new writes are
held off until the administrator either forces read-write mode, or the
energy source becomes armed.

A 'read_only' attribute is added to REGION devices to allow for
overriding the default read-only policy of all descendant block devices.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Dan Williams 0f51c4fa7f pmem: flag pmem block devices as non-rotational
...since they are effectively SSDs as far as userspace is concerned.

Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Dan Williams f0dc089ce2 libnvdimm: enable iostat
This is disabled by default as the overhead is prohibitive, but if the
user takes the action to turn it on we'll oblige.

Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Dan Williams edc870e546 pmem: make_request cleanups
Various cleanups:

1/ Kill the BUG_ON since we've already told the block layer we don't
   support DISCARD on all these drivers.

2/ Kill the 'rw' variable, no need to cache it.

3/ Kill the local 'sector' variable.  bio_for_each_segment() is already
   advancing the iterator's sector number by the bio_vec length.

4/ Kill the check for accessing past the end of device
   generic_make_request_checks() already does that.

Suggested-by: Christoph Hellwig <hch@lst.de>
[hch: kill access past end of the device check]
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Dan Williams 43d3fa3a04 libnvdimm, pmem: fix up max_hw_sectors
There is no hardware limit to enforce on the size of the i/o that can be passed
to an nvdimm block device, so set it to UINT_MAX.

Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Vishal Verma fcae695737 libnvdimm, blk: add support for blk integrity
Support multiple block sizes (sector + metadata) for nd_blk in the
same way as done for the BTT. Add the idea of an 'internal' lbasize,
which is properly aligned and padded, and store metadata in this space.

Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Vishal Verma 41cd8b70c3 libnvdimm, btt: add support for blk integrity
Support multiple block sizes (sector + metadata) using the blk integrity
framework. This registers a new integrity template that defines the
protection information tuple size based on the configured metadata size,
and simply acts as a passthrough for protection information generated by
another layer. The metadata is written to the storage as-is, and read back
with each sector.

Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Ross Zwisler 047fc8a1f9 libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
The libnvdimm implementation handles allocating dimm address space (DPA)
between PMEM and BLK mode interfaces.  After DPA has been allocated from
a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
as a struct bio based block device. Unlike PMEM, BLK is required to
handle platform specific details like mmio register formats and memory
controller interleave.  For this reason the libnvdimm generic nd_blk
driver calls back into the bus provider to carry out the I/O.

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
DCR (dimm control region), BDW (block data window), IDT (interleave
descriptor) NFIT structures and the hardware register format.
[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Vishal Verma 5212e11fde nd_btt: atomic sector updates
BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked blocked device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We did a comparison between two lane lock
strategies - first where we kept an atomic counter around that tracked
which was the last lane that was used, and 'our' lane was determined by
atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
theoretically, no CPU would be blocked waiting for a lane. The other
strategy was to use the cpu number we're scheduled on to and hash it to
a lane number. Theoretically, this could block an IO that could've
otherwise run using a different, free lane. But some fio workloads
showed that the direct cpu -> lane hash performed faster than tracking
'last lane' - my reasoning is the cache thrash caused by moving the
atomic variable made that approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a
very simple bio-based one that does synchronous IOs instead of queuing.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-26 11:23:38 -04:00
Dan Williams 8c2f7e8658 libnvdimm: infrastructure for btt devices
NVDIMM namespaces, in addition to accepting "struct bio" based requests,
also have the capability to perform byte-aligned accesses.  By default
only the bio/block interface is used.  However, if another driver can
make effective use of the byte-aligned capability it can claim namespace
interface and use the byte-aligned ->rw_bytes() interface.

The BTT driver is the initial first consumer of this mechanism to allow
adding atomic sector update semantics to a pmem or blk namespace.  This
patch is the sysfs infrastructure to allow configuring a BTT instance
for a namespace.  Enabling that BTT and performing i/o is in a
subsequent patch.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-25 04:20:04 -04:00
Dan Williams 0ba1c63489 libnvdimm: write blk label set
After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have been
set to valid values the labels on the dimm can be updated.  The
difference with the pmem case is that blk namespaces are limited to one
dimm and can cover discontiguous ranges in dpa space.

Also, after allocating label slots, it is useful for userspace to know
how many slots are left.  Export this information in sysfs.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams f524bf271a libnvdimm: write pmem label set
After 'uuid', 'size', and optionally 'alt_name' have been set to valid
values the labels on the dimms can be updated.

Write procedure is:
1/ Allocate and write new labels in the "next" index
2/ Free the old labels in the working copy
3/ Write the bitmap and the label space on the dimm
4/ Write the index to make the update valid

Label ranges directly mirror the dpa resource values for the given
label_id of the namespace.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams 1b40e09a12 libnvdimm: blk labels and namespace instantiation
A blk label set describes a namespace comprised of one or more
discontiguous dpa ranges on a single dimm.  They may alias with one or
more pmem interleave sets that include the given dimm.

This is the runtime/volatile configuration infrastructure for sysfs
manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'.  A later
patch will make these settings persistent by writing back the label(s).

Unlike pmem namespaces, multiple blk namespaces can be created per
region.  Once a blk namespace has been created a new seed device
(unconfigured child of a parent blk region) is instantiated.  As long as
a region has 'available_size' != 0 new child namespaces may be created.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams bf9bccc14c libnvdimm: pmem label sets and namespace instantiation.
A complete label set is a PMEM-label per-dimm per-interleave-set where
all the UUIDs match and the interleave set cookie matches the hosting
interleave set.

Present sysfs attributes for manipulation of a PMEM-namespace's
'alt_name', 'uuid', and 'size' attributes.  A later patch will make
these settings persistent by writing back the label.

Note that PMEM allocations grow forwards from the start of an interleave
set (lowest dimm-physical-address (DPA)).  BLK-namespaces that alias
with a PMEM interleave set will grow allocations backward from the
highest DPA.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams 4a826c83db libnvdimm: namespace indices: read and validate
This on media label format [1] consists of two index blocks followed by
an array of labels.  None of these structures are ever updated in place.
A sequence number tracks the current active index and the next one to
write, while labels are written to free slots.

    +------------+
    |            |
    |  nsindex0  |
    |            |
    +------------+
    |            |
    |  nsindex1  |
    |            |
    +------------+
    |   label0   |
    +------------+
    |   label1   |
    +------------+
    |            |
     ....nslot...
    |            |
    +------------+
    |   labelN   |
    +------------+

After reading valid labels, store the dpa ranges they claim into
per-dimm resource trees.

[1]: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf

Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams eaf961536e libnvdimm, nfit: add interleave-set state-tracking infrastructure
On platforms that have firmware support for reading/writing per-dimm
label space, a portion of the dimm may be accessible via an interleave
set PMEM mapping in addition to the dimm's BLK (block-data-window
aperture(s)) interface.  A label, stored in a "configuration data
region" on the dimm, disambiguates which dimm addresses are accessed
through which exclusive interface.

Add infrastructure that allows the kernel to block modifications to a
label in the set while any member dimm is active.  Note that this is
meant only for enforcing "no modifications of active labels" via the
coarse ioctl command.  Adding/deleting namespaces from an active
interleave set is always possible via sysfs.

Another aspect of tracking interleave sets is tracking their integrity
when DIMMs in a set are physically re-ordered.  For this purpose we
generate an "interleave-set cookie" that can be recorded in a label and
validated against the current configuration.  It is the bus provider
implementation's responsibility to calculate the interleave set cookie
and attach it to a given region.

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams 9f53f9fa4a libnvdimm, pmem: add libnvdimm support to the pmem driver
nd_pmem attaches to persistent memory regions and namespaces emitted by
the libnvdimm subsystem, and, same as the original pmem driver, presents
the system-physical-address range as a block device.

The existing e820-type-12 to pmem setup is converted to an nvdimm_bus
that emits an nd_namespace_io device.

Note that the X in 'pmemX' is now derived from the parent region.  This
provides some stability to the pmem devices names from boot-to-boot.
The minor numbers are also more predictable by passing 0 to
alloc_disk().

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams 18da2c9ee4 libnvdimm, pmem: move pmem to drivers/nvdimm/
Prepare the pmem driver to consume PMEM namespaces emitted by regions of
an nvdimm_bus instance.  No functional change.

Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams 3d88002e4a libnvdimm: support for legacy (non-aliasing) nvdimms
The libnvdimm region driver is an intermediary driver that translates
non-volatile "region"s into "namespace" sub-devices that are surfaced by
persistent memory block-device drivers (PMEM and BLK).

ACPI 6 introduces the concept that a given nvdimm may simultaneously
offer multiple access modes to its media through direct PMEM load/store
access, or windowed BLK mode.  Existing nvdimms mostly implement a PMEM
interface, some offer a BLK-like mode, but never both as ACPI 6 defines.
If an nvdimm is single interfaced, then there is no need for dimm
metadata labels.  For these devices we can take the region boundaries
directly to create a child namespace device (nd_namespace_io).

Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams 1f7df6f88b libnvdimm, nfit: regions (block-data-window, persistent memory, volatile memory)
A "region" device represents the maximum capacity of a BLK range (mmio
block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
volatile memory), without regard for aliasing.  Aliasing, in the
dimm-local address space (DPA), is resolved by metadata on a dimm to
designate which exclusive interface will access the aliased DPA ranges.
Support for the per-dimm metadata/label arrvies is in a subsequent
patch.

The name format of "region" devices is "regionN" where, like dimms, N is
a global ida index assigned at discovery time.  This id is not reliable
across reboots nor in the presence of hotplug.  Look to attributes of
the region or static id-data of the sub-namespace to generate a
persistent name.  However, if the platform configuration does not change
it is reasonable to expect the same region id to be assigned at the next
boot.

"region"s have 2 generic attributes "size", and "mapping"s where:
- size: the BLK accessible capacity or the span of the
  system physical address range in the case of PMEM.

- mappingN: a tuple describing a dimm's contribution to the region's
  capacity in the format (<nmemX>,<dpa>,<size>).  For a PMEM-region
  there will be at least one mapping per dimm in the interleave set.  For
  a BLK-region there is only "mapping0" listing the starting DPA of the
  BLK-region and the available DPA capacity of that space (matches "size"
  above).

The max number of mappings per "region" is hard coded per the
constraints of sysfs attribute groups.  That said the number of mappings
per region should never exceed the maximum number of possible dimms in
the system.  If the current number turns out to not be enough then the
"mappings" attribute clarifies how many there are supposed to be. "32
should be enough for anybody...".

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams 4d88a97aa9 libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure
* Implement the device-model infrastructure for loading modules and
  attaching drivers to nvdimm devices.  This is a simple association of a
  nd-device-type number with a driver that has a bitmask of supported
  device types.  To facilitate userspace bind/unbind operations 'modalias'
  and 'devtype', that also appear in the uevent, are added as generic
  sysfs attributes for all nvdimm devices.  The reason for the device-type
  number is to support sub-types within a given parent devtype, be it a
  vendor-specific sub-type or otherwise.

* The first consumer of this infrastructure is the driver
  for dimm devices.  It simply uses control messages to retrieve and
  store the configuration-data image (label set) from each dimm.

Note: nd_device_register() arranges for asynchronous registration of
      nvdimm bus devices by default.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams 62232e45f4 libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices
Most discovery/configuration of the nvdimm-subsystem is done via sysfs
attributes.  However, some nvdimm_bus instances, particularly the
ACPI.NFIT bus, define a small set of messages that can be passed to the
platform.  For convenience we derive the initial libnvdimm-ioctl command
formats directly from the NFIT DSM Interface Example formats.

    ND_CMD_SMART: media health and diagnostics
    ND_CMD_GET_CONFIG_SIZE: size of the label space
    ND_CMD_GET_CONFIG_DATA: read label space
    ND_CMD_SET_CONFIG_DATA: write label space
    ND_CMD_VENDOR: vendor-specific command passthrough
    ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
    ND_CMD_ARS_START: initiate scrubbing
    ND_CMD_ARS_STATUS: report on scrubbing state
    ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

If a platform later defines different commands than this set it is
straightforward to extend support to those formats.

Most of the commands target a specific dimm.  However, the
address-range-scrubbing commands target the bus.  The 'commands'
attribute in sysfs of an nvdimm_bus, or nvdimm, enumerate the supported
commands for that object.

Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams e6dfb2de47 libnvdimm, nfit: dimm/memory-devices
Enable nvdimm devices to be registered on a nvdimm_bus.  The kernel
assigned device id for nvdimm devicesis dynamic.  If userspace needs a
more static identifier it should consult a provider-specific attribute.
In the case where NFIT is the provider, the 'nmemX/nfit/handle' or
'nmemX/nfit/serial' attributes may be used for this purpose.

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams 45def22c1f libnvdimm: control character device and nvdimm_bus sysfs attributes
The control device for a nvdimm_bus is registered as an "nd" class
device.  The expectation is that there will usually only be one "nd" bus
registered under /sys/class/nd.  However, we allow for the possibility
of multiple buses and they will listed in discovery order as
ndctl0...ndctlN.  This character device hosts the ioctl for passing
control messages.  The initial command set has a 1:1 correlation with
the commands listed in the by the "NFIT DSM Example" document [1], but
this scheme is extensible to future command sets.

Note, nd_ioctl() and the backing ->ndctl() implementation are defined in
a subsequent patch.  This is simply the initial registrations and sysfs
attributes.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Neil Brown <neilb@suse.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00
Dan Williams b94d5230d0 libnvdimm, nfit: initial libnvdimm infrastructure and NFIT support
A struct nvdimm_bus is the anchor device for registering nvdimm
resources and interfaces, for example, a character control device,
nvdimm devices, and I/O region devices.  The ACPI NFIT (NVDIMM Firmware
Interface Table) is one possible platform description for such
non-volatile memory resources in a system.  The nfit.ko driver attaches
to the "ACPI0012" device that indicates the presence of the NFIT and
parses the table to register a struct nvdimm_bus instance.

Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-06-24 21:24:10 -04:00