The following namespace configuration attempt:
# ndctl create-namespace -e namespace0.0 -m devdax -a 1G -f
libndctl: ndctl_dax_enable: dax0.1: failed to enable
Error: namespace0.0: failed to enable
failed to reconfigure namespace: No such device or address
...fails when the backing memory range is not physically aligned to 1G:
# cat /proc/iomem | grep Persistent
210000000-30fffffff : Persistent Memory (legacy)
In the above example the 4G persistent memory range starts and ends on a
256MB boundary.
We handle this case correctly when needing to handle cases that violate
section alignment (128MB) collisions against "System RAM", and we simply
need to extend that padding/truncation for the 1GB alignment use case.
Cc: <stable@vger.kernel.org>
Fixes: 315c562536 ("libnvdimm, pfn: add 'align' attribute...")
Reported-and-tested-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The alignment checks at pfn driver startup fail to properly account for
the 'start_pad' in the case where the namespace is misaligned relative
to its internal alignment. This is typically triggered in 1G aligned
namespace, but could theoretically trigger with small namespace
alignments. When this triggers the kernel reports messages of the form:
dax2.1: bad offset: 0x3c000000 dax disabled align: 0x40000000
Cc: <stable@vger.kernel.org>
Fixes: 1ee6667cd8 ("libnvdimm, pfn, dax: fix initialization vs autodetect...")
Reported-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The kernel's ND_IOCTL_SMART_THRESHOLD command is based on a payload
definition that has become broken / out-of-sync with recent versions of
the NVDIMM_FAMILY_INTEL definition. Deprecate the use of the
ND_IOCTL_SMART_THRESHOLD command in favor of the ND_CMD_CALL approach
taken by NVDIMM_FAMILY_{HPE,MSFT}, where we can manage the per-vendor
variance in userspace.
In a couple years, when the new scheme is widely deployed in userspace
packages, the ND_IOCTL_SMART_THRESHOLD support can be removed. For now
we prevent new binaries from compiling against the kernel header
definitions, but kernel still compatible with old binaries. The
libndctl.h [1] header is now the authoritative interface definition for
NVDIMM SMART.
[1]: https://github.com/pmem/ndctl
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
'userspace flush' of persistent memory updates via filesystem-dax
mappings. It arranges for any filesystem metadata updates that may be
required to satisfy a write fault to also be flushed ("on disk") before
the kernel returns to userspace from the fault handler. Effectively
every write-fault that dirties metadata completes an fsync() before
returning from the fault handler. The new MAP_SHARED_VALIDATE mapping
type guarantees that the MAP_SYNC flag is validated as supported by the
filesystem's ->mmap() file operation.
* Add support for the standard ACPI 6.2 label access methods that
replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods. This
enables interoperability with environments that only implement the
standardized methods.
* Add support for the ACPI 6.2 NVDIMM media error injection methods.
* Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for latch
last shutdown status, firmware update, SMART error injection, and
SMART alarm threshold control.
* Cleanup physical address information disclosures to be root-only.
* Fix revalidation of the DIMM "locked label area" status to support
dynamic unlock of the label area.
* Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
(system-physical-address) command and error injection commands.
Acknowledgements that came after the commits were pushed to -next:
957ac8c421 dax: fix PMD faults on zero-length files
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
a39e596baa xfs: support for synchronous DAX faults
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
7b565c9f96 xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaDfvcAAoJEB7SkWpmfYgCk7sP/2qJhBH+VTTdg2osDnhAdAhI
co/AGEmsHFlUCMBb/Ek7UnMAmhBYiJU2q4ywPsNFBpusXpMlqNy5Iwo7k4/wQHE/
SJcIM0g4zg0ViFuUhwV+C2T0R5UzFR8JLd9EYWj/YS6aJpurtotm5l4UStaM0Hzo
AhxSXJLrBDuqCpbOxbctfiGEmdRL7aRfBEAARTNRKBn/iXxJUcYHlp62rtXQS+t4
I6LC/URCWTNTTMGmzW6TRsgSD9WMfd19xKcGzN3qL6ee0KFccxN4ctFqHA/sFGOh
iYLeR0XJUjJxyp+PkWGteXPVZL0Kj3bD/lSTG+Co5bm/ra8a/sh3TSFfgFyoBZD1
EqMN8Ryf80hGp3FabeH2Iw2SviYPZpHSWgjddjxLD0RA6OmpzINc+Wm8eqApjMME
sbZDTOijiab4QMQ0XamF4GuDHyQtawv5Y/w2Ehhl1tmiqW+5tKhsKqxkQt+/V3Yt
RTVSRe2Pkway66b+cD64IdQ6L2tyonPnmi5IzgkKOhlOEGomy+4/U2Jt2bMbhzq6
ymszKmXp2XI8P06wU8sHrIUeXO5I9qoKn/fZA73Eb8aIzgJe3tBE/5+Ab7RG6HB9
1OVfcMWoXU1gNgNktTs63X1Lsg4aW9kt/K4fPHHcqUcaliEJpJTlAbg9GLF2buoW
nQ+0fTRgMRihE3ZA0Fs3
=h2vZ
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and dax updates from Dan Williams:
"Save for a few late fixes, all of these commits have shipped in -next
releases since before the merge window opened, and 0day has given a
build success notification.
The ext4 touches came from Jan, and the xfs touches have Darrick's
reviewed-by. An xfstest for the MAP_SYNC feature has been through
a few round of reviews and is on track to be merged.
- Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
'userspace flush' of persistent memory updates via filesystem-dax
mappings. It arranges for any filesystem metadata updates that may
be required to satisfy a write fault to also be flushed ("on disk")
before the kernel returns to userspace from the fault handler.
Effectively every write-fault that dirties metadata completes an
fsync() before returning from the fault handler. The new
MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag
is validated as supported by the filesystem's ->mmap() file
operation.
- Add support for the standard ACPI 6.2 label access methods that
replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods.
This enables interoperability with environments that only implement
the standardized methods.
- Add support for the ACPI 6.2 NVDIMM media error injection methods.
- Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for
latch last shutdown status, firmware update, SMART error injection,
and SMART alarm threshold control.
- Cleanup physical address information disclosures to be root-only.
- Fix revalidation of the DIMM "locked label area" status to support
dynamic unlock of the label area.
- Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
(system-physical-address) command and error injection commands.
Acknowledgements that came after the commits were pushed to -next:
- 957ac8c421 ("dax: fix PMD faults on zero-length files"):
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
- a39e596baa ("xfs: support for synchronous DAX faults") and
7b565c9f96 ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()")
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>"
* tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits)
acpi, nfit: add 'Enable Latch System Shutdown Status' command support
dax: fix general protection fault in dax_alloc_inode
dax: fix PMD faults on zero-length files
dax: stop requiring a live device for dax_flush()
brd: remove dax support
dax: quiet bdev_dax_supported()
fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
tools/testing/nvdimm: unit test clear-error commands
acpi, nfit: validate commands against the device type
tools/testing/nvdimm: stricter bounds checking for error injection commands
xfs: support for synchronous DAX faults
xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
ext4: Support for synchronous DAX faults
ext4: Simplify error handling in ext4_dax_huge_fault()
dax: Implement dax_finish_sync_fault()
dax, iomap: Add support for synchronous faults
mm: Define MAP_SYNC and VM_SYNC flags
dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
dax: Allow dax_iomap_fault() to return pfn
dax: Fix comment describing dax_iomap_fault()
...
As discussed at
https://lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com>
someday we will remove rw_page(). If so, we need something to detect
such super-fast storage on which synchronous IO operations like the
current rw_page are always a win.
Introduces BDI_CAP_SYNCHRONOUS_IO to indicate such devices. With it, we
could use various optimization techniques.
Link: http://lkml.kernel.org/r/1505886205-9671-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we're reusing the badrange functions for nfit_test, and that
exposes badrange injection/clearing to userspace via the DSM paths, it
is plausible that a user may call the clear DSM multiple times. Since it
is harmless to do so, we can remove the WARN in badrange_forget.
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
nfit_test needs to use the poison list manipulation code as well. Make
it more generic and in the process rename poison to badrange, and move
all the related helpers to a new file.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
[vishal: Add badrange.o to nfit_test's Kbuild]
[vishal: add a missed include in bus.c for the new badrange functions]
[vishal: rename all instances of 'be' to 'bre']
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes some spelling typos found in Kconfig files.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The functions create_namespace_pmem and create_namespace_blk are local
to the source and do not need to be in global scope, so make them static.
Cleans up sparse warnings:
symbol 'create_namespace_pmem' was not declared. Should it be static?
symbol 'create_namespace_blk' was not declared. Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Given that we now how have two mechanisms for a DIMM to indicate that it
is locked:
* NVDIMM_FAMILY_INTEL 'get_config_size' _DSM command
* ACPI 6.2 Label Storage Read / Write commands
...export the generic libnvdimm DIMM status in a new 'flags' attribute.
This attribute can also reflect the 'alias' state which indicates
whether the nvdimm core is enforcing labels for aliased-region-capacity
that the given dimm is an interleave-set member.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
ACPI 6.2 adds support for named methods to access the label storage area
of an NVDIMM. We prefer these new methods if available and otherwise
fallback to the NVDIMM_FAMILY_INTEL _DSMs. The kernel ioctls,
ND_IOCTL_{GET,SET}_CONFIG_{SIZE,DATA}, remain generic and the driver
translates the 'package' payloads into the NVDIMM_FAMILY_INTEL 'buffer'
format to maintain compatibility with existing userspace and keep the
output buffer parsing code in the driver common.
The output payloads are mostly compatible save for the 'label area
locked' status that moves from the 'config_size' (_LSI) command to the
'config_read' (_LSR) command status.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The set of valid sequence numbers is {1,2,3}. The specification
indicates that an implementation should consider 0 a sign of a critical
error:
UEFI 2.7: 13.19 NVDIMM Label Protocol
Software never writes the sequence number 00, so a correctly
check-summed Index Block with this sequence number probably indicates a
critical error. When software discovers this case it treats it as an
invalid Index Block indication.
While the expectation is that the invalid block is just thrown away, the
Robustness Principle says we should fix this to make both sequence
numbers valid.
Fixes: f524bf271a ("libnvdimm: write pmem label set")
Cc: <stable@vger.kernel.org>
Reported-by: Juston Li <juston.li@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
For the same reason that /proc/iomem returns 0's for non-root readers
and acpi tables are root-only, make the 'resource' attribute for pfn
devices only readable by root. Otherwise we disclose physical address
information.
Fixes: f6ed58c70d ("libnvdimm, pfn: 'resource'-address and 'size'...")
Cc: <stable@vger.kernel.org>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
For the same reason that /proc/iomem returns 0's for non-root readers
and acpi tables are root-only, make the 'resource' attribute for
namespace devices only readable by root. Otherwise we disclose physical
address information.
Fixes: bf9bccc14c ("libnvdimm: pmem label sets and namespace instantiation")
Cc: <stable@vger.kernel.org>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
For the same reason that /proc/iomem returns 0's for non-root readers
and acpi tables are root-only, make the 'resource' attribute for region
devices only readable by root. Otherwise we disclose physical address
information.
Fixes: 802f4be6fe ("libnvdimm: Add 'resource' sysfs attribute to regions")
Cc: <stable@vger.kernel.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If we successfully enable a DIMM then it must not be locked and we can
clear the label-read failure condition. Otherwise, we need to reload the
entire bus provider driver to achieve the same effect, and that can
disrupt unrelated DIMMs and namespaces.
Fixes: 9d62ed9651 ("libnvdimm: handle locked label storage areas")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Maurice reports:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: holder_class_store+0x253/0x2b0 [libnvdimm]
...while trying to reconfigure an NVDIMM-N namespace into 'sector' /
'btt' mode. The crash points to this line:
(gdb) li *(holder_class_store+0x253)
0x7773 is in holder_class_store (drivers/nvdimm/namespace_devs.c:1420).
1415 for (i = 0; i < nd_region->ndr_mappings; i++) {
1416 struct nd_mapping *nd_mapping = &nd_region->mapping[i];
1417 struct nvdimm_drvdata *ndd = to_ndd(nd_mapping);
1418 struct nd_namespace_index *nsindex;
1419
1420 nsindex = to_namespace_index(ndd, ndd->ns_current);
...where we are failing because ndd is NULL due to NVDIMM-N dimms not
supporting labels.
Long story short, default to the BTTv1 format in the label-less /
NVDIMM-N case.
Fixes: 14e4945426 ("libnvdimm, btt: BTT updates for UEFI 2.7 format")
Cc: <stable@vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Maurice A. Saldivar <maurice.a.saldivar@hpe.com>
Tested-by: Maurice A. Saldivar <maurice.a.saldivar@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
- Constify a few variables in DM core and DM integrity
- Add bufio optimization and checksum failure accounting to DM integrity
- Fix DM integrity to avoid checking integrity of failed reads
- Fix DM integrity to use init_completion
- A couple DM log-writes target fixes
- Simplify DAX flushing by eliminating the unnecessary flush abstraction
that was stood up for DM's use.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJZuo8UAAoJEMUj8QotnQNa5BEIANO4mHh1nrzEbH72a4RCLgxV
H1Pk1zZx/W1bhOOmcRRhxCSM85dPgsCegc5EmpwLZEMavQrP9UZblHcYOUsyIx7W
S/lWa+soOq/5N2OveROc4WdoWVs50UFmc1+BcClc4YrEe+15XC3R0VMkjX2b/hUL
o2eYhPjpMlgaorMtRRU6MAooo2fBRQ9m05aPeVgd35fxibrE7PZm+EYW09wa0STi
9ufuDXJf8+TtFP/38BD41LbUEskuHUZTSDeAJ+3DBaTtfEZcZYxsst4P9JangsHx
jqqqI9aYzFD2a27fl9WLhCvm40YFiKp5nwzED0RZjzWxVa/jTShX7a49BdzTTfw=
=rkSB
-----END PGP SIGNATURE-----
Merge tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Some request-based DM core and DM multipath fixes and cleanups
- Constify a few variables in DM core and DM integrity
- Add bufio optimization and checksum failure accounting to DM
integrity
- Fix DM integrity to avoid checking integrity of failed reads
- Fix DM integrity to use init_completion
- A couple DM log-writes target fixes
- Simplify DAX flushing by eliminating the unnecessary flush
abstraction that was stood up for DM's use.
* tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dax: remove the pmem_dax_ops->flush abstraction
dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK
dm integrity: make blk_integrity_profile structure const
dm integrity: do not check integrity for failed read operations
dm log writes: fix >512b sectorsize support
dm log writes: don't use all the cpu while waiting to log blocks
dm ioctl: constify ioctl lookup table
dm: constify argument arrays
dm integrity: count and display checksum failures
dm integrity: optimize writing dm-bufio buffers that are partially changed
dm rq: do not update rq partially in each ending bio
dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior
dm mpath: complain about unsupported __multipath_map_bio() return values
dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through
* Media error handling support in the Block Translation Table (BTT)
driver is reworked to address sleeping-while-atomic locking and
memory-allocation-context conflicts.
* The dax_device lookup overhead for xfs and ext4 is moved out of the
iomap hot-path to a mount-time lookup.
* A new 'ecc_unit_size' sysfs attribute is added to advertise the
read-modify-write boundary property of a persistent memory range.
* Preparatory fix-ups for arm and powerpc pmem support are included
along with other miscellaneous fixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZtsAGAAoJEB7SkWpmfYgCrzMP/2vPvZvrFjZn5pAoZjlmTmHM
ySceoOC7vwvVXIsSs52FhSjcxEoXo9cklXPwhXOPVtVUFdSDJBUOIUxwIziE6Y+5
sFJ2xT9K+5zKBUiXJwqFQDg52dn//eBNnnnDz+HQrBSzGrbWQhIZY2m19omPzv1I
BeN0OCGOdW3cjSo3BCFl1d+KrSl704e7paeKq/TO3GIiAilIXleTVxcefEEodV2K
ZvWHpFIhHeyN8dsF8teI952KcCT92CT/IaabxQIwCxX0/8/GFeDc5aqf77qiYWKi
uxCeQXdgnaE8EZNWZWGWIWul6eYEkoCNbLeUQ7eJnECq61VxVajJS0NyGa5T9OiM
P046Bo2b1b3R0IHxVIyVG0ZCm3YUMAHSn/3uRxPgESJ4bS/VQ3YP5M6MLxDOlc90
IisLilagitkK6h8/fVuVrwciRNQ71XEC34t6k7GCl/1ZnLlLT+i4/jc5NRZnGEZh
aXAAGHdteQ+/mSz6p2UISFUekbd6LerwzKRw8ibDvH6pTud8orYR7g2+JoGhgb6Y
pyFVE8DhIcqNKAMxBsjiRZ46OQ7qrT+AemdAG3aVv6FaNoe4o5jPLdw2cEtLqtpk
+DNm0/lSWxxxozjrvu6EUZj6hk8R5E19XpRzV5QJkcKUXMu7oSrFLdMcC4FeIjl9
K4hXLV3fVBVRMiS0RA6z
=5iGY
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm from Dan Williams:
"A rework of media error handling in the BTT driver and other updates.
It has appeared in a few -next releases and collected some late-
breaking build-error and warning fixups as a result.
Summary:
- Media error handling support in the Block Translation Table (BTT)
driver is reworked to address sleeping-while-atomic locking and
memory-allocation-context conflicts.
- The dax_device lookup overhead for xfs and ext4 is moved out of the
iomap hot-path to a mount-time lookup.
- A new 'ecc_unit_size' sysfs attribute is added to advertise the
read-modify-write boundary property of a persistent memory range.
- Preparatory fix-ups for arm and powerpc pmem support are included
along with other miscellaneous fixes"
* tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
libnvdimm, btt: fix format string warnings
libnvdimm, btt: clean up warning and error messages
ext4: fix null pointer dereference on sbi
libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
dax: fix FS_DAX=n BLOCK=y compilation
libnvdimm: fix integer overflow static analysis warning
libnvdimm, nd_blk: remove mmio_flush_range()
libnvdimm, btt: rework error clearing
libnvdimm: fix potential deadlock while clearing errors
libnvdimm, btt: cache sector_size in arena_info
libnvdimm, btt: ensure that flags were also unchanged during a map_read
libnvdimm, btt: refactor map entry operations with macros
libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
ext4: perform dax_device lookup at mount
ext2: perform dax_device lookup at mount
xfs: perform dax_device lookup at mount
dax: introduce a fs_dax_get_by_bdev() helper
libnvdimm, btt: check memory allocation failure
libnvdimm, label: fix index block size calculation
...
Commit abebfbe2f7 ("dm: add ->flush() dax operation support") is
buggy. A DM device may be composed of multiple underlying devices and
all of them need to be flushed. That commit just routes the flush
request to the first device and ignores the other devices.
It could be fixed by adding more complex logic to the device mapper. But
there is only one implementation of the method pmem_dax_ops->flush - that
is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
don't need the pmem_dax_ops->flush abstraction at all, we can call
arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
can't ever reach anything different from arch_wb_cache_pmem().
It should be also pointed out that for some uses of persistent memory it
is needed to flush only a very small amount of data (such as 1 cacheline),
and it would be overkill if we go through that device mapper machinery for
a single flushed cache line.
Fix this by removing the pmem_dax_ops->flush abstraction and call
arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
mapper code that forwards the flushes.
Fixes: abebfbe2f7 ("dm: add ->flush() dax operation support")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fix format warnings (seen on i386) in nvdimm/btt.c:
../drivers/nvdimm/btt.c: In function ‘btt_map_init’:
../drivers/nvdimm/btt.c:430:3: warning: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘size_t’ [-Wformat=]
dev_WARN_ONCE(to_dev(arena), size < 512,
^
../drivers/nvdimm/btt.c: In function ‘btt_log_init’:
../drivers/nvdimm/btt.c:474:3: warning: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘size_t’ [-Wformat=]
dev_WARN_ONCE(to_dev(arena), size < 512,
^
Fixes: 86652d2eb3 ("libnvdimm, btt: clean up warning and error messages")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Convert all WARN* style messages to dev_WARN, and for errors in the IO
paths, use dev_err_ratelimited. Also remove some BUG_ONs in the IO path
and replace them with the above - no need to crash the machine in case
of an unaligned IO.
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
The .rw_page in struct block_device_operations is used by the swap
subsystem to read/write the page contents from/into the corresponding
swap slot in the swap device. To support the THP (Transparent Huge
Page) swap optimization, the .rw_page is enhanced to support to
read/write THP if possible.
Link: http://lkml.kernel.org/r/20170724051840.2309-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Delay the check of nd_reserved2 to the actual endpoint (acpi_nfit_ctl)
that uses it, as a prevention of a potential double-fetch bug.
While examining the kernel source code, I found a dangerous operation that
could turn into a double-fetch situation (a race condition bug) where
the same userspace memory region are fetched twice into kernel with sanity
checks after the first fetch while missing checks after the second fetch.
In the case of _IOC_NR(ioctl_cmd) == ND_CMD_CALL:
1. The first fetch happens in line 935 copy_from_user(&pkg, p, sizeof(pkg)
2. subsequently `pkg.nd_reserved2` is asserted to be all zeroes
(line 984 to 986).
3. The second fetch happens in line 1022 copy_from_user(buf, p, buf_len)
4. Given that `p` can be fully controlled in userspace, an attacker can
race condition to override the header part of `p`, say,
`((struct nd_cmd_pkg *)p)->nd_reserved2` to arbitrary value
(say nine 0xFFFFFFFF for `nd_reserved2`) after the first fetch but before the
second fetch. The changed value will be copied to `buf`.
5. There is no checks on the second fetches until the use of it in
line 1034: nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf) and
line 1038: nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc)
which means that the assumed relation, `p->nd_reserved2` are all zeroes might
not hold after the second fetch. And once the control goes to these functions
we lose the context to assert the assumed relation.
6. Based on my manual analysis, `p->nd_reserved2` is not used in function
`nd_cmd_clear_to_send` and potential implementations of `nd_desc->ndctl`
so there is no working exploit against it right now. However, this could
easily turns to an exploitable one if careless developers start to use
`p->nd_reserved2` later and assume that they are all zeroes.
Move the validation of the nd_reserved2 field to the ->ndctl()
implementation where it has a stable buffer to evaluate.
Signed-off-by: Meng Xu <mengxu.gatech@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan reports:
The patch 62232e45f4a2: "libnvdimm: control (ioctl) messages for
nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the
following static checker warning:
drivers/nvdimm/bus.c:1018 __nd_ioctl()
warn: integer overflows 'buf_len'
From a casual review, this seems like it might be a real bug. On
the first iteration we load some data into in_env[]. On the second
iteration we read a use controlled "in_size" from nd_cmd_in_size().
It can go up to UINT_MAX - 1. A high number means we will fill the
whole in_env[] buffer. But we potentially keep looping and adding
more to in_len so now it can be any value.
It simple enough to change, but it feels weird that we keep looping
even though in_env is totally full. Shouldn't we just return an
error if we don't have space for desc->in_num.
We keep looping because the size of the total input is allowed to be
bigger than the 'envelope' which is a subset of the payload that tells
us how much data to expect. For safety explicitly check that buf_len
does not overflow which is what the checker flagged.
Cc: <stable@vger.kernel.org>
Fixes: 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus..."
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
mmio_flush_range() suffers from a lack of clearly-defined semantics,
and is somewhat ambiguous to port to other architectures where the
scope of the writeback implied by "flush" and ordering might matter,
but MMIO would tend to imply non-cacheable anyway. Per the rationale
in 67a3e8fe90 ("nd_blk: change aperture mapping from WC to WB"), the
only existing use is actually to invalidate clean cache lines for
ARCH_MEMREMAP_PMEM type mappings *without* writeback. Since the recent
cleanup of the pmem API, that also now happens to be the exact purpose
of arch_invalidate_pmem(), which would be a far more well-defined tool
for the job.
Rather than risk potentially inconsistent implementations of
mmio_flush_range() for the sake of one callsite, streamline things by
removing it entirely and instead move the ARCH_MEMREMAP_PMEM related
definitions up to the libnvdimm level, so they can be shared by NFIT
as well. This allows NFIT to be enabled for arm64.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Clearing errors or badblocks during a BTT write requires sending an ACPI
DSM, which means potentially sleeping. Since a BTT IO happens in atomic
context (preemption disabled, spinlocks may be held), we cannot perform
error clearing in the course of an IO. Due to this error clearing for
BTT IOs has hitherto been disabled.
In this patch we move error clearing out of the atomic section, and thus
re-enable error clearing with BTTs. When we are about to add a block to
the free list, we check if it was previously marked as an error, and if
it was, we add it to the freelist, but also set a flag that says error
clearing will be required. We then drop the lane (ending the atomic
context), and send a zero buffer so that the error can be cleared. The
error flag in the free list is protected by the nd 'lane', and is set
only be a thread while it holds that lane. When the error is cleared,
the flag is cleared, but while holding a mutex for that freelist index.
When writing, we check for two things -
1/ If the freelist mutex is held or if the error flag is set. If so,
this is an error block that is being (or about to be) cleared.
2/ If the block is a known badblock based on nsio->bb
The second check is required because the BTT map error flag for a map
entry only gets set when an error LBA is read. If we write to a new
location that may not have the map error flag set, but still might be in
the region's badblock list, we can trigger an EIO on the write, which is
undesirable and completely avoidable.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
With the ACPI NFIT 'DSM' methods, acpi can be called from IO paths.
Specifically, the DSM to clear media errors is called during writes, so
that we can provide a writes-fix-errors model.
However it is easy to imagine a scenario like:
-> write through the nvdimm driver
-> acpi allocation
-> writeback, causes more IO through the nvdimm driver
-> deadlock
Fix this by using memalloc_noio_{save,restore}, which sets the GFP_NOIO
flag for the current scope when issuing commands/IOs that are expected
to clear errors.
Cc: <linux-acpi@vger.kernel.org>
Cc: <linux-nvdimm@lists.01.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for the error clearing rework, add sector_size in the
arena_info struct.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In btt_map_read, we read the map twice to make sure that the map entry
didn't change after we added it to the read tracking table. In
anticipation of expanding the use of the error bit, also make sure that
the error and zero flags are constant across the two map reads.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add helpers for converting a raw map entry to just the block number, or
either of the 'e' or 'z' flags in preparation for actually using the
error flag to mark blocks with media errors.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The IO context conversion for rw_bytes missed a case in the BTT write
path (btt_map_write) which should've been marked as atomic.
In reality this should not cause a problem, because map writes are to
small for nsio_rw_bytes to attempt error clearing, but it should be
fixed for posterity.
Add a might_sleep() in the non-atomic section of nsio_rw_bytes so that
things like the nfit unit tests, which don't actually sleep, can catch
bugs like this.
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Check memory allocation failures and return -ENOMEM in such cases, as
already done few lines below for another memory allocation.
This avoids NULL pointers dereference.
Cc: <stable@vger.kernel.org>
Fixes: 14e4945426 ("libnvdimm, btt: BTT updates for UEFI 2.7 format")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The old calculation assumed that the label space was 128k and the label
size is 128. With v1.2 labels where the label size is 256 this
calculation will return zero. We are saved by the fact that the
nsindex_size is always pre-initialized from a previous 128 byte
assumption and we are lucky that the index sizes turn out the same.
Fix this going forward in case we start encountering different
geometries of label areas besides 128k.
Since the label size can change from one call to the next, drop the
caching of nsindex_size.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we properly advertise the supported pte, pmd, and pud sizes,
restrict the supported alignments that can be set on a namespace. This
assumes that userspace was not previously relying on the ability to set
odd alignments. At least ndctl only ever supported setting the namespace
alignment to 4K, 2M, or 1G.
Cc: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The alignment of a DAX and PFN regions dictates the page sizes that can
be used to map the region. Even if the hardware page sizes are known the
actual range of supported page sizes that can be used with DAX depends
on the kernel configuration. As a result it's best that the kernel
advertises the alignments that should be used with these region types.
This patch adds the 'supported_alignments' region attribute to expose
this information to userspace.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[djbw: integrate with nd_size_select_show() rename and other fixups]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Prepare for other another consumer of this size selection scheme that is
not a 'sector size'.
Cc: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is useful to be able to know the position of a DIMM in an
interleave-set. Consider the case where the order of the DIMMs changes
causing a namespace to be invalidated because the interleave-set cookie no
longer matches. If the before and after state of each DIMM position is
known this state debugged by the system owner.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently libnvdimm uses HPAGE_SIZE as the default alignment for DAX and
PFN devices. HPAGE_SIZE is the default hugetlbfs page size and when
hugetlbfs is disabled it defaults to PAGE_SIZE. Given DAX has more
in common with THP than hugetlbfs we should proably be using
HPAGE_PMD_SIZE, but this is undefined when THP is disabled so lets just
give it a new name.
The other usage of HPAGE_SIZE in libnvdimm is when determining how large
the altmap should be. For the reasons mentioned above it doesn't really
make sense to use HPAGE_SIZE here either. PMD_SIZE seems to be safe to
use in generic code and it happens to match the vmemmap allocation block
on x86 and Power. It's still a hack, but it's a slightly nicer hack.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
__add_badblock_range() does not account sector alignment when
it sets 'num_sectors'. Therefore, an ARS error record range
spanning across two sectors is set to a single sector length,
which leaves the 2nd sector unprotected.
Change __add_badblock_range() to set 'num_sectors' properly.
Cc: <stable@vger.kernel.org>
Fixes: 0caeef63e6 ("libnvdimm: Add a poison list and export badblocks")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull more block updates from Jens Axboe:
"This is a followup for block changes, that didn't make the initial
pull request. It's a bit of a mixed bag, this contains:
- A followup pull request from Sagi for NVMe. Outside of fixups for
NVMe, it also includes a series for ensuring that we properly
quiesce hardware queues when browsing live tags.
- Set of integrity fixes from Dmitry (mostly), fixing various issues
for folks using DIF/DIX.
- Fix for a bug introduced in cciss, with the req init changes. From
Christoph.
- Fix for a bug in BFQ, from Paolo.
- Two followup fixes for lightnvm/pblk from Javier.
- Depth fix from Ming for blk-mq-sched.
- Also from Ming, performance fix for mtip32xx that was introduced
with the dynamic initialization of commands"
* 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
block: call bio_uninit in bio_endio
nvmet: avoid unneeded assignment of submit_bio return value
nvme-pci: add module parameter for io queue depth
nvme-pci: compile warnings in nvme_alloc_host_mem()
nvmet_fc: Accept variable pad lengths on Create Association LS
nvme_fc/nvmet_fc: revise Create Association descriptor length
lightnvm: pblk: remove unnecessary checks
lightnvm: pblk: control I/O flow also on tear down
cciss: initialize struct scsi_req
null_blk: fix error flow for shared tags during module_init
block: Fix __blkdev_issue_zeroout loop
nvme-rdma: unconditionally recycle the request mr
nvme: split nvme_uninit_ctrl into stop and uninit
virtio_blk: quiesce/unquiesce live IO when entering PM states
mtip32xx: quiesce request queues to make sure no submissions are inflight
nbd: quiesce request queues to make sure no submissions are inflight
nvme: kick requeue list when requeueing a request instead of when starting the queues
nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
...
* Introduce the _flushcache() family of memory copy helpers and use them
for persistent memory write operations on x86. The _flushcache()
semantic indicates that the cache is either bypassed for the copy
operation (movnt) or any lines dirtied by the copy operation are
written back (clwb, clflushopt, or clflush).
* Extend dax_operations with ->copy_from_iter() and ->flush()
operations. These operations and other infrastructure updates allow
all persistent memory specific dax functionality to be pushed into
libnvdimm and the pmem driver directly. It also allows dax-specific
sysfs attributes to be linked to a host device, for example:
/sys/block/pmem0/dax/write_cache
* Add support for the new NVDIMM platform/firmware mechanisms introduced
in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace
label format, extensions to the address-range-scrub command set, new
error injection commands, and a new BTT (block-translation-table)
layout. These updates support inter-OS and pre-OS compatibility.
* Fix a longstanding memory corruption bug in nfit_test.
* Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
capable.
* Miscellaneous fixes and small updates across libnvdimm and the nfit
driver.
Acknowledgements that came after the branch was pushed:
commit 6aa734a2f3 "libnvdimm, region, pmem: fix 'badblocks'
sysfs_get_dirent() reference lifetime"
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZXsUtAAoJEB7SkWpmfYgCOXcP/06bncqTEvtgrOF2b7O8w+8e
mTySD51RUn6UpkFd37SMRch+rmbojuqj465TAE7XIXgyLgIOJixKaTlHYUoEnP3X
rC4Q/g5mN0nittMDwL+vQaa1lQWd2kbjOlrqCgnLHVEEJpHmiQussunjvir4G1U7
5ROooP8W+qMK5y5XPLJAg/gyGhYkjpRSlDg3Eo5meZZ0IdURbI7+WCLKrPcQUERT
WmDc9gLhJdSQVxBV/0m2gdAER4ADmFjcrlm8kjXRBhdlUmEFjM0zpvlHJutHTkks
rNZWCmCJs0Sas+DmRKszFmvVFHRHqUVA3dWK4P6PJEX+tl7BwlPcxpbfacHTG2EZ
btArFc584DZ+EIrim1cXXRvLFlxnKOFBtBeteFs7l2kZjEcN6S4I5OZgTyeDpe/i
2WDpHWLQWibkcIzH9y1EuMBkYnQjTJl1pecHzJoTaC+jAQ+opLiY7EecjLmCmQS6
MBYUeQZNufLGfT5b8KXfpKeiXhpFkYrAGp+ErfoH/6RKy2zqTdagN1yVhos2y+a7
JJu/Weetpn8qv+KTGUShO8TGyWv3wU46YkG2rKWl0FL1+C+6LMMw1/L0A97lwVlg
BpypVVyaNu1D22ifZ8O5wbqPIYghoZ5akA0CiduhX19cpl5rTeTd8EvLjvcYhZEZ
pMHuMAqIcIyLhIe2/sRF
=xKQB
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"libnvdimm updates for the latest ACPI and UEFI specifications. This
pull request also includes new 'struct dax_operations' enabling to
undo the abuse of copy_user_nocache() for copy operations to pmem.
The dax work originally missed 4.12 to address concerns raised by Al.
Summary:
- Introduce the _flushcache() family of memory copy helpers and use
them for persistent memory write operations on x86. The
_flushcache() semantic indicates that the cache is either bypassed
for the copy operation (movnt) or any lines dirtied by the copy
operation are written back (clwb, clflushopt, or clflush).
- Extend dax_operations with ->copy_from_iter() and ->flush()
operations. These operations and other infrastructure updates allow
all persistent memory specific dax functionality to be pushed into
libnvdimm and the pmem driver directly. It also allows dax-specific
sysfs attributes to be linked to a host device, for example:
/sys/block/pmem0/dax/write_cache
- Add support for the new NVDIMM platform/firmware mechanisms
introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2
namespace label format, extensions to the address-range-scrub
command set, new error injection commands, and a new BTT
(block-translation-table) layout. These updates support inter-OS
and pre-OS compatibility.
- Fix a longstanding memory corruption bug in nfit_test.
- Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
capable.
- Miscellaneous fixes and small updates across libnvdimm and the nfit
driver.
Acknowledgements that came after the branch was pushed: commit
6aa734a2f3 ("libnvdimm, region, pmem: fix 'badblocks'
sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani
<toshi.kani@hpe.com>"
* tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits)
libnvdimm, namespace: record 'lbasize' for pmem namespaces
acpi/nfit: Issue Start ARS to retrieve existing records
libnvdimm: New ACPI 6.2 DSM functions
acpi, nfit: Show bus_dsm_mask in sysfs
libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.
acpi, nfit: Enable DSM pass thru for root functions.
libnvdimm: passthru functions clear to send
libnvdimm, btt: convert some info messages to warn/err
libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime
libnvdimm: fix the clear-error check in nsio_rw_bytes
libnvdimm, btt: fix btt_rw_page not returning errors
acpi, nfit: quiet invalid block-aperture-region warnings
libnvdimm, btt: BTT updates for UEFI 2.7 format
acpi, nfit: constify *_attribute_group
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
libnvdimm, pmem, dax: export a cache control attribute
dax: convert to bitmask for flags
dax: remove default copy_from_iter fallback
libnvdimm, nfit: enable support for volatile ranges
libnvdimm, pmem: fix persistence warning
...
Commit f979b13c3c "libnvdimm, label: honor the lba size specified in
v1.2 labels") neglected to update the 'lbasize' in the label when the
namespace sector_size attribute was written. We need this value in the
label for inter-OS / pre-OS compatibility.
Fixes: f979b13c3c ("libnvdimm, label: honor the lba size specified in v1.2 labels")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently if some one try to advance bvec beyond it's size we simply
dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
This simply means that we endup dereferencing/corrupting random memory
region.
Sane reaction would be to propagate error back to calling context
But bvec_iter_advance's calling context is not always good for error
handling. For safity reason let truncate iterator size to zero which
will break external iteration loop which prevent us from unpredictable
memory range corruption. And even it caller ignores an error, it will
corrupt it's own bvecs, not others.
This patch does:
- Return error back to caller with hope that it will react on this
- Truncate iterator size
Code was added long time ago here 4550dd6c, luckily no one hit it
in real life :)
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[hch: switch to true/false returns instead of errno values]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore it's code and fail bio with EIO. Let's return real error to
upper layer, so later caller may react accordingly.
In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled,
so it is reasonable to fold it in to one function.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[hch: merged with the latest block tree,
return bool from bio_integrity_prep]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Have dsm functions called via the pass thru mechanism also
be checked against clear to send.
Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Some critical messages such as IO errors, metadata failures were printed
with dev_info. Make them louder by upgrading them to dev_warn or
dev_error.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We need to hold a reference on the 'dirent' until we are sure there are
no more notifications that will be sent. As noted in the new comments we
take advantage of the fact that the references are taken and dropped
under device_lock() and that nd_device_notify() holds device_lock() over
new badblocks notifications. The notifications that happen when
badblocks are cleared only occur while the device is active.
Also take the opportunity to fix up the error messages to report the
user visible effect of a sysfs_get_dirent() failure.
Fixes: 975750a98c ("libnvdimm, pmem: Add sysfs notifications to badblocks")
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A leftover from the 'bandaid' fix that disabled BTT error clearing in
rw_bytes resulted in an incorrect check. After we converted these checks
over to use the NVDIMM_IO_ATOMIC flag, the ndns->claim check was both
redundant, and incorrect. Remove it.
Fixes: 3ae3d67ba7 ("libnvdimm: add an atomic vs process context flag to rw_bytes")
Cc: <stable@vger.kernel.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
btt_rw_page was not propagating errors frm btt_do_bvec, resulting in any
IO errors via the rw_page path going unnoticed. the pmem driver recently
fixed this in e10624f pmem: fail io-requests to known bad blocks
but same problem in BTT went neglected.
Fixes: 5212e11fde ("nd_btt: atomic sector updates")
Cc: <stable@vger.kernel.org>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This state is already visible by userspace since the BLK region will not
be enabled, and it is otherwise benign as it usually indicates that the
DIMM is not configured.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The UEFI 2.7 specification defines an updated BTT metadata format,
bumping the revision to 2.0. Add support for the new format, while
retaining compatibility for the old 1.1 format.
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Linda Knippers <linda.knippers@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set dax_write_cache() if it is.
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The dax_flush() operation can be turned into a nop on platforms where
firmware arranges for cpu caches to be flushed on a power-fail event.
The ACPI 6.2 specification defines a mechanism for the platform to
indicate this capability so the kernel can select the proper default.
However, for other platforms, the administrator must toggle this setting
manually.
Given this flush setting is a dax-specific mechanism we advertise it
through a 'dax' attribute group hanging off a host device. For example,
a 'pmem0' block-device gets a 'dax' sysfs-subdirectory with a
'write_cache' attribute to control response to dax cache flush requests.
This is similar to the 'queue/write_cache' attribute that appears under
block devices.
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Allow volatile nfit ranges to participate in all the same infrastructure
provided for persistent memory regions. A resulting resulting namespace
device will still be called "pmem", but the parent region type will be
"nd_volatile". This is in preparation for disabling the dax ->flush()
operation in the pmem driver when it is hosted on a volatile range.
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The pmem driver assumes if platform firmware describes the memory
devices associated with a persistent memory range and
CONFIG_ARCH_HAS_PMEM_API=y that it has all the mechanism necessary to
flush data to a power-fail safe zone. We warn if the firmware does not
describe memory devices, but we also need to warn if the architecture
does not claim pmem support.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that all callers of the pmem api have been converted to dax helpers that
call back to the pmem driver, we can remove include/linux/pmem.h and
asm/pmem.h.
Cc: <x86@kernel.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Kill this globally defined wrapper and move to libnvdimm so that we can
ultimately remove include/linux/pmem.h and asm/pmem.h.
Cc: <x86@kernel.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We only call blk_queue_bounce for request-based drivers, so stop messing
with it for make_request based drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With all handling of the CONFIG_ARCH_HAS_PMEM_API case being moved to
libnvdimm and the pmem driver directly we do not need to provide global
wrappers and fallbacks in the CONFIG_ARCH_HAS_PMEM_API=n case. The pmem
driver will simply not link to arch_wb_cache_pmem() in that case. Same
as before, pmem flushing is only defined for x86_64, via
clean_cache_range(), but it is straightforward to add other archs in the
future.
arch_wb_cache_pmem() is an exported function since the pmem module needs
to find it, but it is privately declared in drivers/nvdimm/pmem.h because
there are no consumers outside of the pmem driver.
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so add a dax operation
to allow pmem to take this extra action, but skip it for other dax
capable devices that do not provide a flush routine.
An example for this differentiation might be a volatile ram disk where
there is no expectation of persistence. In fact the pmem driver itself might
front such an address range specified by the NFIT. So, this "no flush"
property might be something passed down by the bus / libnvdimm.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Sysfs "badblocks" information may be updated during run-time that:
- MCE, SCI, and sysfs "scrub" may add new bad blocks
- Writes and ioctl() may clear bad blocks
Add support to send sysfs notifications to sysfs "badblocks" file
under region and pmem directories when their badblocks information
is re-evaluated (but is not necessarily changed) during run-time.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Linda Knippers <linda.knippers@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The rules for which version of the label specification are in effect at
any given point in time are as follows:
1/ If a DIMM has an existing / valid index block then the version
specified is used regardless if it is a previous version.
2/ By default when the kernel is initializing new index blocks the
latest specification version (v1.2 at time of writing) is used.
3/ An environment that wants to force create v1.1 label-sets must
arrange for userspace to disable all active regions / namespaces /
dimms and write a valid set of v1.1 index blocks to the dimms.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Starting with v1.2 labels, 'address abstractions' can be hinted via an
address abstraction id that implies an info-block format. The standard
address abstraction in the specification is the v2 format of the
Block-Translation-Table (BTT). Support for that is saved for a later
patch, for now we add support for the Linux supported address
abstractions BTT (v1), PFN, and DAX.
The new 'holder_class' attribute for namespace devices is added for
tooling to specify the 'abstraction_guid' to store in the namespace label.
For v1.1 labels this field is undefined and any setting of
'holder_class' away from the default 'none' value will only have effect
until the driver is unloaded. Setting 'holder_class' requires that
whatever device tries to claim the namespace must be of the specified
class.
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The v1.2 namespace label specification adds a fletcher checksum to each
label instance. Add generation and validation support for the new field.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The v1.2 namespace label specification requires 'nlabel' and 'position'
to be valid for the first ("lowest dpa") label in the set. It also
requires all non-first labels to set those fields to 0xff.
Linux does not much care if these values are correct, because we can
just trust the count of labels with the matching uuid like the v1.1
case. However, we set them correctly in case other environments care.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Starting with the v1.2 definition of namespace labels, the isetcookie
field is populated and validated for blk-aperture namespaces. This adds
some safety against inadvertent copying of namespace labels from one
DIMM-device to another.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The type_guid refers to the "Address Range Type GUID" for the region
backing a namespace as defined the ACPI NFIT (NVDIMM Firmware Interface
Table). This 'type' identifier specifies an access mechanism for the
given namespace. This capability replaces the confusing usage of the
'NSLABEL_FLAG_LOCAL' flag to indicate a block-aperture-mode namespace.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Previously we only honored the lba size for blk-aperture mode
namespaces. For pmem namespaces the lba size was just assumed to be 512.
With the new v1.2 label definition and compatibility with other
operating environments, the ->lbasize property is now respected for pmem
namespaces.
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The interleave-set-cookie algorithm is extended to incorporate all the
same components that are used to generate an nvdimm unique-id. For
backwards compatibility we still maintain the old v1.1 definition.
Reported-by: Nicholas Moulin <nicholas.w.moulin@intel.com>
Reported-by: Kaushik Kanetkar <kaushik.a.kanetkar@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In support of improved interoperability between operating systems and pre-boot
environments the Intel proposed NVDIMM Namespace Specification [1], has been
adopted and modified to the the UEFI 2.7 NVDIMM Label Protocol [2].
Update the definitions of the namespace label data structures so that the new
format can be supported alongside the existing label format.
The new specification changes the default label size to 256 bytes, so
everywhere that relied on sizeof(struct nd_namespace_label) must now use the
sizeof_namespace_label() helper.
There should be no functional differences from these changes as the
default is still the v1.1 128-byte format. Future patches will move the
default to the v1.2 definition.
[1]: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
[2]: http://www.uefi.org/sites/default/files/resources/UEFI_Spec_2_7.pdf
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes are not
cached. It is sufficient for the writes to be flushed to a cpu-store-buffer
(non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync()
to ensure data-writes have reached a power-fail-safe zone in the platform. The
fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn
around and fence previous writes with an "sfence".
Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
memcpy_flushcache, that guarantee that the destination buffer is not dirty in
the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines
will be used to replace the "pmem api" (include/linux/pmem.h +
arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
config symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
otherwise.
This is meant to satisfy the concern from Linus that if a driver wants to do
something beyond the normal nocache semantics it should be something private to
that driver [1], and Al's concern that anything uaccess related belongs with
the rest of the uaccess code [2].
The first consumer of this interface is a new 'copy_from_iter' dax operation so
that pmem can inject cache maintenance operations without imposing this
overhead on other dax-capable drivers.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Hoist the libnvdimm helper as an inline helper to linux/uuid.h
using an auxiliary const variable uuid_null in lib/uuid.c.
[hch: also add the guid variant. Both do the same but I'd like
to keep casts to a minimum]
The common helper uses the new abstract type uuid_t * instead of
u8 *.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[hch: added guid_is_null]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Pull libnvdimm fixes from Dan Williams:
"Incremental fixes and a small feature addition on top of the main
libnvdimm 4.12 pull request:
- Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
The size regression is fixed by moving all dax helpers into the
dax-core and only specifying "select DAX" for FS_DAX and
dax-capable drivers. He also asked for clarification of the
NR_DEV_DAX config option which, on closer look, does not need to be
a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
for good measure.
- Ben's attention to detail on -stable patch submissions caught a
case where the recent fixes to arch_copy_from_iter_pmem() missed a
condition where we strand dirty data in the cache. This is tagged
for -stable and will also be included in the rework of the pmem api
to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.
- Vishal adds a feature that missed the initial pull due to pending
review feedback. It allows the kernel to clear media errors when
initializing a BTT (atomic sector update driver) instance on a pmem
namespace.
- Ross noticed that the dax_device + dax_operations conversion broke
__dax_zero_page_range(). The nvdimm unit tests fail to check this
path, but xfstests immediately trips over it. No excuse for missing
this before submitting the 4.12 pull request.
These all pass the nvdimm unit tests and an xfstests spot check. The
set has received a build success notification from the kbuild robot"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
filesystem-dax: fix broken __dax_zero_page_range() conversion
libnvdimm, btt: ensure that initializing metadata clears poison
libnvdimm: add an atomic vs process context flag to rw_bytes
x86, pmem: Fix cache flushing for iovec write < 8 bytes
device-dax: kill NR_DEV_DAX
block, dax: move "select DAX" from BLOCK to FS_DAX
device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
If we had badblocks/poison in the metadata area of a BTT, recreating the
BTT would not clear the poison in all cases, notably the flog area. This
is because rw_bytes will only clear errors if the request being sent
down is 512B aligned and sized.
Make sure that when writing the map and info blocks, the rw_bytes being
sent are of the correct size/alignment. For the flog, instead of doing
the smaller log_entry writes only, first do a 'wipe' of the entire area
by writing zeroes in large enough chunks so that errors get cleared.
Cc: Andy Rudoff <andy.rudoff@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
nsio_rw_bytes can clear media errors, but this cannot be done while we
are in an atomic context due to locking within ACPI. From the BTT,
->rw_bytes may be called either from atomic or process context depending
on whether the calls happen during initialization or during IO.
During init, we want to ensure error clearing happens, and the flag
marking process context allows nsio_rw_bytes to do that. When called
during IO, we're in atomic context, and error clearing can be skipped.
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
There are many code paths opencoding kvmalloc. Let's use the helper
instead. The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator. E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation. This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously. There is no guarantee something like that happens
though.
This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.
Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Region media error reporting: A libnvdimm region device is the parent
to one or more namespaces. To date, media errors have been reported via
the "badblocks" attribute attached to pmem block devices for namespaces
in "raw" or "memory" mode. Given that namespaces can be in "device-dax"
or "btt-sector" mode this new interface reports media errors
generically, i.e. independent of namespace modes or state. This
subsequently allows userspace tooling to craft "ACPI 6.1 Section
9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and
submit them via the ioctl path for NVDIMM root bus devices.
* Introduce 'struct dax_device' and 'struct dax_operations': Prompted by
a request from Linus and feedback from Christoph this allows for dax
capable drivers to publish their own custom dax operations. This fixes
the broken assumption that all dax operations are related to a
persistent memory device, and makes it easier for other architectures
and platforms to add customized persistent memory support.
* 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger memory
controllers to drain write-pending buffers that would otherwise be
flushed automatically by the platform ADR (asynchronous-DRAM-refresh)
mechanism at a power loss event. Support for "locked" DIMMs is included
to prevent namespaces from surfacing when the namespace label data area
is locked. Finally, fixes for various reported deadlocks and crashes,
also tagged for -stable.
* ACPI / nfit driver updates: General updates of the nfit driver to add
DSM command overrides, ACPI 6.1 health state flags support, DSM payload
debug available by default, and various fixes.
Acknowledgements that came after the branch was pushed:
commmit 565851c972 "device-dax: fix sysfs attribute deadlock"
Tested-by: Yi Zhang <yizhan@redhat.com>
commit 23f4984483 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani <toshi.kani@hpe.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZDONJAAoJEB7SkWpmfYgC3SsP/2KrLvTUcz646ViuPOgZ2cC4
W6wAx6cvDSt+H52kLnFEsYoFt7WAj20ggPirb/Bc5jkGlvwE0lT9Xtmso9GpVkYT
J9ZJ9pP/4YaAD3II1gmTwaUjYi0FxoOdx3Eb92yuWkO/8ylz4b2Nu3cBpYwyziGQ
nIfEVwDXRLE86u6x0bWuf6TlVuvsbdiAI55CDqDMVQC6xIOLbSez7b8QIHlpiKEb
Mw+xqdQva0esoreZEOXEhWNO+qtfILx8/ceBEGTNMp4e/JjZ2FbrSNplM+9bH5k7
ywqP8lW+mBEw0fmBBkYoVG/xyesiiBb55JLnbi8Ew+7IUxw8a3iV7wftRi62lHcK
zAjsHe4L+MansgtZsCL8wluvIPaktAdtB4xr7l9VNLKRYRUG73jEWU0gcUNryHIL
BkQJ52pUS1PkClyAsWbBBHl1I/CvzVPd21VW0YELmLR4OywKy1c+eKw2bcYgjrb4
59HZSv6S6EoKaQC+2qvVNpePil7cdfg5V2ubH/ki9HoYVyoxDptEWHnvf0NNatIH
Y7mNcOPvhOksJmnKSyHbDjtRur7WoHIlC9D7UjEFkSBWsKPjxJHoidN4SnCMRtjQ
WKQU0seoaKj04b68Bs/Qm9NozVgnsPFIUDZeLMikLFX2Jt7YSPu+Jmi2s4re6WLh
TmJQ3Ly9t3o3/weHSzmn
=Ox0s
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"The bulk of this has been in multiple -next releases. There were a few
late breaking fixes and small features that got added in the last
couple days, but the whole set has received a build success
notification from the kbuild robot.
Change summary:
- Region media error reporting: A libnvdimm region device is the
parent to one or more namespaces. To date, media errors have been
reported via the "badblocks" attribute attached to pmem block
devices for namespaces in "raw" or "memory" mode. Given that
namespaces can be in "device-dax" or "btt-sector" mode this new
interface reports media errors generically, i.e. independent of
namespace modes or state.
This subsequently allows userspace tooling to craft "ACPI 6.1
Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
requests and submit them via the ioctl path for NVDIMM root bus
devices.
- Introduce 'struct dax_device' and 'struct dax_operations': Prompted
by a request from Linus and feedback from Christoph this allows for
dax capable drivers to publish their own custom dax operations.
This fixes the broken assumption that all dax operations are
related to a persistent memory device, and makes it easier for
other architectures and platforms to add customized persistent
memory support.
- 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger
memory controllers to drain write-pending buffers that would
otherwise be flushed automatically by the platform ADR
(asynchronous-DRAM-refresh) mechanism at a power loss event.
Support for "locked" DIMMs is included to prevent namespaces from
surfacing when the namespace label data area is locked. Finally,
fixes for various reported deadlocks and crashes, also tagged for
-stable.
- ACPI / nfit driver updates: General updates of the nfit driver to
add DSM command overrides, ACPI 6.1 health state flags support, DSM
payload debug available by default, and various fixes.
Acknowledgements that came after the branch was pushed:
- commmit 565851c972 "device-dax: fix sysfs attribute deadlock":
Tested-by: Yi Zhang <yizhan@redhat.com>
- commit 23f4984483 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani <toshi.kani@hpe.com>"
* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
libnvdimm, pfn: fix 'npfns' vs section alignment
libnvdimm: handle locked label storage areas
libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
brd: fix uninitialized use of brd->dax_dev
block, dax: use correct format string in bdev_dax_supported
device-dax: fix sysfs attribute deadlock
libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
libnvdimm: rework region badblocks clearing
acpi, nfit: kill ACPI_NFIT_DEBUG
libnvdimm: fix clear length of nvdimm_forget_poison()
libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
libnvdimm, region: sysfs trigger for nvdimm_flush()
libnvdimm: fix phys_addr for nvdimm_clear_poison
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block: remove block_device_operations ->direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
filesystem-dax: convert to dax_direct_access()
Revert "block: use DAX for partition table reads"
ext2, ext4, xfs: retrieve dax_device for iomap operations
...
Fix failures to create namespaces due to the vmem_altmap not advertising
enough free space to store the memmap.
WARNING: CPU: 15 PID: 8022 at arch/x86/mm/init_64.c:656 arch_add_memory+0xde/0xf0
[..]
Call Trace:
dump_stack+0x63/0x83
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
arch_add_memory+0xde/0xf0
devm_memremap_pages+0x244/0x440
pmem_attach_disk+0x37e/0x490 [nd_pmem]
nd_pmem_probe+0x7e/0xa0 [nd_pmem]
nvdimm_bus_probe+0x71/0x120 [libnvdimm]
driver_probe_device+0x2bb/0x460
bind_store+0x114/0x160
drv_attr_store+0x25/0x30
In commit 658922e57b "libnvdimm, pfn: fix memmap reservation sizing"
we arranged for the capacity to be allocated, but failed to also update
the 'npfns' parameter. This leads to cases where there is enough
capacity reserved to hold all the allocated sections, but
vmemmap_populate_hugepages() still encounters -ENOMEM from
altmap_alloc_block_buf().
This fix is a stop-gap until we can teach the core memory hotplug
implementation to permit sub-section hotplug.
Cc: <stable@vger.kernel.org>
Fixes: 658922e57b ("libnvdimm, pfn: fix memmap reservation sizing")
Reported-by: Anisha Allada <anisha.allada@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Per the latest version of the "NVDIMM DSM Interface Example" [1], the
label data retrieval routine can report a "locked" status. In this case
all regions associated with that DIMM are disabled until the label area
is unlocked. Provide generic libnvdimm enabling for NVDIMMs with label
data area locking capabilities.
[1]: http://pmem.io/documents/
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This is a preparation patch for handling locked nvdimm label regions, a
new concept as introduced by the latest DSM document on pmem.io [1]. A
future patch will leverage nvdimm_set_locked() at DIMM probe time to
flag regions that can not be enabled. There should be no functional
difference resulting from this change.
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdf
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull x86 mm updates from Ingo Molnar:
"The main x86 MM changes in this cycle were:
- continued native kernel PCID support preparation patches to the TLB
flushing code (Andy Lutomirski)
- various fixes related to 32-bit compat syscall returning address
over 4Gb in applications, launched from 64-bit binaries - motivated
by C/R frameworks such as Virtuozzo. (Dmitry Safonov)
- continued Intel 5-level paging enablement: in particular the
conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)
- x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)
- ... plus misc updates, fixes and cleanups"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
x86/mm: Fix flush_tlb_page() on Xen
x86/mm: Make flush_tlb_mm_range() more predictable
x86/mm: Remove flush_tlb() and flush_tlb_current_task()
x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
x86/mm/64: Fix crash in remove_pagetable()
Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
x86/boot/e820: Remove a redundant self assignment
x86/mm: Fix dump pagetables for 4 levels of page tables
x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
x86/espfix: Add support for 5-level paging
x86/kasan: Extend KASAN to support 5-level paging
x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
x86/paravirt: Add 5-level support to the paravirt code
x86/mm: Define virtual memory map for 5-level paging
x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
x86/boot: Detect 5-level paging support
x86/mm/numa: Remove numa_nodemask_from_meminfo()
...
This continues the 4.11 status quo of disabling of error clearing from
the BTT I/O path. Toshi found that even though we have eliminated all
the libnvdimm sources of sleeping-while-atomic triggers, we still have
sleeping operations that will occur in the path to send the ACPI DSM to
the DIMM to clear the error:
BUG: sleeping function called from invalid context at mm/slab.h:432
in_atomic(): 1, irqs_disabled(): 0, pid: 13353, name: dd
Call Trace:
dump_stack+0x86/0xc3
___might_sleep+0x17d/0x250
__might_sleep+0x4a/0x80
__kmalloc+0x1c0/0x2e0
acpi_os_allocate_zeroed+0x2d/0x2f
acpi_evaluate_object+0x59/0x3b1
acpi_evaluate_dsm+0xbd/0x10c
acpi_nfit_ctl+0x1ef/0x7c0 [nfit]
? nsio_rw_bytes+0x152/0x280
nvdimm_clear_poison+0x77/0x140
nsio_rw_bytes+0x18f/0x280
btt_write_pg+0x1d4/0x3d0 [nd_btt]
btt_make_request+0x119/0x2d0 [nd_btt]
A solution for tracking and handling media errors natively in the BTT is
needed.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A debug patch to turn the standard device_lock() into something that
lockdep can analyze yielded the following:
======================================================
[ INFO: possible circular locking dependency detected ]
4.11.0-rc4+ #106 Tainted: G O
-------------------------------------------------------
lt-libndctl/1898 is trying to acquire lock:
(&dev->nvdimm_mutex/3){+.+.+.}, at: [<ffffffffc023c948>] nd_attach_ndns+0x178/0x1b0 [libnvdimm]
but task is already holding lock:
(&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffc022e0b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
lock_acquire+0xf6/0x1f0
__mutex_lock+0x88/0x980
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_namespace_capacity+0x1b/0x40 [libnvdimm]
nvdimm_namespace_common_probe+0x230/0x510 [libnvdimm]
nd_pmem_probe+0x14/0x180 [nd_pmem]
nvdimm_bus_probe+0xa9/0x260 [libnvdimm]
-> #0 (&dev->nvdimm_mutex/3){+.+.+.}:
__lock_acquire+0x1107/0x1280
lock_acquire+0xf6/0x1f0
__mutex_lock+0x88/0x980
mutex_lock_nested+0x1b/0x20
nd_attach_ndns+0x178/0x1b0 [libnvdimm]
nd_namespace_store+0x308/0x3c0 [libnvdimm]
namespace_store+0x87/0x220 [libnvdimm]
In this case '&dev->nvdimm_mutex/3' mirrors '&dev->mutex'.
Fix this by replacing the use of device_lock() with nvdimm_bus_lock() to protect
nd_{attach,detach}_ndns() operations.
Cc: <stable@vger.kernel.org>
Fixes: 8c2f7e8658 ("libnvdimm: infrastructure for btt devices")
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.
The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)
But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.
One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)
So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and and restructure the pmem completion-wait and teardown machinery:
Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.
This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.
This speeds up things while also making the pmem refcounting more robust going
forward.
Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Toshi noticed that the new support for a region-level badblocks missed
the case where errors are cleared due to BTT I/O.
An initial attempt to fix this ran into a "sleeping while atomic"
warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
However, that lock is not needed since we are not acting on any data that
is subject to change under that lock. The badblocks instance has its own
internal lock to handle mutations of the error list.
So, in order to make it clear that we are just acting on region devices,
rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
Eliminate the lock and consolidate all support routines for the new
nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the
opportunity to cleanup to some unnecessary casts, make the calling
convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
resource with the minimal struct clear_badblocks_context, and use the
DEVICE_ATTR macro.
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
ND_CMD_CLEAR_ERROR command returns 'clear_err.cleared', the length
of error actually cleared, which may be smaller than its requested
'len'.
Change nvdimm_clear_poison() to call nvdimm_forget_poison() with
'clear_err.cleared' when this value is valid.
Cc: <stable@vger.kernel.org>
Fixes: e046114af5 ("libnvdimm: clear the internal poison_list when clearing badblocks")
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The following BUG was observed when nd_pmem_notify() was called
for a BTT device. The use of a pmem_device pointer is not valid
with BTT.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
IP: nd_pmem_notify+0x30/0xf0 [nd_pmem]
Call Trace:
nd_device_notify+0x40/0x50
child_notify+0x10/0x20
device_for_each_child+0x50/0x90
nd_region_notify+0x20/0x30
nd_device_notify+0x40/0x50
nvdimm_region_notify+0x27/0x30
acpi_nfit_scrub+0x341/0x590 [nfit]
process_one_work+0x197/0x450
worker_thread+0x4e/0x4a0
kthread+0x109/0x140
Fix nd_pmem_notify() by setting nd_region and badblocks pointers
properly for BTT.
Cc: <stable@vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Fixes: 719994660c ("libnvdimm: async notification support")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.
When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
request. These requests are typically associated with filesystem
metadata updates. However, when a namespace is in device-dax mode,
userspace (think database metadata) needs another path to perform the
same flushing. In other words this is not required to make data
persistent, but in the case of metadata it allows for a smaller failure
domain in the unlikely event of an ADR failure.
The new 'deep_flush' attribute is visible when the individual DIMMs
backing a given interleave-set are described by platform firmware. In
ACPI terms this is "NVDIMM Region Mapping Structures" and associated
"Flush Hint Address Structures". Reads return "1" if the region supports
triggering WPQ flushes on all DIMMs. Reads return "0" the flush
operation is a platform nop, and in that case the attribute is
read-only.
Why sysfs and not an ioctl? An ioctl requires establishing a new
ioctl function number space for device-dax. Given that this would be
called on a device-dax fd an application could be forgiven for
accidentally calling this on a filesystem-dax fd. Placing this interface
in libnvdimm sysfs removes that potential for collision with a
filesystem ioctl, and it keeps ioctls out of the generic device-dax
implementation.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
nvdimm_clear_poison() expects a physical address, not an offset.
Fix nsio_rw_bytes() to call nvdimm_clear_poison() with a physical
address.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
serves no real benefit aside from affording a more generic function name
than the x86-specific 'mcsafe'. However this would not be the first time
that x86 terminology leaked into the global namespace. For lack of
better name, just use memcpy_mcsafe() directly.
This conversion also catches a place where we should have been using
plain memcpy, acpi_nfit_blk_single_io().
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In the case where a dimm does not have any associated flush hints the
ndrd->flush_wpq array may be uninitialized leading to crashes with the
following signature:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: region_visible+0x10f/0x160 [libnvdimm]
Call Trace:
internal_create_group+0xbe/0x2f0
sysfs_create_groups+0x40/0x80
device_add+0x2d8/0x650
nd_async_device_register+0x12/0x40 [libnvdimm]
async_run_entry_fn+0x39/0x170
process_one_work+0x212/0x6c0
? process_one_work+0x197/0x6c0
worker_thread+0x4e/0x4a0
kthread+0x10c/0x140
? process_one_work+0x6c0/0x6c0
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x31/0x40
Cc: <stable@vger.kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Fixes: f284a4f237 ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Setup a dax_device to have the same lifetime as the pmem block device
and add a ->direct_access() method that is equivalent to
pmem_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old pmem_direct_access() will be removed.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This reverts commit 4aa5615e08 "libnvdimm: band aid btt vs clear
poison locking".
Now that poison list locking has been converted to a spinlock and poison
list entry allocation during i/o has been converted to GFP_NOWAIT,
revert the band-aid that disabled error clearing from btt i/o.
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The following warning results from holding a lane spinlock,
preempt_disable(), or the btt map spinlock and then trying to take the
reconfig_mutex to walk the poison list and potentially add new entries.
BUG: sleeping function called from invalid context at kernel/locking/mutex.
c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
[..]
Call Trace:
dump_stack+0x85/0xc8
___might_sleep+0x184/0x250
__might_sleep+0x4a/0x90
__mutex_lock+0x58/0x9b0
? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
? acpi_nfit_forget_poison+0x79/0x80 [nfit]
? _raw_spin_unlock+0x27/0x40
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_forget_poison+0x25/0x50 [libnvdimm]
nvdimm_clear_poison+0x106/0x140 [libnvdimm]
nsio_rw_bytes+0x164/0x270 [libnvdimm]
btt_write_pg+0x1de/0x3e0 [nd_btt]
? blk_queue_enter+0x30/0x290
btt_make_request+0x11a/0x310 [nd_btt]
? blk_queue_enter+0xb7/0x290
? blk_queue_enter+0x30/0x290
generic_make_request+0x118/0x3b0
A spinlock is introduced to protect the poison list. This allows us to not
having to acquire the reconfig_mutex for touching the poison list. The
add_poison() function has been broken out into two helper functions. One to
allocate the poison entry and the other to apppend the entry. This allows us
to unlock the poison_lock in non-I/O path and continue to be able to allocate
the poison entry with GFP_KERNEL. We will use GFP_NOWAIT in the I/O path in
order to satisfy being in atomic context.
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Providing mechanism to clear poison list via the ndctl ND_CMD_CLEAR_ERROR
call. We will update the poison list and also the badblocks at region level
if the region is in dax mode or in pmem mode and not active. In other
words we force badblocks to be cleared through write requests if the
address is currently accessed through a block device, otherwise it can
only be done via the ioctl+dsm path.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Adding sysfs attribute in order to export the physical address of the
region. This is for supporting of user app poison clear via
ND_IOCTL_CLEAR_ERROR.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
badblocks sysfs file will be export at region level. When nvdimm event
notifier happens for NVDIMM_REVALIATE_POISON, the badblocks in the
region will be updated.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The following warning results from holding a lane spinlock,
preempt_disable(), or the btt map spinlock and then trying to take the
reconfig_mutex to walk the poison list and potentially add new entries.
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
[..]
Call Trace:
dump_stack+0x85/0xc8
___might_sleep+0x184/0x250
__might_sleep+0x4a/0x90
__mutex_lock+0x58/0x9b0
? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
? acpi_nfit_forget_poison+0x79/0x80 [nfit]
? _raw_spin_unlock+0x27/0x40
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_forget_poison+0x25/0x50 [libnvdimm]
nvdimm_clear_poison+0x106/0x140 [libnvdimm]
nsio_rw_bytes+0x164/0x270 [libnvdimm]
btt_write_pg+0x1de/0x3e0 [nd_btt]
? blk_queue_enter+0x30/0x290
btt_make_request+0x11a/0x310 [nd_btt]
? blk_queue_enter+0xb7/0x290
? blk_queue_enter+0x30/0x290
generic_make_request+0x118/0x3b0
As a minimal fix, disable error clearing when the BTT is enabled for the
namespace. For the final fix a larger rework of the poison list locking
is needed.
Note that this is not a problem in the blk case since that path never
calls nvdimm_clear_poison().
Cc: <stable@vger.kernel.org>
Fixes: 82bf1037f2 ("libnvdimm: check and clear poison before writing to pmem")
Cc: Dave Jiang <dave.jiang@intel.com>
[jeff: dynamically disable error clearing in the btt case]
Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Holding the reconfig_mutex over a potential userspace fault sets up a
lockdep dependency chain between filesystem-DAX and the libnvdimm ioctl
path. Move the user access outside of the lock.
[ INFO: possible circular locking dependency detected ]
4.11.0-rc3+ #13 Tainted: G W O
-------------------------------------------------------
fallocate/16656 is trying to acquire lock:
(&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffa00080b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm]
but task is already holding lock:
(jbd2_handle){++++..}, at: [<ffffffff813b4944>] start_this_handle+0x104/0x460
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (jbd2_handle){++++..}:
lock_acquire+0xbd/0x200
start_this_handle+0x16a/0x460
jbd2__journal_start+0xe9/0x2d0
__ext4_journal_start_sb+0x89/0x1c0
ext4_dirty_inode+0x32/0x70
__mark_inode_dirty+0x235/0x670
generic_update_time+0x87/0xd0
touch_atime+0xa9/0xd0
ext4_file_mmap+0x90/0xb0
mmap_region+0x370/0x5b0
do_mmap+0x415/0x4f0
vm_mmap_pgoff+0xd7/0x120
SyS_mmap_pgoff+0x1c5/0x290
SyS_mmap+0x22/0x30
entry_SYSCALL_64_fastpath+0x1f/0xc2
-> #1 (&mm->mmap_sem){++++++}:
lock_acquire+0xbd/0x200
__might_fault+0x70/0xa0
__nd_ioctl+0x683/0x720 [libnvdimm]
nvdimm_ioctl+0x8b/0xe0 [libnvdimm]
do_vfs_ioctl+0xa8/0x740
SyS_ioctl+0x79/0x90
do_syscall_64+0x6c/0x200
return_from_SYSCALL_64+0x0/0x7a
-> #0 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
__lock_acquire+0x16b6/0x1730
lock_acquire+0xbd/0x200
__mutex_lock+0x88/0x9b0
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_forget_poison+0x25/0x50 [libnvdimm]
nvdimm_clear_poison+0x106/0x140 [libnvdimm]
pmem_do_bvec+0x1c2/0x2b0 [nd_pmem]
pmem_make_request+0xf9/0x270 [nd_pmem]
generic_make_request+0x118/0x3b0
submit_bio+0x75/0x150
Cc: <stable@vger.kernel.org>
Fixes: 62232e45f4 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices")
Cc: Dave Jiang <dave.jiang@intel.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Commit a1f3e4d6a0 "libnvdimm, region: update nd_region_available_dpa()
for multi-pmem support" reworked blk dpa (DIMM Physical Address)
accounting to comprehend multiple pmem namespace allocations aliasing
with a given blk-dpa range.
The following call trace is a result of failing to account for allocated
blk capacity.
WARNING: CPU: 1 PID: 2433 at tools/testing/nvdimm/../../../drivers/nvdimm/names
4 size_store+0x6f3/0x930 [libnvdimm]
nd_region region5: allocation underrun: 0x0 of 0x1000000 bytes
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5f/0x80
size_store+0x6f3/0x930 [libnvdimm]
dev_attr_store+0x18/0x30
If a given blk-dpa allocation does not alias with any pmem ranges then
the full allocation should be accounted as busy space, not the size of
the current pmem contribution to the region.
The thinkos that led to this confusion was not realizing that the struct
resource management is already guaranteeing no collisions between pmem
allocations and blk allocations on the same dimm. Also, we do not try to
support blk allocations in aliased pmem holes.
This patch also fixes a case where the available blk goes negative.
Cc: <stable@vger.kernel.org>
Fixes: a1f3e4d6a0 ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support").
Reported-by: Dariusz Dokupil <dariusz.dokupil@intel.com>
Reported-by: Dave Jiang <dave.jiang@intel.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The interleave-set cookie is a sum that sanity checks the composition of
an interleave set has not changed from when the namespace was initially
created. The checksum is calculated by sorting the DIMMs by their
location in the interleave-set. The comparison for the sort must be
64-bit wide, not byte-by-byte as performed by memcmp() in the broken
case.
Fix the implementation to accept correct cookie values in addition to
the Linux "memcmp" order cookies, but only allow correct cookies to be
generated going forward. It does mean that namespaces created by
third-party-tooling, or created by newer kernels with this fix, will not
validate on older kernels. However, there are a couple mitigating
conditions:
1/ platforms with namespace-label capable NVDIMMs are not widely
available.
2/ interleave-sets with a single-dimm are by definition not affected
(nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.
The cookie stored in the namespace label will be fixed by any write the
namespace label, the most straightforward way to achieve this is to
write to the "alt_name" attribute of a namespace in sysfs.
Cc: <stable@vger.kernel.org>
Fixes: eaf961536e ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When vmemmap_populate() allocates space for the memmap it does so in 2MB
sized chunks. The libnvdimm-pfn driver incorrectly accounts for this
when the alignment of the device is set to 4K. When this happens we
trigger memory allocation failures in altmap_alloc_block_buf() and
trigger warnings of the form:
WARNING: CPU: 0 PID: 3376 at arch/x86/mm/init_64.c:656 arch_add_memory+0xe4/0xf0
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
arch_add_memory+0xe4/0xf0
devm_memremap_pages+0x29b/0x4e0
Fixes: 315c562536 ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Given that the naming of pmem devices changes from the pmemX form to the
pmemX.Y form when namespace id is greater than 0, arrange for namespaces
with id-0 to be exempt from deletion. Otherwise a simple reconfiguration
of an existing namespace to a new mode results in a name change of the
resulting block device:
# ndctl list --namespace=namespace1.0
{
"dev":"namespace1.0",
"mode":"raw",
"size":2147483648,
"uuid":"3dadf3dc-89b9-4b24-b20e-abc8a4707ce3",
"blockdev":"pmem1"
}
# ndctl create-namespace --reconfig=namespace1.0 --mode=memory --force
{
"dev":"namespace1.1",
"mode":"memory",
"size":2111832064,
"uuid":"7b4a6341-7318-4219-a02c-fb57c0bbf613",
"blockdev":"pmem1.1"
}
This change does require tooling changes to explicitly look for
namespaceX.0 if the seed has already advanced to another namespace.
Cc: <stable@vger.kernel.org>
Fixes: 98a29c39dc ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Declare device_type structure as const as it is only stored in the
type field of a device structure. This field is of type const, so add
const to declaration of device_type structure.
File size before:
text data bss dec hex filename
19278 3199 16 22493 57dd nvdimm/namespace_devs.o
File size after:
text data bss dec hex filename
19929 3160 16 23105 5a41 nvdimm/namespace_devs.o
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Commit 98a29c39dc ("libnvdimm, namespace: allow creation of multiple
pmem-namespaces per region") added support for establishing additional
pmem namespace beyond the seed device, similar to blk namespaces.
However, it neglected to delete the namespace when the size is set to
zero.
Fixes: 98a29c39dc ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The read_pmem() function uses memcpy_mcsafe() on x86 where an EFAULT
error code indicates a failed read. Block I/O should use EIO to
indicate failure. Other pmem code paths (like bad blocks) already use
EIO so let's be consistent.
This fixes compatibility with consumers like btrfs that try to parse the
specific error code rather than treat all errors the same.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Dynamic label support: To date namespace label support has been
limited to disambiguating cases where PMEM (direct load/store) and BLK
(mmio aperture) accessed-capacity alias on the same DIMM. Since 4.9 added
support for multiple namespaces per PMEM-region there is value to
support namespace labels even in the non-aliasing case. The presence of
a valid namespace index block force-enables label support when the
kernel would otherwise rely on region boundaries, and permits the region
to be sub-divided.
* Handle media errors in namespace metadata: Complement the error
handling for media errors in namespace data areas with support for
clearing errors on writes, and downgrading potential machine-check
exceptions to simple i/o errors on read.
* Device-DAX region attributes: Add 'align', 'id', and 'size' as
attributes for device-dax regions. In particular this enables userspace
tooling to generically size memory mapping and i/o operations. Prevent
userspace from growing assumptions / dependencies about the parent
device topology for a dax region. A libnvdimm namespace may not always
be the parent device of a dax region.
* Various cleanups and small fixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYVxRcAAoJEB7SkWpmfYgCSnoQAKNs7IJRtGqKFCkMB5VDAW79
Lifi35Hqm+mfFaLMIyRoKJMnhXKQBsthfcHZIy5t63kDWEV/tJ+riZ+m2Ibz4GPE
AUn07jxge2q9v0RwMylqpZ/EHyaK/xD1z0+SdIwVF2VaEDoVdnmu2WEqjuir/lfM
t70B/YoWLg6CcHeTzaV5vdxlRKyG4pzOV3UzPlYLROr3wFJBHD88j/4zcGy3F1ue
DpzGThhtEQXZUZCJLp54E0ch7V2cz/DNUDYwsjbCZrdYYmUJFqjZQxNyftn28aEJ
HI/BERXLhm2HF2qRqC+0C1lJmUylYk4wXUGco+b/u1GYN59eFe/JT3dJgxEHFKL2
nVOUmR8XsRE6gkFWQGcFesRjjI1Rqz2Wgql7oqkSL9grGOwY4rvKX0LhJxgvfuZ0
Itb5h0dMFUriAP0DaaJFMYfANIOx+mqr4RxDMwP5ZADku6+Ga0LXxvcBt06K/mYO
/zLo27MhsuuFy/odXm1A21OjFiH2pM3pSCaytKvDjy7DwzgE7ZzXmixXQ7j8YVp6
OtLYzMjIt8nt4xh0hmav5tb0r9l6mgNqlifacMC9lEM/7SDAiBXXBLuQngJN/j0s
iXllBk0pYVWibf3VcD6oY+qKdmBWvXxPjgj1lHE6j9A7Gw2jwIt/W2NWzV94nKmC
f6d+faHiRokqyVhdgfx3
=YWfP
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"The libnvdimm pull request is relatively small this time around due to
some development topics being deferred to 4.11.
As for this pull request the bulk of it has been in -next for several
releases leading to one late fix being added (commit 868f036fee
("libnvdimm: fix mishandled nvdimm_clear_poison() return value")). It
has received a build success notification from the 0day-kbuild robot
and passes the latest libnvdimm unit tests.
Summary:
- Dynamic label support: To date namespace label support has been
limited to disambiguating cases where PMEM (direct load/store) and
BLK (mmio aperture) accessed-capacity alias on the same DIMM. Since
4.9 added support for multiple namespaces per PMEM-region there is
value to support namespace labels even in the non-aliasing case.
The presence of a valid namespace index block force-enables label
support when the kernel would otherwise rely on region boundaries,
and permits the region to be sub-divided.
- Handle media errors in namespace metadata: Complement the error
handling for media errors in namespace data areas with support for
clearing errors on writes, and downgrading potential machine-check
exceptions to simple i/o errors on read.
- Device-DAX region attributes: Add 'align', 'id', and 'size' as
attributes for device-dax regions. In particular this enables
userspace tooling to generically size memory mapping and i/o
operations. Prevent userspace from growing assumptions /
dependencies about the parent device topology for a dax region. A
libnvdimm namespace may not always be the parent device of a dax
region.
- Various cleanups and small fixes"
* tag 'libnvdimm-for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: add region 'id', 'size', and 'align' attributes
libnvdimm: fix mishandled nvdimm_clear_poison() return value
libnvdimm: replace mutex_is_locked() warnings with lockdep_assert_held
libnvdimm, pfn: fix align attribute
libnvdimm, e820: use module_platform_driver
libnvdimm, namespace: use octal for permissions
libnvdimm, namespace: avoid multiple sector calculations
libnvdimm: remove else after return in nsio_rw_bytes()
libnvdimm, namespace: fix the type of name variable
libnvdimm: use consistent naming for request_mem_region()
nvdimm: use the right length of "pmem"
libnvdimm: check and clear poison before writing to pmem
tools/testing/nvdimm: dynamic label support
libnvdimm: allow a platform to force enable label support
libnvdimm: use generic iostat interfaces
Colin, via static analysis, reports that the length could be negative
from nvdimm_clear_poison() in the error case. There was a similar
problem with commit 0a3f27b9a6 "libnvdimm, namespace: avoid multiple
sector calculations" that I noticed when merging the for-4.10/libnvdimm
topic branch into libnvdimm-for-next, but I missed this one. Fix both of
them to the following procedure:
* if we clear a block's worth of media, clear that many blocks in
badblocks
* if we clear less than the requested size of the transfer return an
error
* always invalidate cache after any non-error / non-zero
nvdimm_clear_poison result
Fixes: 82bf1037f2 ("libnvdimm: check and clear poison before writing to pmem")
Fixes: 0a3f27b9a6 ("libnvdimm, namespace: avoid multiple sector calculations")
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Dave Jiang <dave.jiang@intel.com>
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
For warnings that should only ever trigger during development and
testing replace WARN statements with lockdep_assert_held. The lockdep
pattern is prevalent, and these paths are are well covered by libnvdimm
unit tests.
Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:
- Further work on PDF output, which remains a bit of a pain but should be
more solid now.
- Five more DocBook template files converted to Sphinx. Only 27 to go...
Lots of plain-text files have also been converted and integrated.
- Images in binary formats have been replaced with more source-friendly
versions.
- Various bits of organizational work, including the renaming of various
files discussed at the kernel summit.
- New documentation for the device_link mechanism.
...and, of course, lots of typo fixes and small updates.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYTbl7AAoJEI3ONVYwIuV63NIP/REwzThnGWFJMRSuq8Ieq2r9
sFSQsaGTGlhyKiDoEooo+SO/Za3uTonjK+e7WZg8mhdiEdamta5aociU/71C1Yy/
T9ur0FhcGblrvZ1NidSDvCLwuECZOMMei7mgLZ9a+KCpc4ANqqTVZSUm1blKcqhF
XelhVXxBa0ar35l/pVzyCxkdNXRWXv+MJZE8hp5XAdTdr11DS7UY9zrZdH31axtf
BZlbYJrvB8WPydU6myTjRpirA17Hu7uU64MsL3bNIEiRQ+nVghEzQC8uxeUCvfVx
r0H5AgGGQeir+e8GEv2T20SPZ+dumXs+y/HehKNb3jS3gV0mo+pKPeUhwLIxr+Zh
QY64gf+jYf5ISHwAJRnU0Ima72ehObzSbx9Dko10nhq2OvbR5f83gjz9t9jKYFU7
RDowICA8lwqyRbHRoVfyoW8CpVhWFpMFu3yNeJMckeTish3m7ANqzaWslbsqIP5G
zxgFMIrVVSbeae+sUeygtEJAnWI09aZ4tuaUXYtGWwu6ikC/3aV6DryP4bthG2LF
A19uV4nMrLuuh8g2wiTHHjMfjYRwvSn+f9yaolwJhwyNDXQzRPy+ZJ3W/6olOkXC
bAxTmVRCW5GA/fmSrfXmW1KbnxlWfP2C62hzZQ09UHxzTHdR97oFLDQdZhKo1uwf
pmSJR0hVeRUmA4uw6+Su
=A0EV
-----END PGP SIGNATURE-----
Merge tag 'docs-4.10' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
"These are the documentation changes for 4.10.
It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:
- Further work on PDF output, which remains a bit of a pain but
should be more solid now.
- Five more DocBook template files converted to Sphinx. Only 27 to
go... Lots of plain-text files have also been converted and
integrated.
- Images in binary formats have been replaced with more
source-friendly versions.
- Various bits of organizational work, including the renaming of
various files discussed at the kernel summit.
- New documentation for the device_link mechanism.
... and, of course, lots of typo fixes and small updates"
* tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
dma-buf: Extract dma-buf.rst
Update Documentation/00-INDEX
docs: 00-INDEX: document directories/files with no docs
docs: 00-INDEX: remove non-existing entries
docs: 00-INDEX: add missing entries for documentation files/dirs
docs: 00-INDEX: consolidate process/ and admin-guide/ description
scripts: add a script to check if Documentation/00-INDEX is sane
Docs: change sh -> awk in REPORTING-BUGS
Documentation/core-api/device_link: Add initial documentation
core-api: remove an unexpected unident
ppc/idle: Add documentation for powersave=off
Doc: Correct typo, "Introdution" => "Introduction"
Documentation/atomic_ops.txt: convert to ReST markup
Documentation/local_ops.txt: convert to ReST markup
Documentation/assoc_array.txt: convert to ReST markup
docs-rst: parse-headers.pl: cleanup the documentation
docs-rst: fix media cleandocs target
docs-rst: media/Makefile: reorganize the rules
docs-rst: media: build SVG from graphviz files
docs-rst: replace bayer.png by a SVG image
...
Fix the format specifier so that the attribute can be parsed correctly.
Currently it returns decimal 1000 for a 4096-byte alignment.
Cc: <stable@vger.kernel.org>
Reported-by: Dave Jiang <dave.jiang@intel.com>
Fixes: 315c562536 ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Given ambiguities in the ACPI 6.1 definition of the "Output (Size)"
field of the ARS (Address Range Scrub) Status command, a firmware
implementation may in practice return 0, 4, or 8 to indicate that there
is no output payload to process.
The specification states "Size of Output Buffer in bytes, including this
field.". However, 'Output Buffer' is also the name of the entire
payload, and earlier in the specification it states "Max Query ARS
Status Output Buffer Size: Maximum size of buffer (including the Status
and Extended Status fields)".
Without this fix if the BIOS happens to return 0 it causes memory
corruption as evidenced by this result from the acpi_nfit_ctl() unit
test.
ars_status00000000: 00020000 00000000 ........
BUG: stack guard page was hit at ffffc90001750000 (stack is ffffc9000174c000..ffffc9000174ffff)
kernel stack overflow (page fault): 0000 [#1] SMP DEBUG_PAGEALLOC
task: ffff8803332d2ec0 task.stack: ffffc9000174c000
RIP: 0010:[<ffffffff814cfe72>] [<ffffffff814cfe72>] __memcpy+0x12/0x20
RSP: 0018:ffffc9000174f9a8 EFLAGS: 00010246
RAX: ffffc9000174fab8 RBX: 0000000000000000 RCX: 000000001fffff56
RDX: 0000000000000000 RSI: ffff8803231f5a08 RDI: ffffc90001750000
RBP: ffffc9000174fa88 R08: ffffc9000174fab0 R09: ffff8803231f54b8
R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000003 R15: ffff8803231f54a0
FS: 00007f3a611af640(0000) GS:ffff88033ed00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc90001750000 CR3: 0000000325b20000 CR4: 00000000000406e0
Stack:
ffffffffa00bc60d 0000000000000008 ffffc90000000001 ffffc9000174faac
0000000000000292 ffffffffa00c24e4 ffffffffa00c2914 0000000000000000
0000000000000000 ffffffff00000003 ffff880331ae8ad0 0000000800000246
Call Trace:
[<ffffffffa00bc60d>] ? acpi_nfit_ctl+0x49d/0x750 [nfit]
[<ffffffffa01f4fe0>] nfit_test_probe+0x670/0xb1b [nfit_test]
Cc: <stable@vger.kernel.org>
Fixes: 747ffe11b4 ("libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Use module_platform_driver for the e820 driver instead of open-coding it.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
According to commit f90774e1fd
("checkpatch: look for symbolic permissions and suggest octal instead")
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Use sector_t for cleared
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
else after return is not needed.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
[djbw: removed some now unnecessary newlines]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In create_namespace_blk(), the local variable "name" is defined as an
array of NSLABEL_NAME_LEN pointers:
char *name[NSLABEL_NAME_LEN];
This variable is then used in calls to memcpy() and kmemdup() as if it
were char[NSLABEL_NAME_LEN]. Remove the star in the variable definition
to makes it look right.
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Here is an example /proc/iomem listing for a system with 2 namespaces,
one in "sector" mode and one in "memory" mode:
1fc000000-2fbffffff : Persistent Memory (legacy)
1fc000000-2fbffffff : namespace1.0
340000000-34fffffff : Persistent Memory
340000000-34fffffff : btt0.1
Here is the corresponding ndctl listing:
# ndctl list
[
{
"dev":"namespace1.0",
"mode":"memory",
"size":4294967296,
"blockdev":"pmem1"
},
{
"dev":"namespace0.0",
"mode":"sector",
"size":267091968,
"uuid":"f7594f86-badb-4592-875f-ded577da2eaf",
"sector_size":4096,
"blockdev":"pmem0s"
}
]
Notice that the ndctl listing is purely in terms of namespace devices,
while the iomem listing leaks the internal "btt0.1" implementation
detail. Given that ndctl requires the namespace device name to change
the mode, for example:
# ndctl create-namespace --reconfig=namespace0.0 --mode=raw --force
...use the namespace name in the iomem listing to keep the claiming
device name consistent across different mode settings.
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYHmoCAAoJEHm+PkMAQRiG7RMIAI2i7Y5hpL5yCxK5AFaL4u/G
KxXfp1B1UanUTgjOmd7zGqtDYcFX9t7GTTUFixQ7/9Opr4PD9qbnatoDGSc3xjbT
msDgA1B78F1/Q3kHWfeGq32MihQ4mj5NwUCo+igUcUvvWG7mHgzErj/Nh5RoobQX
p/izdpTbrw3GX6xXB8olbG7XWHaVye/+TT3q6+gmgm8I/QEujcLeGoycE0zlhPN8
FG/JX76At/+ZM2Py7Oxo3k+oKL9CHrtOQYDp/wN0uslV5eYvvkZz0/M1HMOGZt+c
gZU5jzM17K7C4Nzo06WAuBU9wUBGc25m+cPicLlOmljnzfU+f50SKaDjZq3p7QI=
=2KUF
-----END PGP SIGNATURE-----
Merge tag 'v4.9-rc4' into sound
Bring in -rc4 patches so I can successfully merge the sound doc changes.
In order to test that the name of a resource begins with "pmem", call
strncmp() with 4 as length instead of 3 to match the whole prefix.
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We need to clear any poison when we are writing to pmem. The granularity
will be sector size. If it's less then we can't do anything about it
barring corruption.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
[djbw: fixup 0-length write request to succeed]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A bugfix just tried to address a randconfig build problem and introduced
a variant of the same problem: with CONFIG_LIBNVDIMM=y and
CONFIG_NVDIMM_DAX=m, the nvdimm module now fails to link:
drivers/nvdimm/built-in.o: In function `to_nd_device_type':
bus.c:(.text+0x1b5d): undefined reference to `is_nd_dax'
drivers/nvdimm/built-in.o: In function `nd_region_notify_driver_action.constprop.2':
region_devs.c:(.text+0x6b6c): undefined reference to `is_nd_dax'
region_devs.c:(.text+0x6b8c): undefined reference to `to_nd_dax'
drivers/nvdimm/built-in.o: In function `nd_region_probe':
region.c:(.text+0x70f3): undefined reference to `nd_dax_create'
drivers/nvdimm/built-in.o: In function `mode_show':
namespace_devs.c:(.text+0xa196): undefined reference to `is_nd_dax'
drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe':
(.text+0xa55f): undefined reference to `is_nd_dax'
drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe':
(.text+0xa56e): undefined reference to `to_nd_dax'
This reverts the earlier fix, making NVDIMM_DAX a 'bool' option again
as it should be (it gets linked into the libnvdimm module). To fix
the original problem, I'm adding a dependency on LIBNVDIMM to
DEV_DAX_PMEM, which ensures we can't have that one built-in if the
rest is a module.
Fixes: 4e65e9381c ("/dev/dax: fix Kconfig dependency build breakage")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The previous patch renamed several files that are cross-referenced
along the Kernel documentation. Adjust the links to point to
the right places.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
ACPI Clear Uncorrectable Error DSM function may fail or may be
unsupported on a platform. pmem_clear_poison() returns without clearing
badblocks in such cases. This failure is detected at the next read
(-EIO).
This behavior can lead to an issue when user keeps writing but does not
read immediately. For instance, flight recorder file may be only read
when it is necessary for troubleshooting.
Change pmem_do_bvec() and pmem_clear_poison() to return -EIO so that
filesystem can log an error message on a write error.
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If the kcalloc() fails then "devs" can be NULL and we dereference it
checking "devs[i]".
Fixes: 1b40e09a12 ('libnvdimm: blk labels and namespace instantiation')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Platforms like QEMU-KVM implement an NFIT table and label DSMs.
However, since that environment does not define an aliased
configuration, the labels are currently ignored and the kernel registers
a single full-sized pmem-namespace per region. Now that the kernel
supports sub-divisions of pmem regions the labels have a purpose.
Arrange for the labels to be honored when we find an existing / valid
namespace index block.
Cc: <qemu-devel@nongnu.org>
Cc: Haozhong Zhang <haozhong.zhang@intel.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
nd_iostat_start() and nd_iostat_end() implement the same functionality
that generic_start_io_acct() and generic_end_io_acct() already provide.
Change nd_iostat_start() and nd_iostat_end() to call the generic iostat
interfaces. There is no change in the nd interfaces.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The function dax_pmem_probe() in drivers/dax/pmem.c is compiled under the
CONFIG_DEV_DAX_PMEM tri-state config option. This config option currently
only depends on CONFIG_NVDIMM_DAX, a bool, which means that the following
configuration is possible:
CONFIG_LIBNVDIMM=m
...
CONFIG_NVDIMM_DAX=y
CONFIG_DEV_DAX=y
CONFIG_DEV_DAX_PMEM=y
With this config LIBNVDIMM is compiled as a module with NVDIMM_DAX=y just
meaning that we will compile drivers/nvdimm/dax_devs.c into that module.
However, dax_pmem_probe() depends on several symbols defined in
drivers/nvdimm/dax_devs.c, which results in the following build errors:
drivers/built-in.o: In function `dax_pmem_probe':
linux/drivers/dax/pmem.c:70: undefined reference to `to_nd_dax'
linux/drivers/dax/pmem.c:74: undefined reference to
`nvdimm_namespace_common_probe'
linux/drivers/dax/pmem.c:80: undefined reference to `devm_nsio_enable'
linux/drivers/dax/pmem.c:81: undefined reference to `nvdimm_setup_pfn'
linux/drivers/dax/pmem.c:84: undefined reference to `devm_nsio_disable'
linux/drivers/dax/pmem.c:122: undefined reference to `to_nd_region'
drivers/built-in.o: In function `dax_pmem_init':
linux/drivers/dax/pmem.c:147: undefined reference to `__nd_driver_register'
Fix this by making NVDIMM_DAX a tristate. DEV_DAX_PMEM depends on
NVDIMM_DAX which depends on LIBNVDIMM. Since they are all now tristates,
if LIBNVDIMM is built as a kernel module DEV_DAX_PMEM will be as well.
This prevents dax_devs.c from being built as a built-in while its
dependencies are in the libnvdimm.ko module.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Similar to BLK regions, publish new seed namespace devices to allow
unused PMEM region capacity to be consumed by additional namespaces.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that the rest of the infrastructure has been converted to handle
multi-pmem configurations, lift the artificial barrier at scan time.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Short-circuit doomed-to-fail label validation attempts by skipping
labels that are outside the given region. For example a DIMM that has
multiple PMEM regions will waste time attempting to create namespaces
only to find that the interleave-set-cookie does not validate, e.g.:
nd_region region6: invalid cookie in label: 73e608dc-47b9-4b2a-b5c7-2d55a32e0c2
Similar to how we skip BLK labels when performing PMEM validation we can
skip out-of-range labels early.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that we have nd_region_available_dpa() able to handle the presence
of multiple PMEM allocations in aliased PMEM regions, reuse that same
infrastructure to track allocations from free space. In particular
handle allocating from an aliased PMEM region in the case where there
are dis-contiguous holes. The allocation for BLK and PMEM are
documented in the space_valid() helper:
BLK-space is valid as long as it does not precede a PMEM
allocation in a given region. PMEM-space must be contiguous
and adjacent to an existing existing allocation (if one
exists).
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Instead of assuming that there will only ever be one allocated range at
the start of the region, account for additional namespaces that might
start at an offset from the region base.
After this change pmem namespaces now have a reason to carry an array of
resources similar to blk. Unifying the resource tracking infrastructure
in nd_namespace_common is a future cleanup candidate.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
pmem devices are currently named /dev/pmem<region-index>. Preserve the
naming of the 0th device, but add a ".<namespace-index>" for other
devices.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The free dpa (dimm-physical-address) space calculation reports how much
free space is available with consideration for aliased BLK + PMEM
regions. Recall that BLK capacity is allocated from high addresses and
PMEM is allocated from low addresses in their respective regions.
nd_region_available_dpa() accounts for the fact that the largest
encroachment (lowest starting address) into PMEM capacity by a BLK
allocation limits the available capacity to that point, regardless if
there is BLK allocation hole at a higher address. Similarly, for the
multi-pmem case we need to track the largest encroachment (highest
ending address) of a PMEM allocation in BLK capacity regardless of
whether there is an allocation hole that a BLK allocation could fill at
a lower address.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add more determinism to initial namespace device-name assignments by
sorting the namespaces by starting dpa.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>