.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

 pin_user_pages()
 pin_user_pages_fast()
 pin_user_pages_remote()
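
For reference, the first two had roughly the following prototypes at the time
of writing. This is only a sketch: the authoritative declarations live in
include/linux/mm.h, and the argument lists have changed across kernel
versions. ::

    long pin_user_pages(unsigned long start, unsigned long nr_pages,
                        unsigned int gup_flags, struct page **pages,
                        struct vm_area_struct **vmas);

    int pin_user_pages_fast(unsigned long start, int nr_pages,
                            unsigned int gup_flags, struct page **pages);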

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing each page's
refcount by a special value: GUP_PIN_COUNTING_BIAS.

For compound pages, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
an exact form of pin counting is achieved, by using the 2nd struct page
in the compound page. A new struct page field, compound_pincount, has
been added in order to support this.

This approach for compound pages avoids the counting upper limit problems that
are discussed below. Those limitations would have been aggravated severely by
huge pages, because each tail page adds a refcount to the head page. And in
fact, testing revealed that, without a separate compound_pincount field,
page overflows were seen in some huge page stress tests.

This also means that huge pages and compound pages do not suffer
from the false positives problem that is mentioned below. ::

 Function
 --------
 pin_user_pages          FOLL_PIN is always set internally by this function.
 pin_user_pages_fast     FOLL_PIN is always set internally by this function.
 pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct pages* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1. ::

 Function
 --------
 get_user_pages           FOLL_GET is sometimes set internally by this function.
 get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
 get_user_pages_remote    FOLL_GET is sometimes set internally by this function.

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of", means that,
  rather than dividing page->_refcount into bit fields, we simply add a medium-
  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
  on it 1024 times, then it will appear to have a single dma-pinned count.
  And again, that's acceptable. A minimal sketch of this bias scheme appears
  just after this list.

  This also leads to limitations: there are only 31-10==21 bits available for a
  counter that increments 10 bits at a time.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_page() and related, must be used.
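
Here is a minimal, illustrative C sketch of that bias scheme. It is *not* the
kernel's actual implementation (see mm/gup.c for that), and the sketch_*()
helper names are invented for this example; page_ref_add(), page_ref_sub(),
and page_ref_count() are real helpers from include/linux/page_ref.h, and
GUP_PIN_COUNTING_BIAS is defined in include/linux/mm.h. ::

    #include <linux/mm.h>          /* GUP_PIN_COUNTING_BIAS (1024) */
    #include <linux/page_ref.h>    /* page_ref_add() and friends */

    /* Add one logical pin: bump _refcount by the bias, not by +1: */
    static void sketch_pin_page(struct page *page)
    {
            page_ref_add(page, GUP_PIN_COUNTING_BIAS);
    }

    /* Remove one logical pin: */
    static void sketch_unpin_page(struct page *page)
    {
            page_ref_sub(page, GUP_PIN_COUNTING_BIAS);
    }

    /*
     * Fuzzy query: at least one bias worth of refcount means "maybe
     * dma-pinned". False positives are possible, and acceptable.
     */
    static bool sketch_page_maybe_dma_pinned(struct page *page)
    {
            return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
    }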

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with page_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.
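
As an illustration, a short-term DIO-style pin/unpin cycle might look like the
following sketch. The user_addr variable and the error handling are
hypothetical; pin_user_pages_fast() and unpin_user_pages_dirty_lock() are the
real APIs. ::

    struct page *pages[16];
    int rc;

    /* Pin the user buffer, for a short-lived transfer into it: */
    rc = pin_user_pages_fast(user_addr, 16, FOLL_WRITE, pages);
    if (rc <= 0)
            return rc;

    /* ... perform the I/O, writing into these pages ... */

    /* Unpin; mark the pages dirty because the transfer wrote to them: */
    unpin_user_pages_dirty_lock(pages, rc, true);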

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.
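
For example, a driver taking a long-term pin might do something like the
following sketch. The user_addr and nr_pages variables are hypothetical, the
allocation of the pages array is elided, and the exact pin_user_pages()
signature varies across kernel versions (check include/linux/mm.h). ::

    struct page **pages;    /* allocated with room for nr_pages entries */
    long rc;

    /* pin_user_pages() requires mmap_lock to be held: */
    mmap_read_lock(current->mm);
    rc = pin_user_pages(user_addr, nr_pages,
                        FOLL_WRITE | FOLL_LONGTERM, pages, NULL);
    mmap_read_unlock(current->mm);
    if (rc <= 0)
            return rc;

    /* ... program the device to DMA to/from these pages, possibly for days ... */

    /* Much later, at teardown time: */
    unpin_user_pages_dirty_lock(pages, rc, true);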

CASE 3: MMU notifier registration, with or without page faulting hardware
--------------------------------------------------------------------------

Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as for example explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.
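
A bare-bones sketch of the first scheme, using the interval notifier form of
the API: the callback body and the unpin helper are hypothetical, but the
mmu_interval_notifier_ops structure and its invalidate() signature are the
real interface (see include/linux/mmu_notifier.h). ::

    #include <linux/mmu_notifier.h>

    static bool my_invalidate(struct mmu_interval_notifier *mni,
                              const struct mmu_notifier_range *range,
                              unsigned long cur_seq)
    {
            /* Stop the device from using [range->start, range->end) ... */

            /* ...then release the GUP references for that range: */
            my_unpin_affected_pages(mni, range);    /* hypothetical helper */
            return true;
    }

    static const struct mmu_interval_notifier_ops my_ops = {
            .invalidate = my_invalidate,
    };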

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------

If only struct page data (as opposed to the actual memory contents that a page
is tracking) is affected, then normal GUP calls are sufficient, and neither flag
needs to be set.

CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------

Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()
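
Rendered as a C sketch, the correct pattern might look like this. The addr,
data, and len variables are hypothetical, and len is assumed to fit within one
page; kmap()/kunmap() are used so the sketch works for highmem pages too. ::

    struct page *page;
    void *vaddr;
    int rc;

    rc = pin_user_pages_fast(addr, 1, FOLL_WRITE, &page);
    if (rc != 1)
            return rc;

    /* Write to the data within the page: */
    vaddr = kmap(page);
    memcpy(vaddr, data, len);
    kunmap(page);

    /* Unpin, marking the page dirty because its contents changed: */
    unpin_user_pages_dirty_lock(&page, 1, true);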

page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

 static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.
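
To give a feel for the intended use, writeback-style code might eventually do
something along these lines. This is purely illustrative; the actual policy
for handling pinned pages is exactly the unresolved TODO mentioned above. ::

    if (page_maybe_dma_pinned(page)) {
            /*
             * The page may be under DMA, so do not write-protect it
             * for writeback; defer, skip, or bounce instead
             * (policy still to be decided).
             */
    }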

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

 tools/testing/selftests/vm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::

    /proc/vmstat/nr_foll_pin_acquired
    /proc/vmstat/nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

      pin_user_pages(huge_page);
      for (each page in huge_page)
          unpin_user_page(page);

  ...the following is expected::

      nr_foll_pin_released == nr_foll_pin_acquired

  (...unless it was already out of balance due to a long-term RDMA pin being in
  place.)

Other diagnostics
=================

dump_page() has been enhanced slightly, to handle these new counting
fields, and to better report on compound pages in general. Specifically,
for compound pages, the exact (compound_pincount) pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019