License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file-by-file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was
not immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, the file was
considered to have no license information in it, and the top level
COPYING file license was applied.
For non-*/uapi/* files, that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was tagged "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL-family license was found in the file, or if it had no licensing
in it (per the prior point). Results summary:
SPDX license identifier                              # files
----------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                         270
GPL-2.0+ WITH Linux-syscall-note                        169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
LGPL-2.1+ WITH Linux-syscall-note                        15
GPL-1.0+ WITH Linux-syscall-note                         14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
LGPL-2.0+ WITH Linux-syscall-note                         4
LGPL-2.1 WITH Linux-syscall-note                          3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, Kate, Philippe and Thomas logged over 70 hours of manual review
on the spreadsheet to determine the SPDX license identifiers to apply to
the source files, with confirmation, in some cases, by lawyers working
with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally, Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 files patched in the initial version
of this series, with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally Greg ran the script using the .csv files to
generate the patches.
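For reference, the tag takes one of two comment forms depending on the
file type; per kernel convention, headers get a C block comment (as seen
at the top of the file below, since headers may be included from
assembly), while .c source files get a C++-style line comment:

        /* SPDX-License-Identifier: GPL-2.0 */    (first line of a .h file)
        // SPDX-License-Identifier: GPL-2.0       (first line of a .c file)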
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_HUGETLB_H
#define _LINUX_HUGETLB_H

#include <linux/mm_types.h>
#include <linux/mmdebug.h>
#include <linux/fs.h>
#include <linux/hugetlb_inline.h>
#include <linux/cgroup.h>
mm, hugetlb: unify region structure handling
Currently, to track reserved and allocated regions, we use two different
ways, depending on the mapping. For MAP_SHARED, we use the
address_mapping's private_list, while for MAP_PRIVATE, we use a
resv_map.
Now, we are preparing to change the coarse grained lock which protects a
region structure to a fine grained lock, and this difference hinders it.
So, before changing it, unify region structure handling, consistently
using a resv_map regardless of the kind of mapping.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
#include <linux/list.h>
#include <linux/kref.h>
#include <linux/pgtable.h>
#include <linux/gfp.h>
userfaultfd: add UFFDIO_CONTINUE ioctl
This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It
might change the contents of the page using its second non-UFFD mapping,
or not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".
Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.
It turns out hugetlb_mcopy_atomic_pte() already does something very close to what
we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
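For illustration, a hedged userspace sketch (not part of this header; it
assumes a userfaultfd file descriptor already registered with
UFFDIO_REGISTER_MODE_MINOR, and uses the uffdio_continue UAPI this series
introduces) of how a minor fault gets resolved:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Minimal sketch: after making the page contents correct through the
 * second, non-UFFD mapping, tell the kernel to install the PTEs for
 * [addr, addr + len). mode = 0 also wakes the faulting thread(s). */
static int resolve_minor_fault(int uffd, unsigned long addr, unsigned long len)
{
        struct uffdio_continue cont = {
                .range = { .start = addr, .len = len },
                .mode  = 0,
        };

        if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == -1)
                return -1;      /* cont.mapped holds the bytes mapped so far */
        return 0;
}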
#include <linux/userfaultfd_k.h>

struct ctl_table;
struct user_struct;
struct mmu_gather;

#ifndef is_hugepd
typedef struct { unsigned long pd; } hugepd_t;
#define is_hugepd(hugepd) (0)
#define __hugepd(x) ((hugepd_t) { (x) })
#endif

#ifdef CONFIG_HUGETLB_PAGE

#include <linux/mempolicy.h>
#include <linux/shm.h>
#include <asm/tlbflush.h>

/*
 * For a HugeTLB page, there is more metadata to save in the struct page. But
 * the head struct page cannot meet our needs, so we have to abuse other tail
 * struct pages to store the metadata. In order to avoid conflicts caused by
 * subsequent use of more tail struct pages, we gather these discrete indexes
 * of tail struct pages here.
 */
enum {
        SUBPAGE_INDEX_SUBPOOL = 1,      /* reuse page->private */
#ifdef CONFIG_CGROUP_HUGETLB
        SUBPAGE_INDEX_CGROUP,           /* reuse page->private */
        SUBPAGE_INDEX_CGROUP_RSVD,      /* reuse page->private */
        __MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
#endif
        __NR_USED_SUBPAGE,
};
hugepages: fix use after free bug in "quota" handling
hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
general quota handling code, and they don't much resemble its behaviour.
Rather than being about maintaining limits on on-disk block usage by
particular users, they are instead about maintaining limits on in-memory
page usage (including anonymous MAP_PRIVATE copied-on-write pages)
associated with a particular hugetlbfs filesystem instance.
Worse, they work by having callbacks to the hugetlbfs filesystem code from
the low-level page handling code, in particular from free_huge_page().
This is a layering violation in itself, but more importantly, if the
kernel does a get_user_pages() on hugepages (which can happen from KVM
amongst others), then the free_huge_page() can be delayed until after the
associated inode has already been freed. If an unmount occurs at the
wrong time, even the hugetlbfs superblock where the "quota" limits are
stored may have been freed.
Andrew Barry proposed a patch to fix this by having hugepages store
pointers directly to the superblock, instead of storing a pointer to
their address_space and reaching the superblock from there, bumping the
reference count as appropriate to avoid it being freed.
Andrew Morton rejected that version, however, on the grounds that it made
the existing layering violation worse.
This is a reworked version of Andrew's patch, which removes the extra, and
some of the existing, layering violation. It works by introducing the
concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
finite logical pool of hugepages to allocate from. hugetlbfs now creates
a subpool for each filesystem instance with a page limit set, and a
pointer to the subpool gets added to each allocated hugepage, instead of
the address_space pointer used now. The subpool has its own lifetime and
is only freed once all pages in it _and_ all other references to it (i.e.
superblocks) are gone.
Subpools are optional - a NULL subpool pointer is taken by the code to
mean that no subpool limits are in effect.
Previous discussion of this bug can be found in: "Fix refcounting in hugetlbfs
quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
http://marc.info/?l=linux-mm&m=126928970510627&w=1
v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
alloc_huge_page() - since it already takes the vma, it is not necessary.
Signed-off-by: Andrew Barry <abarry@cray.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct hugepage_subpool {
        spinlock_t lock;
        long count;
hugetlbfs: add minimum size tracking fields to subpool structure
hugetlbfs allocates huge pages from the global pool as needed. Even if
the global pool contains a sufficient number of pages for the filesystem size
at mount time, those global pages could be grabbed for some other use. As
a result, filesystem huge page allocations may fail due to lack of pages.
Applications such as a database want to use huge pages for performance
reasons. hugetlbfs filesystem semantics with ownership and modes work
well to manage access to a pool of huge pages. However, the application
would like some reasonable assurance that allocations will not fail due to
a lack of huge pages. At application startup time, the application would
like to configure itself to use a specific number of huge pages. Before
starting, the application can check to make sure that enough huge pages
exist in the system global pools. However, there are no guarantees that
those pages will be available when needed by the application. What the
application wants is exclusive use of a subset of huge pages.
Add a new hugetlbfs mount option 'min_size=<value>' to indicate that the
specified number of pages will be available for use by the filesystem. At
mount time, this number of huge pages will be reserved for exclusive use
of the filesystem. If there is not a sufficient number of free pages, the
mount will fail. As pages are allocated to and freed from the
filesystem, the number of reserved pages is adjusted so that the specified
minimum is maintained.
This patch (of 4):
Add a field to the subpool structure to indicate the minimum number of
huge pages to always be used by this subpool. This minimum count includes
allocated pages as well as reserved pages. If the minimum number of pages
for the subpool have not been allocated, pages are reserved up to this
minimum. An additional field (rsv_hpages) is used to track the number of
pages reserved to meet this minimum size. The hstate pointer in the
subpool is convenient to have when reserving and unreserving the pages.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
        long max_hpages;        /* Maximum huge pages or -1 if no maximum. */
        long used_hpages;       /* Used count against maximum, includes */
                                /* both allocated and reserved pages. */
        struct hstate *hstate;
        long min_hpages;        /* Minimum huge pages or -1 if no minimum. */
        long rsv_hpages;        /* Pages reserved against global pool to */
                                /* satisfy minimum size. */
};
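Taken together, the two commits above give a subpool a hard cap
(max_hpages/used_hpages) and a guaranteed floor (min_hpages/rsv_hpages).
A simplified sketch of the accounting idea for taking one page (the real
logic is hugepage_subpool_get_pages() in mm/hugetlb.c, which works in page
deltas; the helper name here is hypothetical):

/* Hypothetical, simplified: returns how many pages the global pool must
 * supply for one new page, or -ENOMEM if the subpool cap is exceeded. */
static long subpool_take_page(struct hugepage_subpool *spool)
{
        long from_global = 1;

        spin_lock(&spool->lock);
        if (spool->max_hpages != -1) {          /* enforce the maximum */
                if (spool->used_hpages + 1 > spool->max_hpages) {
                        spin_unlock(&spool->lock);
                        return -ENOMEM;
                }
                spool->used_hpages++;
        }
        if (spool->min_hpages != -1 && spool->rsv_hpages > 0) {
                /* Covered by the pre-reserved minimum, so the global
                 * pool already holds this page for us. */
                spool->rsv_hpages--;
                from_global = 0;
        }
        spin_unlock(&spool->lock);
        return from_global;
}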
struct resv_map {
        struct kref refs;
        spinlock_t lock;
        struct list_head regions;
        long adds_in_progress;
        struct list_head region_cache;
        long region_cache_count;
#ifdef CONFIG_CGROUP_HUGETLB
        /*
         * On private mappings, the counter to uncharge reservations is stored
         * here. If these fields are 0, then either the mapping is shared, or
         * cgroup accounting is disabled for this resv_map.
         */
        struct page_counter *reservation_counter;
        unsigned long pages_per_hpage;
        struct cgroup_subsys_state *css;
#endif
};
hugetlb_cgroup: add accounting for shared mappings
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.
After a call to region_chg, we charge the appropriate hugetlb_cgroup, and
if successful, we pass on the hugetlb_cgroup info to a follow up
region_add call. When a file_region entry is added to the resv_map via
region_add, we put the pointer to that cgroup in
file_region->reservation_counter. If charging doesn't succeed, we report
the error to the caller, so that the kernel fails the reservation.
On region_del, which is when the hugetlb memory is unreserved, we also
uncharge the file_region->reservation_counter.
[akpm@linux-foundation.org: forward declare struct file_region]
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
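In outline, with hypothetical helper names standing in for the real
hugetlb_cgroup entry points in mm/hugetlb_cgroup.c, the flow described
above reads roughly:

/* Hypothetical sketch only; the helper names are illustrative. */
static long reserve_shared_region(struct resv_map *resv, long from, long to)
{
        struct hugetlb_cgroup *h_cg;
        long nr_pages = to - from;

        /* Charge first; if the cgroup is over its limit, fail the
         * reservation instead of over-committing. */
        if (charge_reservation(nr_pages, &h_cg))        /* hypothetical */
                return -ENOMEM;

        /* region_add() records h_cg in file_region->reservation_counter,
         * and region_del() later uses that pointer to uncharge. */
        return region_add_with_cgroup(resv, from, to, h_cg);    /* hypothetical */
}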
/*
 * Region tracking -- allows tracking of reservations and instantiated pages
 *                    across the pages in a mapping.
 *
 * The region data structures are embedded into a resv_map and protected
 * by a resv_map's lock. The set of regions within the resv_map represent
 * reservations for huge pages, or huge pages that have already been
 * instantiated within the map. The from and to elements are huge page
 * indices into the associated mapping. from indicates the starting index
 * of the region. to represents the first index past the end of the region.
 *
 * For example, a file region structure with from == 0 and to == 4 represents
 * four huge pages in a mapping. It is important to note that the to element
 * represents the first element past the end of the region. This is used in
 * arithmetic as 4(to) - 0(from) = 4 huge pages in the region.
 *
 * Interval notation of the form [from, to) will be used to indicate that
 * the endpoint from is inclusive and to is exclusive.
 */
struct file_region {
        struct list_head link;
        long from;
        long to;
#ifdef CONFIG_CGROUP_HUGETLB
        /*
         * On shared mappings, each reserved region appears as a struct
         * file_region in resv_map. These fields hold the info needed to
         * uncharge each reservation.
         */
        struct page_counter *reservation_counter;
        struct cgroup_subsys_state *css;
#endif
};
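Because regions are [from, to) in huge page indices, sizing a region is
plain subtraction. An illustrative helper (not part of this header; the
caller would hold resv->lock):

/* Count the huge pages covered by a resv_map's region list. */
static long count_region_pages(struct resv_map *resv)
{
        struct file_region *rg;
        long pages = 0;

        list_for_each_entry(rg, &resv->regions, link)
                pages += rg->to - rg->from;     /* e.g. [0, 4) covers 4 pages */

        return pages;
}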
extern struct resv_map *resv_map_alloc(void);
void resv_map_release(struct kref *ref);

extern spinlock_t hugetlb_lock;
extern int hugetlb_max_hstate __read_mostly;
#define for_each_hstate(h) \
        for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)

struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
                                                long min_hpages);
void hugepage_put_subpool(struct hugepage_subpool *spool);
void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
int hugetlb_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *);
int hugetlb_overcommit_handler(struct ctl_table *, int, void *, size_t *,
                loff_t *);
int hugetlb_treat_movable_handler(struct ctl_table *, int, void *, size_t *,
                loff_t *);
int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void *, size_t *,
                loff_t *);
hugetlb: derive huge pages nodes allowed from task mempolicy
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:
* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
is produced. This will cause the hugetlb subsystem to use
node_online_map as the "nodes_allowed". This preserves the
behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
a nodemask with the single preferred node will be produced.
"local" policy will NOT track any internode migrations of the
task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
will be used.
* Other than to inform the construction of the nodes_allowed node
mask, the actual mempolicy mode is ignored. That is, all modes
behave like interleave over the resulting nodes_allowed mask
with no "fallback".
See the updated documentation [next patch] for more information
about the implications of this patch.
Examples:
Starting with:
Node 0 HugePages_Total: 0
Node 1 HugePages_Total: 0
Node 2 HugePages_Total: 0
Node 3 HugePages_Total: 0
Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:
sysctl vm.nr_hugepages[_mempolicy]=32
yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 8
Node 3 HugePages_Total: 8
Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.
Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes. So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:
numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
This yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.
Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:
numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
yields:
Node 0 HugePages_Total: 4
Node 1 HugePages_Total: 4
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The 8 huge pages freed were balanced over nodes 0 and 1.
[rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
int move_hugetlb_page_tables(struct vm_area_struct *vma,
                             struct vm_area_struct *new_vma,
                             unsigned long old_addr, unsigned long new_addr,
                             unsigned long len);
int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
                         struct page **, struct vm_area_struct **,
                         unsigned long *, unsigned long *, long, unsigned int,
                         int *);
hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
After patch 2 in this series, a process that successfully calls mmap() for
a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
process calls fork(). At that point, the next write fault from the parent
could fail due to COW if the child still has a reference.
We only reserve pages for the parent but a copy must be made to avoid
leaking data from the parent to the child after fork(). Reserves could be
taken for both parent and child at fork time to guarantee faults but if
the mapping is large it is highly likely we will not have sufficient pages
for the reservation, and it is common to fork only to exec() immediately
after. A failure here would be very undesirable.
Note that the current behaviour of mainline with MAP_PRIVATE pages is
pretty bad. The following situation is allowed to occur today.
1. Process calls mmap(MAP_PRIVATE)
2. Process calls mlock() to fault all pages and makes sure it succeeds
3. Process forks()
4. Process writes to MAP_PRIVATE mapping while child still exists
5. If the COW fails at this point, the process gets SIGKILLed even though it
had taken care to ensure the pages existed
This patch improves the situation by guaranteeing the reliability of the
process that successfully calls mmap(). When the parent performs COW, it
will try to satisfy the allocation without using reserves. If that fails
the parent will steal the page leaving any children without a page.
Faults from the child after that point will result in failure. If the
child COW happens first, an attempt will be made to allocate the page
without reserves and the child will get SIGKILLed on failure.
To summarise the new behaviour:
1. If the original mapper performs COW on a private mapping with multiple
references, it will attempt to allocate a hugepage from the pool or
the buddy allocator without using the existing reserves. On fail, VMAs
mapping the same area are traversed and the page being COW'd is unmapped
where found. It will then steal the original page as the last mapper in
the normal way.
2. The VMAs the pages were unmapped from are flagged to note that pages
with data no longer exist. Future no-page faults on those VMAs will
terminate the process as otherwise it would appear that data was corrupted.
A warning is printed to the console that this situation occurred.
3. If the child performs COW first, it will attempt to satisfy the COW
from the pool if there are enough pages or via the buddy allocator if
overcommit is allowed and the buddy allocator can satisfy the request. If
it fails, the child will be killed.
If the pool is large enough, existing applications will not notice that
the reserves were a factor. Existing applications depending on
no-reserves being set are unlikely to exist as, for much of the history of
hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
point or failing the mmap().
[npiggin@suse.de: fix CONFIG_HUGETLB=n build]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
void unmap_hugepage_range(struct vm_area_struct *,
                          unsigned long, unsigned long, struct page *);
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily with children some of whom fault and
others which destroy the mapping then it is possible for page tables to
get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log. The final teardown will trigger a
BUG_ON in mm/filemap.c.
This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37. It was probably introduced
in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
look like this;
[ ..........] Lots of bad pmd messages followed by this
[ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[ 127.186778] ------------[ cut here ]------------
[ 127.186781] kernel BUG at mm/filemap.c:134!
[ 127.186782] invalid opcode: 0000 [#1] SMP
[ 127.186783] CPU 7
[ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[ 127.186801]
[ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
[ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[ 127.186821] Stack:
[ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[ 127.186827] Call Trace:
[ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0
[ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50
[ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130
[ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0
[ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230
[ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30
[ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80
[ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360
[ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[ 127.186870] RSP <ffff8804144b5c08>
[ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---
The bug is a race and not always easy to reproduce. To reproduce it I was
doing the following on a single socket I7-based machine with 16G of RAM.
$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done
On my particular machine, it usually triggers within 10 minutes but
enabling debug options can change the timing such that it never hits.
Once the bug is triggered, the machine is in trouble and needs to be
rebooted. The machine will respond but processes accessing proc like "ps
aux" will hang due to the BUG_ON. shutdown will also hang and needs a
hard reset or a sysrq-b.
The basic problem is a race between page table sharing and teardown. For
the most part page table sharing depends on i_mmap_mutex. In some cases,
it is also taking the mm->page_table_lock for the PTE updates but with
shared page tables, it is the i_mmap_mutex that is more important.
Unfortunately it appears to be also insufficient. Consider the following
situation
Process A Process B
--------- ---------
hugetlb_fault shmdt
LockWrite(mmap_sem)
do_munmap
unmap_region
unmap_vmas
unmap_single_vma
unmap_hugepage_range
Lock(i_mmap_mutex)
Lock(mm->page_table_lock)
huge_pmd_unshare/unmap tables <--- (1)
Unlock(mm->page_table_lock)
Unlock(i_mmap_mutex)
huge_pte_alloc ...
Lock(i_mmap_mutex) ...
vma_prio_walk, find svma, spte ...
Lock(mm->page_table_lock) ...
share spte ...
Unlock(mm->page_table_lock) ...
Unlock(i_mmap_mutex) ...
hugetlb_no_page <--- (2)
free_pgtables
unlink_file_vma
hugetlb_free_pgd_range
remove_vma_list
In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down. The i_mmap_mutex on its own
does not prevent Process A walking Process B's page tables. At (1) above,
the page tables are not shared yet so it unmaps the PMDs. Process A sets
up page table sharing and at (2) faults a new entry. Process B then trips
up on it in free_pgtables.
This patch fixes the problem by adding a new function
__unmap_hugepage_range_final that is only called when the VMA is about to
be destroyed. This function clears VM_MAYSHARE during
unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA
ineligible for sharing and avoids the race. Superficially this looks like
it would then be vulnerable to truncate and madvise issues but hugetlbfs
has its own truncate handlers so does not use unmap_mapping_range() and
does not support madvise(DONTNEED).
This should be treated as a -stable candidate if it is merged.
Test program is as follows. The test case was mostly written by Michal
Hocko with a few minor changes to reproduce this bug.
==== CUT HERE ====
static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;
unsigned int get_random(unsigned int max)
{
struct timeval tv;
gettimeofday(&tv, NULL);
srandom(tv.tv_usec);
return random() % max;
}
static void play(void *addr, size_t size)
{
unsigned char *start = addr,
*end = start + size,
*a;
start += get_random(size/2);
/* we could iterate on huge pages but let's give it more time. */
for (a = start; a < end; a += 4096)
*a = 0;
}
int main(int argc, char **argv)
{
key_t key = IPC_PRIVATE;
size_t sizeA = nr_huge_page_A * huge_page_size;
size_t sizeB = nr_huge_page_B * huge_page_size;
int shmidA, shmidB;
void *addrA = NULL, *addrB = NULL;
int nr_children = 300, n = 0;
if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
perror("shmget:");
return 1;
}
if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
perror("shmat");
return 1;
}
if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
perror("shmget:");
return 1;
}
if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
perror("shmat");
return 1;
}
fork_child:
switch(fork()) {
case 0:
switch (n%3) {
case 0:
play(addrA, sizeA);
break;
case 1:
play(addrB, sizeB);
break;
case 2:
break;
}
break;
case -1:
perror("fork:");
break;
default:
if (++n < nr_children)
goto fork_child;
play(addrA, sizeA);
break;
}
shmdt(addrA);
shmdt(addrB);
do {
wait(NULL);
} while (--n > 0);
shmctl(shmidA, IPC_RMID, NULL);
shmctl(shmidB, IPC_RMID, NULL);
return 0;
}
[akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:46:20 +04:00
|
|
|
void __unmap_hugepage_range_final(struct mmu_gather *tlb,
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
unsigned long start, unsigned long end,
|
|
|
|
struct page *ref_page);
|
2008-10-15 23:50:22 +04:00
|
|
|
void hugetlb_report_meminfo(struct seq_file *);
|
2020-09-16 23:40:43 +03:00
|
|
|
int hugetlb_report_node_meminfo(char *buf, int len, int nid);
|
2013-04-30 02:07:48 +04:00
|
|
|
void hugetlb_show_meminfo(void);
|
2005-04-17 02:20:36 +04:00
|
|
|
unsigned long hugetlb_total_pages(void);
|
2018-08-24 03:01:36 +03:00
|
|
|
vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
|
2009-06-23 16:49:05 +04:00
|
|
|
unsigned long address, unsigned int flags);
|
2021-05-05 04:35:45 +03:00
|
|
|
#ifdef CONFIG_USERFAULTFD
|
2017-02-23 02:42:52 +03:00
|
|
|
int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
|
|
|
|
struct vm_area_struct *dst_vma,
|
|
|
|
unsigned long dst_addr,
|
|
|
|
unsigned long src_addr,
|
userfaultfd: add UFFDIO_CONTINUE ioctl
This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It
might change the contents of the page using its second non-UFFD mapping,
or not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".
Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.
It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
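A minimal userspace sketch of the resolution step described above (illustrative only: it assumes a uffd already registered with UFFDIO_REGISTER_MODE_MINOR and a fault address read from the event queue; resolve_minor_fault() is a made-up helper name):
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Tell the kernel the page backing [addr, addr + len) is already
 * correct, so it can carry on setting up the mapping. */
static int resolve_minor_fault(int uffd, unsigned long addr, unsigned long len)
{
	struct uffdio_continue cont;

	memset(&cont, 0, sizeof(cont));
	cont.range.start = addr;	/* must be huge-page aligned */
	cont.range.len = len;
	cont.mode = 0;

	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}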
2021-05-05 04:35:49 +03:00
				enum mcopy_atomic_mode mode,
2017-02-23 02:42:52 +03:00
				struct page **pagep);
2021-05-05 04:35:45 +03:00
#endif /* CONFIG_USERFAULTFD */
2021-02-24 23:09:54 +03:00
bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
2009-02-10 17:02:27 +03:00
				struct vm_area_struct *vma,
2011-05-26 14:16:19 +04:00
				vm_flags_t vm_flags);
2015-09-09 01:01:41 +03:00
long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
				long freed);
mm: migrate: make core migration code aware of hugepage
Currently hugepage migration is available only for soft offlining, but
it's also useful for some other users of page migration (clearly because
users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
So this patchset tries to extend such users to support hugepage migration.
The target of this patchset is to enable hugepage migration for NUMA
related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
memory hotplug.
This patchset does not add hugepage migration for memory compaction,
because users of memory compaction mainly expect to construct thp by
arranging raw pages, and there's little or no need to compact hugepages.
CMA, another user of page migration, can have benefit from hugepage
migration, but is not enabled to support it for now (just because of lack
of testing and expertise in CMA.)
Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
x86_64, or hugepages in architectures like ia64) is not enabled for now
(again, because of lack of testing.)
As for how these are achieved, I extended the API (migrate_pages()) to
handle hugepage (with patch 1 and 2) and adjusted code of each caller to
check and collect movable hugepages (with patch 3-7). Remaining 2 patches
are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is
about making sure that we only migrate pmd-based hugepages. And patch 9
is about choosing appropriate zone for hugepage allocation.
My test is mainly functional one, simply kicking hugepage migration via
each entry point and confirm that migration is done correctly. Test code
is available here:
git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git
And I always run libhugetlbfs test when changing hugetlbfs's code. With
this patchset, no regression was found in the test.
This patch (of 9):
Before enabling each user of page migration to support hugepage,
this patch enables the list of pages for migration to link not only
LRU pages, but also hugepages. As a result, putback_movable_pages()
and migrate_pages() can handle both LRU pages and hugepages.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
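As a hedged illustration of what this series enables, one of the NUMA entry points can now be pointed at hugetlb-backed memory; a sketch using move_pages(2) via libnuma (link with -lnuma; the target node and 2MB page size are assumptions):
#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>

#define HPS 0x200000	/* assumed 2MB huge page size */

int main(void)
{
	void *pages[1];
	int nodes[1] = { 1 };	/* assumed destination node */
	int status[1] = { 0 };

	/* Back a mapping with one hugepage and fault it in. */
	void *p = mmap(NULL, HPS, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	*(char *)p = 0;

	/* With hugepage migration in place, the kernel can move it. */
	pages[0] = p;
	if (move_pages(0 /* self */, 1, pages, nodes, status,
		       MPOL_MF_MOVE) == -1)
		perror("move_pages");
	else
		printf("page now on node %d\n", status[0]);
	return 0;
}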
2013-09-12 01:21:59 +04:00
bool isolate_huge_page(struct page *page, struct list_head *list);
2021-06-16 04:23:13 +03:00
int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion, which causes setting PageHWPoison flag on the wrong page.
The one simple result is that wrong processes can be killed, but another
(more serious) one is that the actual error is left unhandled, so no one
prevents later access to it, and that might lead to more serious results
like consuming corrupted data.
Think about the below race window:
CPU 1                                   CPU 2
memory_failure_hugetlb
struct page *head = compound_head(p);
                                        hugetlb page might be freed to
                                        buddy, or even changed to another
                                        compound page.
get_hwpoison_page -- page is not what we want now...
The current code does rough prechecks first and then reconfirms after
taking the refcount, but that was found to make the code overly
complicated, so move the prechecks into a single hugetlb_lock range.
A newly introduced function, try_memory_failure_hugetlb(), always takes
hugetlb_lock (even for non-hugetlb pages). That can be improved, but
memory_failure() is rare in principle, so should not be a big problem.
Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
Fixes: 761ad8d7c7b5 ("mm: hwpoison: introduce memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
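A hedged way to exercise this path from userspace is software poison injection; a minimal sketch (assumes CONFIG_MEMORY_FAILURE and sufficient privilege, since MADV_HWPOISON is a testing aid):
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	void *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	*(char *)p = 0;	/* instantiate the hugetlb page */

	/* Simulate a memory error on the backing page; the kernel runs
	 * memory_failure() against it, hitting the hugetlb handling. */
	if (madvise(p, 4096, MADV_HWPOISON))
		perror("madvise(MADV_HWPOISON)");
	return 0;
}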
2022-04-22 02:35:33 +03:00
int get_huge_page_for_hwpoison(unsigned long pfn, int flags);
mm: migrate: make core migration code aware of hugepage
2013-09-12 01:21:59 +04:00
void putback_active_hugepage(struct page *page);
2018-02-01 03:20:48 +03:00
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
2014-07-31 03:08:39 +04:00
void free_huge_page(struct page *page);
2016-10-08 03:02:01 +03:00
void hugetlb_fix_reserve_counts(struct inode *inode);
2015-09-09 01:01:35 +03:00
extern struct mutex *hugetlb_fault_mutex_table;
2019-12-01 04:57:02 +03:00
u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
2005-04-17 02:20:36 +04:00

2021-05-05 04:33:00 +03:00
pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
			unsigned long addr, pud_t *pud);
2013-04-23 15:35:02 +04:00
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:
1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.
A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.
To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.
To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive as hugetlbfs
processing is done separately from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.
The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.
[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
This patch (of 2):
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.
  mapping->i_mmap_rwsem
    hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
      page->flags PG_locked (lock_page)
To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.
In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
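A kernel-flavoured sketch of the ordering established above (illustrative only; hugetlb_fault_locking_sketch() is a made-up name, while i_mmap_lock_read(), the fault mutex table and lock_page() are the interfaces the series builds on):
/* Lock ordering for PageHuge() pages after this series:
 *   1. mapping->i_mmap_rwsem (read mode in the fault path)
 *   2. hugetlb_fault_mutex
 *   3. page lock (PG_locked)
 */
static void hugetlb_fault_locking_sketch(struct address_space *mapping,
					 pgoff_t idx, struct page *page)
{
	u32 hash;

	i_mmap_lock_read(mapping);
	hash = hugetlb_fault_mutex_hash(mapping, idx);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	lock_page(page);

	/* ... fault processing; a ptep from huge_pte_alloc() stays valid
	 * here because huge_pmd_unshare needs i_mmap_rwsem in write mode. */

	unlock_page(page);
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	i_mmap_unlock_read(mapping);
}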
2020-04-02 07:11:05 +03:00
struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
2005-04-17 02:20:36 +04:00
extern int sysctl_hugetlb_shm_group;
2008-07-24 08:27:52 +04:00
extern struct list_head huge_boot_pages;
2005-04-17 02:20:36 +04:00

2005-06-22 04:14:44 +04:00
/* arch callbacks */
2021-05-05 04:33:00 +03:00
pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
2008-07-24 08:27:41 +04:00
			unsigned long addr, unsigned long sz);
2017-07-07 01:39:42 +03:00
pte_t *huge_pte_offset(struct mm_struct *mm,
			unsigned long addr, unsigned long sz);
2020-08-12 04:31:38 +03:00
int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
			unsigned long *addr, pte_t *ptep);
2018-10-06 01:51:29 +03:00
void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
			unsigned long *start, unsigned long *end);
2005-06-22 04:14:44 +04:00
struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
			int write);
2017-07-07 01:38:56 +03:00
struct page *follow_huge_pd(struct vm_area_struct *vma,
			unsigned long address, hugepd_t hpd,
			int flags, int pdshift);
2005-06-22 04:14:44 +04:00
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
mm/hugetlb: take page table lock in follow_huge_pmd()
We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing. This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.
This patch intentionally removes the page==NULL check after pte_page().
This is justified because pte_page() never returns NULL for any
architectures or configurations.
This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned. So the caller must be changed to
properly handle the returned tail pages.
We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.
Here is the reproducer:
$ cat movepages.c
#include <stdio.h>
#include <stdlib.h>
#include <err.h>
#include <sys/types.h>
#include <numaif.h>

#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
#define PS 0x1000

int main(int argc, char *argv[]) {
	int i;
	int nr_hp = strtol(argv[1], NULL, 0);
	int nr_p  = nr_hp * HPS / PS;
	int ret;
	void **addrs;
	int *status;
	int *nodes;
	pid_t pid;

	pid = strtol(argv[2], NULL, 0);
	addrs  = malloc(sizeof(char *) * nr_p + 1);
	status = malloc(sizeof(char *) * nr_p + 1);
	nodes  = malloc(sizeof(char *) * nr_p + 1);

	while (1) {
		/* move the target's pages to node 1 ... */
		for (i = 0; i < nr_p; i++) {
			addrs[i] = (void *)ADDR_INPUT + i * PS;
			nodes[i] = 1;
			status[i] = 0;
		}
		ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
				      MPOL_MF_MOVE_ALL);
		if (ret == -1)
			err(1, "move_pages");
		/* ... and back to node 0, forever */
		for (i = 0; i < nr_p; i++) {
			addrs[i] = (void *)ADDR_INPUT + i * PS;
			nodes[i] = 0;
			status[i] = 0;
		}
		ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
				      MPOL_MF_MOVE_ALL);
		if (ret == -1)
			err(1, "move_pages");
	}
	return 0;
}
$ cat hugepage.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000

int main(int argc, char *argv[]) {
	int nr_hp = strtol(argv[1], NULL, 0);
	char *p;

	while (1) {
		/* map, touch and unmap hugepages at a fixed address */
		p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p != (void *)ADDR_INPUT) {
			perror("mmap");
			break;
		}
		memset(p, 0, nr_hp * HPS);
		munmap(p, nr_hp * HPS);
	}
}
$ sysctl vm.nr_hugepages=40
$ ./hugepage 10 &
$ ./movepages 10 $(pgrep -f hugepage)
Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 02:25:22 +03:00
			pmd_t *pmd, int flags);
2008-07-24 08:27:50 +04:00
struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
mm/hugetlb: take page table lock in follow_huge_pmd()
2015-02-12 02:25:22 +03:00
			pud_t *pud, int flags);
2017-07-07 01:38:50 +03:00
struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
			pgd_t *pgd, int flags);
2005-06-22 04:14:44 +04:00
int pmd_huge(pmd_t pmd);
2017-03-09 17:24:07 +03:00
int pud_huge(pud_t pud);
2012-11-19 06:14:23 +04:00
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
2006-03-22 11:08:50 +03:00
			unsigned long address, unsigned long end, pgprot_t newprot);
2005-06-22 04:14:44 +04:00

2017-07-07 01:38:47 +03:00
bool is_hugetlb_entry_migration(pte_t pte);
2021-05-05 04:33:13 +03:00
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
2018-02-01 03:20:48 +03:00

2005-04-17 02:20:36 +04:00
#else /* !CONFIG_HUGETLB_PAGE */

2008-07-24 08:27:23 +04:00
static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
{
}

2021-11-05 23:41:40 +03:00
static inline void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
{
}

2005-04-17 02:20:36 +04:00
static inline unsigned long hugetlb_total_pages(void)
{
	return 0;
}
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
2020-04-02 07:11:05 +03:00
static inline struct address_space *hugetlb_page_mapping_lock_write(
					struct page *hpage)
{
	return NULL;
}

2020-08-12 04:31:38 +03:00
static inline int huge_pmd_unshare(struct mm_struct *mm,
					struct vm_area_struct *vma,
					unsigned long *addr, pte_t *ptep)
2018-10-06 01:51:29 +03:00
{
	return 0;
}

static inline void adjust_range_if_pmd_sharing_possible(
					struct vm_area_struct *vma,
					unsigned long *start, unsigned long *end)
{
}

2019-12-01 04:56:40 +03:00
static inline long follow_hugetlb_page(struct mm_struct *mm,
			struct vm_area_struct *vma, struct page **pages,
			struct vm_area_struct **vmas, unsigned long *position,
			unsigned long *nr_pages, long i, unsigned int flags,
			int *nonblocking)
{
	BUG();
	return 0;
}

static inline struct page *follow_huge_addr(struct mm_struct *mm,
					unsigned long address, int write)
{
	return ERR_PTR(-EINVAL);
}

static inline int copy_hugetlb_page_range(struct mm_struct *dst,
			struct mm_struct *src, struct vm_area_struct *vma)
{
	BUG();
	return 0;
}

2021-11-05 23:41:40 +03:00
static inline int move_hugetlb_page_tables(struct vm_area_struct *vma,
					struct vm_area_struct *new_vma,
					unsigned long old_addr,
					unsigned long new_addr,
					unsigned long len)
{
	BUG();
	return 0;
}

2008-10-15 23:50:22 +04:00
static inline void hugetlb_report_meminfo(struct seq_file *m)
{
}
2019-12-01 04:56:40 +03:00
2020-09-16 23:40:43 +03:00
static inline int hugetlb_report_node_meminfo(char *buf, int len, int nid)
2019-12-01 04:56:40 +03:00
{
	return 0;
}

2013-04-30 02:07:48 +04:00
static inline void hugetlb_show_meminfo(void)
{
}
2019-12-01 04:56:40 +03:00

static inline struct page *follow_huge_pd(struct vm_area_struct *vma,
			unsigned long address, hugepd_t hpd, int flags,
			int pdshift)
{
	return NULL;
}

static inline struct page *follow_huge_pmd(struct mm_struct *mm,
			unsigned long address, pmd_t *pmd, int flags)
{
	return NULL;
}

static inline struct page *follow_huge_pud(struct mm_struct *mm,
			unsigned long address, pud_t *pud, int flags)
{
	return NULL;
}

static inline struct page *follow_huge_pgd(struct mm_struct *mm,
			unsigned long address, pgd_t *pgd, int flags)
{
	return NULL;
}

static inline int prepare_hugepage_range(struct file *file,
			unsigned long addr, unsigned long len)
{
	return -EINVAL;
}

static inline int pmd_huge(pmd_t pmd)
{
	return 0;
}

static inline int pud_huge(pud_t pud)
{
	return 0;
}

static inline int is_hugepage_only_range(struct mm_struct *mm,
			unsigned long addr, unsigned long len)
{
	return 0;
}

static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
			unsigned long addr, unsigned long end,
			unsigned long floor, unsigned long ceiling)
{
	BUG();
}

2021-05-05 04:35:45 +03:00
#ifdef CONFIG_USERFAULTFD
2019-12-01 04:56:40 +03:00
static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
						pte_t *dst_pte,
						struct vm_area_struct *dst_vma,
						unsigned long dst_addr,
						unsigned long src_addr,
userfaultfd: add UFFDIO_CONTINUE ioctl
2021-05-05 04:35:49 +03:00
						enum mcopy_atomic_mode mode,
2019-12-01 04:56:40 +03:00
						struct page **pagep)
{
	BUG();
	return 0;
}
2021-05-05 04:35:45 +03:00
#endif /* CONFIG_USERFAULTFD */
2019-12-01 04:56:40 +03:00

static inline pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
				     unsigned long sz)
{
	return NULL;
}
2012-08-01 03:42:03 +04:00
2013-12-13 05:12:19 +04:00
static inline bool isolate_huge_page(struct page *page, struct list_head *list)
{
	return false;
}
2005-04-17 02:20:36 +04:00
2021-06-16 04:23:13 +03:00
static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
{
	return 0;
}
mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
2022-04-22 02:35:33 +03:00
static inline int get_huge_page_for_hwpoison(unsigned long pfn, int flags)
{
	return 0;
}

2019-12-01 04:56:40 +03:00
static inline void putback_active_hugepage(struct page *page)
{
}

static inline void move_hugetlb_state(struct page *oldpage,
					struct page *newpage, int reason)
{
}

static inline unsigned long hugetlb_change_protection(
			struct vm_area_struct *vma, unsigned long address,
			unsigned long end, pgprot_t newprot)
2012-11-19 06:14:23 +04:00
{
	return 0;
}
2006-03-22 11:08:50 +03:00
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily with children some of whom fault and
others which destroy the mapping then it is possible for page tables to
get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log. The final teardown will trigger a
BUG_ON in mm/filemap.c.
This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37. It was probably introduced
in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
look like this;
[ ..........] Lots of bad pmd messages followed by this
[ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[ 127.186778] ------------[ cut here ]------------
[ 127.186781] kernel BUG at mm/filemap.c:134!
[ 127.186782] invalid opcode: 0000 [#1] SMP
[ 127.186783] CPU 7
[ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[ 127.186801]
[ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
[ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[ 127.186821] Stack:
[ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[ 127.186827] Call Trace:
[ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0
[ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50
[ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130
[ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0
[ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230
[ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30
[ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80
[ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360
[ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[ 127.186870] RSP <ffff8804144b5c08>
[ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---
The bug is a race and not always easy to reproduce. To reproduce it I was
doing the following on a single socket I7-based machine with 16G of RAM.
$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done
On my particular machine, it usually triggers within 10 minutes but
enabling debug options can change the timing such that it never hits.
Once the bug is triggered, the machine is in trouble and needs to be
rebooted. The machine will respond but processes accessing proc like "ps
aux" will hang due to the BUG_ON. shutdown will also hang and needs a
hard reset or a sysrq-b.
The basic problem is a race between page table sharing and teardown. For
the most part page table sharing depends on i_mmap_mutex. In some cases,
it is also taking the mm->page_table_lock for the PTE updates but with
shared page tables, it is the i_mmap_mutex that is more important.
Unfortunately it appears to be also insufficient. Consider the following
situation
Process A                               Process B
---------                               ---------
hugetlb_fault                           shmdt
                                        LockWrite(mmap_sem)
                                          do_munmap
                                            unmap_region
                                              unmap_vmas
                                                unmap_single_vma
                                                  unmap_hugepage_range
                                                    Lock(i_mmap_mutex)
                                                    Lock(mm->page_table_lock)
                                                    huge_pmd_unshare/unmap tables <--- (1)
                                                    Unlock(mm->page_table_lock)
                                                    Unlock(i_mmap_mutex)
huge_pte_alloc                          ...
  Lock(i_mmap_mutex)                    ...
  vma_prio_walk, find svma, spte        ...
  Lock(mm->page_table_lock)             ...
  share spte                            ...
  Unlock(mm->page_table_lock)           ...
  Unlock(i_mmap_mutex)                  ...
hugetlb_no_page                         <--- (2)
                                        free_pgtables
                                          unlink_file_vma
                                          hugetlb_free_pgd_range
                                        remove_vma_list
In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down. The i_mmap_mutex on its own
does not prevent Process A walking Process B's page tables. At (1) above,
the page tables are not shared yet so it unmaps the PMDs. Process A sets
up page table sharing and at (2) faults a new entry. Process B then trips
up on it in free_pgtables.
This patch fixes the problem by adding a new function
__unmap_hugepage_range_final that is only called when the VMA is about to
be destroyed. This function clears VM_MAYSHARE during
unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA
ineligible for sharing and avoids the race. Superficially this looks like
it would then be vulnerable to truncate and madvise issues but hugetlbfs
has its own truncate handlers so does not use unmap_mapping_range() and
does not support madvise(DONTNEED).
This should be treated as a -stable candidate if it is merged.
Test program is as follows. The test case was mostly written by Michal
Hocko with a few minor changes to reproduce this bug.
==== CUT HERE ====
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>

static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;

unsigned int get_random(unsigned int max)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	srandom(tv.tv_usec);
	return random() % max;
}

static void play(void *addr, size_t size)
{
	unsigned char *start = addr,
		      *end = start + size,
		      *a;
	start += get_random(size/2);

	/* we could iterate on huge pages but let's give it more time. */
	for (a = start; a < end; a += 4096)
		*a = 0;
}

int main(int argc, char **argv)
{
	key_t key = IPC_PRIVATE;
	size_t sizeA = nr_huge_page_A * huge_page_size;
	size_t sizeB = nr_huge_page_B * huge_page_size;
	int shmidA, shmidB;
	void *addrA = NULL, *addrB = NULL;
	int nr_children = 300, n = 0;

	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}
	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}
	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}
	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}

fork_child:
	switch (fork()) {
	case 0:
		switch (n%3) {
		case 0:
			play(addrA, sizeA);
			break;
		case 1:
			play(addrB, sizeB);
			break;
		case 2:
			break;
		}
		break;
	case -1:
		perror("fork:");
		break;
	default:
		if (++n < nr_children)
			goto fork_child;
		play(addrA, sizeA);
		break;
	}
	shmdt(addrA);
	shmdt(addrB);
	do {
		wait(NULL);
	} while (--n > 0);

	shmctl(shmidA, IPC_RMID, NULL);
	shmctl(shmidB, IPC_RMID, NULL);
	return 0;
}
[akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:46:20 +04:00
static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
			struct vm_area_struct *vma, unsigned long start,
			unsigned long end, struct page *ref_page)
{
	BUG();
}

2019-03-29 06:43:51 +03:00
static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
2019-12-01 04:56:40 +03:00
			struct vm_area_struct *vma, unsigned long address,
			unsigned int flags)
2019-03-29 06:43:51 +03:00
{
	BUG();
	return 0;
}
2012-08-01 03:42:03 +04:00

2021-05-05 04:33:13 +03:00
static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }

2005-04-17 02:20:36 +04:00
#endif /* !CONFIG_HUGETLB_PAGE */
2014-11-05 19:27:40 +03:00
/*
 * hugepages at the page global directory. If the arch supports
 * hugepages at the pgd level, it needs to define this.
 */
#ifndef pgd_huge
#define pgd_huge(x)	0
#endif
2017-03-09 17:24:07 +03:00
#ifndef p4d_huge
#define p4d_huge(x)	0
#endif
2014-11-05 19:27:40 +03:00

#ifndef pgd_write
static inline int pgd_write(pgd_t pgd)
{
	BUG();
	return 0;
}
#endif
2009-09-22 04:03:47 +04:00
#define HUGETLB_ANON_FILE	"anon_hugepage"

2009-09-22 04:03:43 +04:00
enum {
	/*
	 * The file will be used as an shm file so shmfs accounting rules
	 * apply
	 */
	HUGETLB_SHMFS_INODE = 1,
2009-09-22 04:03:47 +04:00
	/*
	 * The file is being created on the internal vfs mount and shmfs
	 * accounting rules do not apply
	 */
	HUGETLB_ANONHUGE_INODE = 2,
2009-09-22 04:03:43 +04:00
};

2005-04-17 02:20:36 +04:00
#ifdef CONFIG_HUGETLBFS
struct hugetlbfs_sb_info {
	long	max_inodes;	/* inodes allowed */
	long	free_inodes;	/* inodes free */
	spinlock_t	stat_lock;
2008-07-24 08:27:43 +04:00
	struct hstate *hstate;
hugepages: fix use after free bug in "quota" handling
hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
general quota handling code, and they don't much resemble its behaviour.
Rather than being about maintaining limits on on-disk block usage by
particular users, they are instead about maintaining limits on in-memory
page usage (including anonymous MAP_PRIVATE copied-on-write pages)
associated with a particular hugetlbfs filesystem instance.
Worse, they work by having callbacks to the hugetlbfs filesystem code from
the low-level page handling code, in particular from free_huge_page().
This is a layering violation of itself, but more importantly, if the
kernel does a get_user_pages() on hugepages (which can happen from KVM
amongst others), then the free_huge_page() can be delayed until after the
associated inode has already been freed. If an unmount occurs at the
wrong time, even the hugetlbfs superblock where the "quota" limits are
stored may have been freed.
Andrew Barry proposed a patch to fix this by having hugepages, instead of
storing a pointer to their address_space and reaching the superblock from
there, store pointers directly to the superblock, bumping the reference
count as appropriate to avoid it being freed.
Andrew Morton rejected that version, however, on the grounds that it made
the existing layering violation worse.
This is a reworked version of Andrew's patch, which removes the extra, and
some of the existing, layering violation. It works by introducing the
concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
finite logical pool of hugepages to allocate from. hugetlbfs now creates
a subpool for each filesystem instance with a page limit set, and a
pointer to the subpool gets added to each allocated hugepage, instead of
the address_space pointer used now. The subpool has its own lifetime and
is only freed once all pages in it _and_ all other references to it (i.e.
superblocks) are gone.
subpools are optional - a NULL subpool pointer is taken by the code to
mean that no subpool limits are in effect.
Previous discussion of this bug found in: "Fix refcounting in hugetlbfs
quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
http://marc.info/?l=linux-mm&m=126928970510627&w=1
v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
alloc_huge_page() - since it already takes the vma, it is not necessary.
Signed-off-by: Andrew Barry <abarry@cray.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
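A hedged sketch of the subpool idea (simplified; the real struct hugepage_subpool carries more state, and subpool_sketch and its helpers are made-up names; a NULL pointer means "no limits in effect"):
struct subpool_sketch {
	spinlock_t lock;
	long max_hpages;	/* page limit, -1 for no limit */
	long used_hpages;	/* pages charged to this subpool */
};

/* Charge one hugepage to the subpool at allocation time. */
static int subpool_get_page(struct subpool_sketch *spool)
{
	int ret = 0;

	if (!spool)		/* no subpool: no limits in effect */
		return 0;

	spin_lock(&spool->lock);
	if (spool->max_hpages != -1 &&
	    spool->used_hpages >= spool->max_hpages)
		ret = -ENOMEM;
	else
		spool->used_hpages++;
	spin_unlock(&spool->lock);
	return ret;
}

/* Uncharge on free; with the page carrying the subpool pointer, this
 * works even after the inode that created the mapping is gone. */
static void subpool_put_page(struct subpool_sketch *spool)
{
	if (!spool)
		return;
	spin_lock(&spool->lock);
	spool->used_hpages--;
	spin_unlock(&spool->lock);
}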
2012-03-22 03:34:12 +04:00
	struct hugepage_subpool *spool;
2017-07-05 18:24:18 +03:00
	kuid_t uid;
	kgid_t gid;
	umode_t mode;
2005-04-17 02:20:36 +04:00
};

static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
{
	return sb->s_fs_info;
}

2018-02-01 03:19:22 +03:00
struct hugetlbfs_inode_info {
	struct shared_policy policy;
	struct inode vfs_inode;
2018-02-01 03:19:25 +03:00
	unsigned int seals;
2018-02-01 03:19:22 +03:00
};

static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
{
	return container_of(inode, struct hugetlbfs_inode_info, vfs_inode);
}

2006-03-28 13:56:42 +04:00
extern const struct file_operations hugetlbfs_file_operations;
2009-09-27 22:29:37 +04:00
extern const struct vm_operations_struct hugetlb_vm_ops;
2013-05-08 03:18:13 +04:00
struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
2021-11-09 05:31:27 +03:00
				int creat_flags, int page_size_log);
2005-04-17 02:20:36 +04:00

2016-01-15 02:18:51 +03:00
static inline bool is_file_hugepages(struct file *file)
2005-04-17 02:20:36 +04:00
{
2007-03-02 02:46:08 +03:00
	if (file->f_op == &hugetlbfs_file_operations)
2016-01-15 02:18:51 +03:00
		return true;
2007-03-02 02:46:08 +03:00

2016-01-15 02:18:51 +03:00
	return is_file_shm_hugepages(file);
2005-04-17 02:20:36 +04:00
}

2020-04-02 07:11:54 +03:00
static inline struct hstate *hstate_inode(struct inode *i)
{
	return HUGETLBFS_SB(i->i_sb)->hstate;
}
2005-04-17 02:20:36 +04:00
#else /* !CONFIG_HUGETLBFS */

2016-01-15 02:18:51 +03:00
#define is_file_hugepages(file)	false
2012-03-22 03:34:14 +04:00
static inline struct file *
2013-05-08 03:18:13 +04:00
hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
2021-11-09 05:31:27 +03:00
		   int creat_flags, int page_size_log)
2009-09-25 01:47:45 +04:00
{
	return ERR_PTR(-ENOSYS);
}
2005-04-17 02:20:36 +04:00

2020-04-02 07:11:54 +03:00
static inline struct hstate *hstate_inode(struct inode *i)
{
	return NULL;
}
2005-04-17 02:20:36 +04:00
#endif /* !CONFIG_HUGETLBFS */

2007-05-07 01:49:00 +04:00
#ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
					unsigned long len, unsigned long pgoff,
					unsigned long flags);
#endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
hugetlb: use page.private for hugetlb specific page flags
Patch series "create hugetlb flags to consolidate state", v3.
While discussing a series of hugetlb fixes in [1], it became evident that
the hugetlb specific page state information is stored in a somewhat
haphazard manner. Code dealing with state information would be easier to
read, understand and maintain if this information was stored in a
consistent manner.
This series uses page.private of the hugetlb head page for storing a set
of hugetlb specific page flags. Routines are priovided for test, set and
clear of the flags.
[1] https://lore.kernel.org/r/20210106084739.63318-1-songmuchun@bytedance.com
This patch (of 4):
As hugetlbfs evolved, state information about hugetlb pages was added.
One 'convenient' way of doing this was to use available fields in tail
pages. Over time, it has become difficult to know the meaning or contents
of fields simply by looking at a small bit of code. Sometimes, the naming
is just confusing. For example: The PagePrivate flag indicates a huge
page reservation was consumed and needs to be restored if an error is
encountered and the page is freed before it is instantiated. The
page.private field contains the pointer to a subpool if the page is
associated with one.
In an effort to make the code more readable, use page.private to contain
hugetlb specific page flags. These flags will have test, set and clear
functions similar to those used for 'normal' page flags. More
importantly, an enum of flag values will be created with names that
actually reflect their purpose.
In this patch,
- Create infrastructure for hugetlb specific page flag functions
- Move subpool pointer to page[1].private to make way for flags
Create routines with meaningful names to modify subpool field
- Use new HPageRestoreReserve flag instead of PagePrivate
Conversion of other state information will happen in subsequent patches.
Link: https://lkml.kernel.org/r/20210122195231.324857-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20210122195231.324857-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 23:08:51 +03:00
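
To make the effect concrete, a minimal sketch (the surrounding error path
and the locals h, vma and address are illustrative, not lifted from a
specific function):

	/*
	 * Sketch: where an error path previously tested PagePrivate(page)
	 * to decide whether a consumed reservation must be given back,
	 * it now tests the named flag.
	 */
	if (HPageRestoreReserve(page))
		restore_reserve_on_error(h, vma, address, page);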

/*
 * hugetlb specific state flags.  These flags are located in page.private
 * of the hugetlb head page.  Functions created via the below macros should be
 * used to manipulate these flags.
 *
 * HPG_restore_reserve - Set when a hugetlb page consumes a reservation at
 *	allocation time.  Cleared when page is fully instantiated.  Free
 *	routine checks flag to restore a reservation on error paths.
 *	Synchronization:  Examined or modified by code that knows it has
 *	the only reference to page.  i.e. After allocation but before use
 *	or when the page is being freed.
 * HPG_migratable - Set after a newly allocated page is added to the page
 *	cache and/or page tables.  Indicates the page is a candidate for
 *	migration.
 *	Synchronization:  Initially set after new page allocation with no
 *	locking.  When examined and modified during migration processing
 *	(isolate, migrate, putback) the hugetlb_lock is held.
 * HPG_temporary - Set on a page that is temporarily allocated from the buddy
 *	allocator.  Typically used for migration target pages when no pages
 *	are available in the pool.  The hugetlb free page path will
 *	immediately free pages with this flag set to the buddy allocator.
 *	Synchronization: Can be set after huge page allocation from buddy when
 *	code knows it has only reference.  All other examinations and
 *	modifications require hugetlb_lock.
 * HPG_freed - Set when page is on the free lists.
 *	Synchronization: hugetlb_lock held for examination and modification.
 * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
 */
enum hugetlb_page_flags {
	HPG_restore_reserve = 0,
	HPG_migratable,
	HPG_temporary,
	HPG_freed,
	HPG_vmemmap_optimized,
	__NR_HPAGEFLAGS,
};

/*
 * Macros to create test, set and clear function definitions for
 * hugetlb specific page flags.
 */
#ifdef CONFIG_HUGETLB_PAGE
#define TESTHPAGEFLAG(uname, flname)				\
static inline int HPage##uname(struct page *page)		\
	{ return test_bit(HPG_##flname, &(page->private)); }

#define SETHPAGEFLAG(uname, flname)				\
static inline void SetHPage##uname(struct page *page)		\
	{ set_bit(HPG_##flname, &(page->private)); }

#define CLEARHPAGEFLAG(uname, flname)				\
static inline void ClearHPage##uname(struct page *page)	\
	{ clear_bit(HPG_##flname, &(page->private)); }
#else
#define TESTHPAGEFLAG(uname, flname)				\
static inline int HPage##uname(struct page *page)		\
	{ return 0; }

#define SETHPAGEFLAG(uname, flname)				\
static inline void SetHPage##uname(struct page *page)		\
	{ }

#define CLEARHPAGEFLAG(uname, flname)				\
static inline void ClearHPage##uname(struct page *page)	\
	{ }
#endif

#define HPAGEFLAG(uname, flname)				\
	TESTHPAGEFLAG(uname, flname)				\
	SETHPAGEFLAG(uname, flname)				\
	CLEARHPAGEFLAG(uname, flname)				\

/*
 * Create functions associated with hugetlb page flags
 */
HPAGEFLAG(RestoreReserve, restore_reserve)
HPAGEFLAG(Migratable, migratable)
HPAGEFLAG(Temporary, temporary)
HPAGEFLAG(Freed, freed)
HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
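
For reference, expanding HPAGEFLAG(Freed, freed) by hand under
CONFIG_HUGETLB_PAGE yields the following three helpers (sketch, not
literal header text):

static inline int HPageFreed(struct page *page)
	{ return test_bit(HPG_freed, &(page->private)); }
static inline void SetHPageFreed(struct page *page)
	{ set_bit(HPG_freed, &(page->private)); }
static inline void ClearHPageFreed(struct page *page)
	{ clear_bit(HPG_freed, &(page->private)); }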

hugetlb: add demote hugetlb page sysfs interfaces
Patch series "hugetlb: add demote/split page functionality", v4.
The concurrent use of multiple hugetlb page sizes on a single system is
becoming more common.  One of the reasons is better TLB support for
gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
being used to back VMs in hosting environments.
When using hugetlb pages to back VMs, it is often desirable to
preallocate hugetlb pools.  This avoids the delay and uncertainty of
allocating hugetlb pages at VM startup.  In addition, preallocating huge
pages minimizes the issue of memory fragmentation that increases the
longer the system is up and running.
In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes.  Over
time, the preallocated pool of smaller hugetlb pages may become depleted
while larger hugetlb pages still remain.  In such situations, it is
desirable to convert larger hugetlb pages to smaller hugetlb pages.
Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages.  For example, to convert 50 GB pages on x86:
gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages
On an idle system this operation is fairly reliable and results are as
expected.  The number of 2MB pages is increased as expected and the time
of the operation is a second or two.
However, when there is activity on the system the following issues
arise:
1) This process can take quite some time, especially if allocation of
the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
will match the size of the larger page which was freed.  This is
because the area freed by the larger page could quickly be
fragmented.
In a test environment with a load that continually fills the page cache
with clean pages, results such as the following can be observed:
Unexpected number of 2MB pages allocated: Expected 25600, have 19944
real    0m42.092s
user    0m0.008s
sys     0m41.467s
To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting of a hugetlb page to
pages of a smaller size.  This avoids freeing pages to buddy and then
trying to allocate from buddy.
Page demotion is controlled via sysfs files that reside in the per-hugetlb
page size and per node directories.
- demote_size
	Target page size for demotion, a smaller huge page size.  File
	can be written to choose a smaller huge page size if multiple are
	available.
- demote
	Writable number of hugetlb pages to be demoted
To demote 50 GB huge pages, one would:
cat .../hugepages-1048576kB/free_hugepages   /* optional, verify free pages */
cat .../hugepages-1048576kB/demote_size      /* optional, verify target size */
echo 50 > .../hugepages-1048576kB/demote
Only hugetlb pages which are free at the time of the request can be
demoted.  Demotion does not add to the complexity of surplus pages and
honors reserved huge pages.  Therefore, when a value is written to the
sysfs demote file, that value is only the maximum number of pages which
will be demoted.  It is possible fewer will actually be demoted.  The
recently introduced per-hstate mutex is used to synchronize demote
operations with other operations that modify hugetlb pools.
Real world use cases
--------------------
The above scenario describes a real world use case where hugetlb pages
are used to back VMs on x86.  Both issues of long allocation times and
not necessarily getting the expected number of smaller huge pages after
a free and allocate cycle have been experienced.  The occurrence of
these issues is dependent on other activity within the host and cannot
be predicted.
This patch (of 5):
Two new sysfs files are added to demote hugetlb pages.  These files are
both per-hugetlb page size and per node.  Files are:
demote_size - The size in Kb that pages are demoted to. (read-write)
demote - The number of huge pages to demote. (write-only)
By default, demote_size is the next smallest huge page size.  Valid huge
page sizes less than huge page size may be written to this file.  When
huge pages are demoted, they are demoted to this size.
Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.
NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size.  For example, on x86 1GB huge
pages will have demote interfaces.  2MB huge pages will not have demote
interfaces.
This patch does not provide full demote functionality.  It only provides
the sysfs interfaces.
It also provides documentation for the new interfaces.
[mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex]
Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com
Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: David Rientjes <rientjes@google.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

#ifdef CONFIG_HUGETLB_PAGE

#define HSTATE_NAME_LEN 32
/* Defines one hugetlb page size */
struct hstate {
	struct mutex resize_lock;
	int next_nid_to_alloc;
	int next_nid_to_free;
	unsigned int order;
	unsigned int demote_order;
	unsigned long mask;
	unsigned long max_huge_pages;
	unsigned long nr_huge_pages;
	unsigned long free_huge_pages;
	unsigned long resv_huge_pages;
	unsigned long surplus_huge_pages;
	unsigned long nr_overcommit_huge_pages;
	struct list_head hugepage_activelist;
	struct list_head hugepage_freelists[MAX_NUMNODES];
	unsigned int max_huge_pages_node[MAX_NUMNODES];
	unsigned int nr_huge_pages_node[MAX_NUMNODES];
	unsigned int free_huge_pages_node[MAX_NUMNODES];
	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
	unsigned int nr_free_vmemmap_pages;
#endif
#ifdef CONFIG_CGROUP_HUGETLB
	/* cgroup control files */
	struct cftype cgroup_files_dfl[8];
	struct cftype cgroup_files_legacy[10];
#endif
	char name[HSTATE_NAME_LEN];
};

struct huge_bootmem_page {
	struct list_head list;
	struct hstate *hstate;
};

int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
struct page *alloc_huge_page(struct vm_area_struct *vma,
				unsigned long addr, int avoid_reserve);
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
				nodemask_t *nmask, gfp_t gfp_mask);
struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
				unsigned long address);
int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
			pgoff_t idx);
void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
				unsigned long address, struct page *page);

/* arch callback */
int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
int __init alloc_bootmem_huge_page(struct hstate *h, int nid);
bool __init hugetlb_node_alloc_supported(void);

void __init hugetlb_add_hstate(unsigned order);
bool __init arch_hugetlb_valid_size(unsigned long size);
struct hstate *size_to_hstate(unsigned long size);

#ifndef HUGE_MAX_HSTATE
#define HUGE_MAX_HSTATE 1
#endif

extern struct hstate hstates[HUGE_MAX_HSTATE];
extern unsigned int default_hstate_idx;

#define default_hstate (hstates[default_hstate_idx])

/*
 * hugetlb page subpool pointer located in hpage[1].private
 */
static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
{
	return (void *)page_private(hpage + SUBPAGE_INDEX_SUBPOOL);
}

static inline void hugetlb_set_page_subpool(struct page *hpage,
					struct hugepage_subpool *subpool)
{
	set_page_private(hpage + SUBPAGE_INDEX_SUBPOOL, (unsigned long)subpool);
}
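
A brief usage sketch (the wrapper function below is illustrative only):
the subpool is stashed in the head page at allocation time and read back
on the free path.

static void subpool_roundtrip_sketch(struct page *head,
				     struct hugepage_subpool *spool)
{
	hugetlb_set_page_subpool(head, spool);	/* page[1].private = spool */
	VM_BUG_ON(hugetlb_page_subpool(head) != spool);
}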

static inline struct hstate *hstate_file(struct file *f)
{
	return hstate_inode(file_inode(f));
}

static inline struct hstate *hstate_sizelog(int page_size_log)
{
	if (!page_size_log)
		return &default_hstate;

	return size_to_hstate(1UL << page_size_log);
}

static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{
	return hstate_file(vma->vm_file);
}

static inline unsigned long huge_page_size(struct hstate *h)
{
	return (unsigned long)PAGE_SIZE << h->order;
}

extern unsigned long vma_kernel_pagesize(struct vm_area_struct *vma);

extern unsigned long vma_mmu_pagesize(struct vm_area_struct *vma);

static inline unsigned long huge_page_mask(struct hstate *h)
{
	return h->mask;
}

static inline unsigned int huge_page_order(struct hstate *h)
{
	return h->order;
}

static inline unsigned huge_page_shift(struct hstate *h)
{
	return h->order + PAGE_SHIFT;
}

static inline bool hstate_is_gigantic(struct hstate *h)
{
	return huge_page_order(h) >= MAX_ORDER;
}

static inline unsigned int pages_per_huge_page(struct hstate *h)
{
	return 1 << h->order;
}

static inline unsigned int blocks_per_huge_page(struct hstate *h)
{
	return huge_page_size(h) / 512;
}
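
A worked example, assuming a 2MB hstate on a kernel with 4K base pages
(the numbers are illustrative, not asserted by the header):

/*
 * For h->order == 9 and PAGE_SHIFT == 12:
 *	huge_page_size(h)	== 4096 << 9	== 2MB
 *	huge_page_shift(h)	== 9 + 12	== 21
 *	huge_page_mask(h)	== ~(2MB - 1)
 *	pages_per_huge_page(h)	== 1 << 9	== 512
 *	blocks_per_huge_page(h)	== 2MB / 512	== 4096 512-byte blocks
 */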

#include <asm/hugetlb.h>

#ifndef is_hugepage_only_range
static inline int is_hugepage_only_range(struct mm_struct *mm,
					unsigned long addr, unsigned long len)
{
	return 0;
}
#define is_hugepage_only_range is_hugepage_only_range
#endif

#ifndef arch_clear_hugepage_flags
static inline void arch_clear_hugepage_flags(struct page *page) { }
#define arch_clear_hugepage_flags arch_clear_hugepage_flags
#endif

mm/hugetlb: change parameters of arch_make_huge_pte()
Patch series "Implement huge VMAP and VMALLOC on powerpc 8xx", v2.
This series implements huge VMAP and VMALLOC on powerpc 8xx.
Powerpc 8xx has 4 page sizes:
- 4k
- 16k
- 512k
- 8M
At the time being, vmalloc and vmap only support huge pages which are
leaf at PMD level.
Here the PMD level is 4M; it doesn't correspond to any supported
page size.
For now, implement use of 16k and 512k pages, which is done
at PTE level.
Support of 8M pages will be implemented later; it requires use of
hugepd tables.
To allow this, the architecture provides two functions:
- arch_vmap_pte_range_map_size() which tells vmap_pte_range() what
page size to use.  A stub returning PAGE_SIZE is provided when the
architecture doesn't provide this function.
- arch_vmap_pte_supported_shift() which tells __vmalloc_node_range()
what page shift to use for a given area size.  A stub returning
PAGE_SHIFT is provided when the architecture doesn't provide this
function.
This patch (of 5):
At the time being, arch_make_huge_pte() has the following prototype:
pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
			 struct page *page, int writable);
vma is used to get the pages shift or size.
vma is also used on Sparc to get vm_flags.
page is not used.
writable is not used.
In order to use this function without a vma, replace vma by shift and
flags.  Also remove the unused parameters.
Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu
Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

#ifndef arch_make_huge_pte
static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
				       vm_flags_t flags)
{
	return pte_mkhuge(entry);
}
#endif

static inline struct hstate *page_hstate(struct page *page)
{
	VM_BUG_ON_PAGE(!PageHuge(page), page);
	return size_to_hstate(page_size(page));
}

static inline unsigned hstate_index_to_shift(unsigned index)
{
	return hstates[index].order + PAGE_SHIFT;
}

static inline int hstate_index(struct hstate *h)
{
	return h - hstates;
}

extern int dissolve_free_huge_page(struct page *page);
extern int dissolve_free_huge_pages(unsigned long start_pfn,
				unsigned long end_pfn);

#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
#ifndef arch_hugetlb_migration_supported
static inline bool arch_hugetlb_migration_supported(struct hstate *h)
{
	if ((huge_page_shift(h) == PMD_SHIFT) ||
		(huge_page_shift(h) == PUD_SHIFT) ||
			(huge_page_shift(h) == PGDIR_SHIFT))
		return true;
	else
		return false;
}
#endif
#else
static inline bool arch_hugetlb_migration_supported(struct hstate *h)
{
	return false;
}
#endif

static inline bool hugepage_migration_supported(struct hstate *h)
{
	return arch_hugetlb_migration_supported(h);
}

/*
 * Movability check is different as compared to migration check.
 * It determines whether or not a huge page should be placed on
 * movable zone or not. Movability of any huge page should be
 * required only if huge page size is supported for migration.
 * There won't be any reason for the huge page to be movable if
 * it is not migratable to start with. Also the size of the huge
 * page should be large enough to be placed under a movable zone
 * and still feasible enough to be migratable. Just the presence
 * in movable zone does not make the migration feasible.
 *
 * So even though large huge page sizes like the gigantic ones
 * are migratable they should not be movable because it's not
 * feasible to migrate them from movable zone.
 */
static inline bool hugepage_movable_supported(struct hstate *h)
{
	if (!hugepage_migration_supported(h))
		return false;

	if (hstate_is_gigantic(h))
		return false;

	return true;
}

/* Movability of hugepages depends on migration support. */
static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
	if (hugepage_movable_supported(h))
		return GFP_HIGHUSER_MOVABLE;
	else
		return GFP_HIGHUSER;
}

mm/migrate: introduce a standard migration target allocation function
There are some similar functions for migration target allocation.  Since
there is no fundamental difference, it's better to keep just one rather
than keeping all variants.  This patch implements base migration target
allocation function.  In the following patches, variants will be converted
to use this function.
Changes should be mechanical, but, unfortunately, there are some
differences.  First, some callers' nodemask is assigned to NULL since NULL
nodemask will be considered as all available nodes, that is,
&node_states[N_MEMORY].  Second, for hugetlb page allocation, gfp_mask is
redefined as regular hugetlb allocation gfp_mask plus __GFP_THISNODE if
user provided gfp_mask has it.  This is because a future caller of this
function requires this node constraint to be set.  Lastly, if provided
nodeid is NUMA_NO_NODE, nodeid is set up to the node where migration
source lives.  It helps to remove simple wrappers for setting up the
nodeid.
Note that PageHighmem() call in previous function is changed to open-code
"is_highmem_idx()" since it provides more readability.
[akpm@linux-foundation.org: tweak patch title, per Vlastimil]
[akpm@linux-foundation.org: fix typo in comment]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
{
	gfp_t modified_mask = htlb_alloc_mask(h);

	/* Some callers might want to enforce node */
	modified_mask |= (gfp_mask & __GFP_THISNODE);

	modified_mask |= (gfp_mask & __GFP_NOWARN);

	return modified_mask;
}
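
Usage sketch (the gfp choice is illustrative): a migration caller that
wants a node-bound allocation passes the bit through.

	gfp_t gfp = htlb_modify_alloc_mask(h, GFP_USER | __GFP_THISNODE);
	/*
	 * gfp == htlb_alloc_mask(h) | __GFP_THISNODE; only __GFP_THISNODE
	 * and __GFP_NOWARN survive the merge from the caller's mask.
	 */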

static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
					struct mm_struct *mm, pte_t *pte)
{
	if (huge_page_size(h) == PMD_SIZE)
		return pmd_lockptr(mm, (pmd_t *) pte);
	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
	return &mm->page_table_lock;
}

#ifndef hugepages_supported
/*
 * Some platforms decide whether they support huge pages at boot
 * time. Some of them, such as powerpc, set HPAGE_SHIFT to 0
 * when there is no such support.
 */
#define hugepages_supported() (HPAGE_SHIFT != 0)
#endif
hugetlb: ensure hugepage access is denied if hugepages are not supported
Currently, I am seeing the following when I `mount -t hugetlbfs /none
/dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
related to the fact that hugetlbfs is not correctly setting itself up in
this state?:
Unable to handle kernel paging request for data at address 0x00000031
Faulting instruction address: 0xc000000000245710
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
....
In KVM guests on Power, in a guest not backed by hugepages, we see the
following:
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:         64 kB
HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
are not supported at boot-time, but this is only checked in
hugetlb_init().  Extract the check to a helper function, and use it in a
few relevant places.
This does make hugetlbfs not supported (not registered at all) in this
environment.  I believe this is fine, as there are no valid hugepages
and that won't change at runtime.
[akpm@linux-foundation.org: use pr_info(), per Mel]
[akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm);

static inline void hugetlb_count_init(struct mm_struct *mm)
{
	atomic_long_set(&mm->hugetlb_usage, 0);
}

static inline void hugetlb_count_add(long l, struct mm_struct *mm)
{
	atomic_long_add(l, &mm->hugetlb_usage);
}

static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
{
	atomic_long_sub(l, &mm->hugetlb_usage);
}
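
Usage sketch mirroring how mm/hugetlb.c accounts mapped huge pages
(shape only; locals are illustrative):

	hugetlb_count_add(pages_per_huge_page(h), mm);	/* on map */
	/* ... */
	hugetlb_count_sub(pages_per_huge_page(h), mm);	/* on unmap */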
|
2017-07-07 01:39:50 +03:00
|
|
|
|
|
|
|
#ifndef set_huge_swap_pte_at
|
|
|
|
static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
|
|
|
|
pte_t *ptep, pte_t pte, unsigned long sz)
|
|
|
|
{
|
|
|
|
set_huge_pte_at(mm, addr, ptep, pte);
|
|
|
|
}
|
|
|
|
#endif

#ifndef huge_ptep_modify_prot_start
#define huge_ptep_modify_prot_start huge_ptep_modify_prot_start
static inline pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma,
						unsigned long addr, pte_t *ptep)
{
	return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
}
#endif

#ifndef huge_ptep_modify_prot_commit
#define huge_ptep_modify_prot_commit huge_ptep_modify_prot_commit
static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
						unsigned long addr, pte_t *ptep,
						pte_t old_pte, pte_t pte)
{
	set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
}
#endif
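
A sketch of the intended start/commit protocol (huge_pte_modify is
assumed to come from the asm-generic hugetlb helpers; locals are
illustrative):

	/* Sketch: change protection on one huge PTE under its lock. */
	ptl = huge_pte_lock(h, vma->vm_mm, ptep);
	old_pte = huge_ptep_modify_prot_start(vma, addr, ptep);
	pte = huge_pte_modify(old_pte, newprot);
	huge_ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
	spin_unlock(ptl);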

#else	/* CONFIG_HUGETLB_PAGE */
struct hstate {};

static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
{
	return NULL;
}

mm: make alloc_contig_range handle free hugetlb pages
alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contig_range,
e.g: CMA and virtio-mem, where those would fail the call even when those
pages lay in ZONE_MOVABLE and are free.
We can do better by trying to replace such page.
Free hugepages are tricky to handle so that no userspace application
notices disruption: we need to replace the current free hugepage with a
new one.
In order to do that, a new function called alloc_and_dissolve_huge_page is
introduced.  This function will first try to get a new fresh hugepage, and
if it succeeds, it will replace the old one in the free hugepage pool.
The free page replacement is done under hugetlb_lock, so no external users
of hugetlb will notice the change.  To allocate the new huge page, we use
alloc_buddy_huge_page(), so we do not have to deal with any counters, and
prep_new_huge_page() is not called.  This is valuable because in case we
need to free the new page, we only need to call __free_pages().
Once we know that the page to be replaced is a genuine 0-refcounted huge
page, we remove the old page from the freelist by remove_hugetlb_page().
Then, we can call __prep_new_huge_page() and
__prep_account_new_huge_page() for the new huge page to properly
initialize it and increment the hstate->nr_huge_pages counter (previously
decremented by remove_hugetlb_page()).  Once done, the page is enqueued by
enqueue_huge_page() and it is ready to be used.
There is one tricky case when a page's refcount is 0 because it is in the
process of being released.  A missing PageHugeFreed bit will tell us that
freeing is in flight so we retry after dropping the hugetlb_lock.  The
race window should be small and the next retry should make a forward
progress.
E.g:
CPU0				CPU1
free_huge_page()		isolate_or_dissolve_huge_page
				  PageHuge() == T
				  alloc_and_dissolve_huge_page
				    alloc_buddy_huge_page()
				    spin_lock_irq(hugetlb_lock)
				    // PageHuge() && !PageHugeFreed &&
				    // !PageCount()
				    spin_unlock_irq(hugetlb_lock)
  spin_lock_irq(hugetlb_lock)
  1) update_and_free_page
       PageHuge() == F
       __free_pages()
  2) enqueue_huge_page
       SetPageHugeFreed()
  spin_unlock_irq(&hugetlb_lock)
				  spin_lock_irq(hugetlb_lock)
				  1) PageHuge() == F (freed by case#1 from CPU0)
				  2) PageHuge() == T
				     PageHugeFreed() == T
				     - proceed with replacing the page
In the case above we retry as the race window is quite small and we have
high chances to succeed next time.
With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.
Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

static inline int isolate_or_dissolve_huge_page(struct page *page,
						struct list_head *list)
{
	return -ENOMEM;
}
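
Caller-side sketch of how an alloc_contig_range-style user might invoke
this (locals and error handling are illustrative):

	if (PageHuge(page)) {
		/* Replace or dissolve the free huge page in the range. */
		ret = isolate_or_dissolve_huge_page(page, &isolate_list);
		if (ret)
			return ret;	/* range cannot be made contiguous */
	}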

static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
					   unsigned long addr,
					   int avoid_reserve)
{
	return NULL;
}

static inline struct page *
alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
			nodemask_t *nmask, gfp_t gfp_mask)
{
	return NULL;
}

static inline struct page *alloc_huge_page_vma(struct hstate *h,
					       struct vm_area_struct *vma,
					       unsigned long address)
{
	return NULL;
}

static inline int __alloc_bootmem_huge_page(struct hstate *h)
{
	return 0;
}

static inline struct hstate *hstate_file(struct file *f)
{
	return NULL;
}

static inline struct hstate *hstate_sizelog(int page_size_log)
{
	return NULL;
}

static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{
	return NULL;
}

static inline struct hstate *page_hstate(struct page *page)
{
	return NULL;
}

static inline struct hstate *size_to_hstate(unsigned long size)
{
	return NULL;
}

static inline unsigned long huge_page_size(struct hstate *h)
{
	return PAGE_SIZE;
}

static inline unsigned long huge_page_mask(struct hstate *h)
{
	return PAGE_MASK;
}

static inline unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
{
	return PAGE_SIZE;
}

static inline unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
{
	return PAGE_SIZE;
}

static inline unsigned int huge_page_order(struct hstate *h)
{
	return 0;
}

static inline unsigned int huge_page_shift(struct hstate *h)
{
	return PAGE_SHIFT;
}

static inline bool hstate_is_gigantic(struct hstate *h)
{
	return false;
}

static inline unsigned int pages_per_huge_page(struct hstate *h)
{
	return 1;
}

static inline unsigned hstate_index_to_shift(unsigned index)
{
	return 0;
}

static inline int hstate_index(struct hstate *h)
{
	return 0;
}

static inline int dissolve_free_huge_page(struct page *page)
{
	return 0;
}

static inline int dissolve_free_huge_pages(unsigned long start_pfn,
					   unsigned long end_pfn)
{
	return 0;
}

static inline bool hugepage_migration_supported(struct hstate *h)
{
	return false;
}

static inline bool hugepage_movable_supported(struct hstate *h)
{
	return false;
}

static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
	return 0;
}

static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
{
	return 0;
}

static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
					   struct mm_struct *mm, pte_t *pte)
{
	return &mm->page_table_lock;
}

static inline void hugetlb_count_init(struct mm_struct *mm)
{
}

static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
{
}

static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
{
}

static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
					pte_t *ptep, pte_t pte, unsigned long sz)
{
}
#endif	/* CONFIG_HUGETLB_PAGE */

static inline spinlock_t *huge_pte_lock(struct hstate *h,
					struct mm_struct *mm, pte_t *pte)
{
	spinlock_t *ptl;

	ptl = huge_pte_lockptr(h, mm, pte);
	spin_lock(ptl);
	return ptl;
}
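
Typical caller pattern, sketched (huge_ptep_get is assumed from the arch
hugetlb helpers; locals are illustrative):

	ptl = huge_pte_lock(h, mm, ptep);
	pte = huge_ptep_get(ptep);
	/* ... examine or update the huge PTE under its split lock ... */
	spin_unlock(ptl);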

mm: hugetlb: optionally allocate gigantic hugepages using cma
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
at runtime") has added the run-time allocation of gigantic pages.
However it actually works only at early stages of the system loading,
when the majority of memory is free.  After some time the memory gets
fragmented by non-movable pages, so the chances to find a contiguous 1GB
block are getting close to zero.  Even dropping caches manually doesn't
help a lot.
At large scale, rebooting servers in order to allocate gigantic hugepages
is quite expensive and complex.  At the same time, keeping some constant
percentage of memory in reserved hugepages even if the workload isn't
using it is a big waste: not all workloads can benefit from using 1 GB
pages.
The following solution can solve the problem:
1) On boot time a dedicated cma area* is reserved.  The size is passed
as a kernel argument.
2) Run-time allocations of gigantic hugepages are performed using the
cma allocator and the dedicated cma area
In this case gigantic hugepages can be allocated successfully with a
high probability, however the memory isn't completely wasted if nobody
is using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
etc.
* On a multi-node machine a per-node cma area is allocated on each node.
Subsequent gigantic hugetlb allocations use the first available numa
node if the mask isn't specified by the user.
Usage:
1) configure the kernel to allocate a cma area for hugetlb allocations:
pass hugetlb_cma=10G as a kernel argument
2) allocate hugetlb pages as usual, e.g.
echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
If the option isn't enabled or the allocation of the cma area failed,
the current behavior of the system is preserved.
x86 and arm-64 are covered by this patch, other architectures can be
trivially added later.
The patch contains clean-ups and fixes proposed and implemented by Aslan
Bakirov and Randy Dunlap.  It also contains ideas and suggestions
proposed by Rik van Riel, Michal Hocko and Mike Kravetz.  Thanks!
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Andreas Schaufler <andreas.schaufler@gmx.de>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Aslan Bakirov <aslan@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
extern void __init hugetlb_cma_reserve(int order);
extern void __init hugetlb_cma_check(void);
#else
static inline __init void hugetlb_cma_reserve(int order)
{
}
static inline __init void hugetlb_cma_check(void)
{
}
#endif

bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);

#ifndef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
/*
 * ARCHes with special requirements for evicting HUGETLB backing TLB entries can
 * implement this.
 */
#define flush_hugetlb_tlb_range(vma, addr, end)	flush_tlb_range(vma, addr, end)
#endif

#endif /* _LINUX_HUGETLB_H */