wakeup_flusher_threads() was added under the assumption that if a system
runs out of clean cold pages, it might want to write back dirty pages more
aggressively so that they can become clean and be dropped.
However, doing so can breach the rate limit a system wants to impose on
writeback, resulting in early SSD wearout.
Link: https://lkml.kernel.org/r/YzSiWq9UEER5LKup@google.com
Fixes: bd74fdaea1 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Christoph:
- handle number of queue changes in the TCP and RDMA drivers
(Daniel Wagner)
- allow changing the number of queues in nvmet (Daniel Wagner)
- also consider host_iface when checking ip options (Daniel
Wagner)
- don't map pages which can't come from HIGHMEM (Fabio M. De
Francesco)
- avoid unnecessary flush bios in nvmet (Guixin Liu)
- shrink and better pack the nvme_iod structure (Keith Busch)
- add comment for unaligned "fake" nqn (Linjun Bao)
- print actual source IP address through sysfs "address" attr
(Martin Belanger)
- various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
- handle effects after freeing the request (Keith Busch)
- copy firmware_rev on each init (Keith Busch)
- restrict management ioctls to admin (Keith Busch)
- ensure subsystem reset is single threaded (Keith Busch)
- report the actual number of tagset maps in nvme-pci (Keith
Busch)
- small fabrics authentication fixups (Christoph Hellwig)
- add common code for tagset allocation and freeing (Christoph
Hellwig)
- stop using the request_queue in nvmet (Christoph Hellwig)
- set min_align_mask before calculating max_hw_sectors (Rishabh
Bhatnagar)
- send a rediscover uevent when a persistent discovery controller
reconnects (Sagi Grimberg)
- misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)
- MD pull request via Song:
- Various raid5 fixes and cleanups, by Logan Gunthorpe and David
Sloan.
- Raid10 performance optimization, by Yu Kuai.
- sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu)
- IO scheduler switching quiesce fix (Keith)
- s390/dasd block driver updates (Stefan)
- support for recovery for the ublk driver (ZiyangZhang)
- rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph)
- blk-mq and null_blk map fixes (Bart)
- various bcache fixes (Coly, Jilin, Jules)
- nbd signal hang fix (Shigeru)
- block writeback throttling fix (Yu)
- optimize the passthrough mapping handling (me)
- prepare block cgroups to being gendisk based (Christoph)
- get rid of an old PSI hack in the block layer, moving it to the
callers instead where it belongs (Christoph)
- blk-throttle fixes and cleanups (Yu)
- misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj,
Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming,
Miaohe, Bart, Coly, Gaosheng)
* tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits)
sbitmap: fix lockup while swapping
block: add rationale for not using blk_mq_plug() when applicable
block: adapt blk_mq_plug() to not plug for writes that require a zone lock
s390/dasd: use blk_mq_alloc_disk
blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
nvmet: don't look at the request_queue in nvmet_bdev_set_limits
nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
blk-mq: use quiesced elevator switch when reinitializing queues
block: replace blk_queue_nowait with bdev_nowait
nvme: remove nvme_ctrl_init_connect_q
nvme-loop: use the tagset alloc/free helpers
nvme-loop: store the generic nvme_ctrl in set->driver_data
nvme-loop: initialize sqsize later
nvme-fc: use the tagset alloc/free helpers
nvme-fc: store the generic nvme_ctrl in set->driver_data
nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
nvme-rdma: use the tagset alloc/free helpers
nvme-rdma: store the generic nvme_ctrl in set->driver_data
nvme-tcp: use the tagset alloc/free helpers
nvme-tcp: store the generic nvme_ctrl in set->driver_data
...
Merge tag 'for-6.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There's a bunch of performance improvements, most notably the FIEMAP
speedup, the new block group tree to speed up mount on large
filesystems, more io_uring integration, some sysfs exports and the
usual fixes and core updates.
Summary:
Performance:
- outstanding FIEMAP speed improvement
- algorithmic change in how extents are enumerated leads to orders of
magnitude speed boost (uncached and cached)
- extent sharing check speedup (2.2x uncached, 3x cached)
- add more cancellation points, allowing to interrupt seeking in
files with large number of extents
- more efficient hole and data seeking (4x uncached, 1.3x cached)
- sample results:
256M, 32K extents: 4s -> 29ms (~150x)
512M, 64K extents: 30s -> 59ms (~550x)
1G, 128K extents: 225s -> 120ms (~1800x)
- improved inode logging, especially for directories (on dbench
workload throughput +25%, max latency -21%)
- improved buffered IO, remove redundant extent state tracking,
lowering memory consumption and avoiding rb tree traversal
- add sysfs tunable to let qgroup temporarily skip exact accounting
when deleting snapshot, leading to a speedup but requiring a rescan
after that, will be used by snapper
- support io_uring and buffered writes, until now it was just for
direct IO, with the no-wait semantics implemented in the buffered
write path it now works and leads to speed improvement in IOPS
(2x), throughput (2.2x), latency (depends, 2x to 150x)
- small performance improvements when dropping and searching for
extent maps as well as when flushing delalloc in COW mode
(throughput +5MB/s)
User visible changes:
- new incompatible feature block-group-tree adding a dedicated tree
for tracking block groups, this allows a much faster load during
mount and avoids seeking unlike when it's scattered in the extent
tree items
- this reduces mount time for many-terabyte sized filesystems
- conversion tool will be provided so existing filesystems can also
be updated in place
- to reduce test matrix and feature combinations requires no-holes
and free-space-tree (mkfs defaults since 5.15)
- improved reporting of super block corruption detected by scrub
- scrub also tries to repair super block and does not wait until next
commit
- discard stats and tunables are exported in sysfs
(/sys/fs/btrfs/FSID/discard)
- qgroup status is exported in sysfs
(/sys/fs/btrfs/FSID/qgroups/)
- verify that super block was not modified when thawing filesystem
Fixes:
- FIEMAP fixes
- fix extent sharing status, does not depend on the cached status
where merged
- flush delalloc so compressed extents are reported correctly
- fix alignment of VMA for memory mapped files on THP
- send: fix failures when processing inodes with no links (orphan
files and directories)
- fix race between quota enable and quota rescan ioctl
- handle more corner cases for read-only compat feature verification
- fix missed extent on fsync after dropping extent maps
Core:
- lockdep annotations to validate various transactions states and
state transitions
- preliminary support for fs-verity in send
- more effective memory use in scrub for subpage where sector is
smaller than page
- block group caching progress logic has been removed, load is now
synchronous
- simplify end IO callbacks and bio handling, use chained bios
instead of own tracking
- add no-wait semantics to several functions (tree search, nocow,
flushing, buffered write)
- cleanups and refactoring
MM changes:
- export balance_dirty_pages_ratelimited_flags"
* tag 'for-6.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (177 commits)
btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer
btrfs: drop extent map range more efficiently
btrfs: avoid pointless extent map tree search when flushing delalloc
btrfs: remove unnecessary next extent map search
btrfs: remove unnecessary NULL pointer checks when searching extent maps
btrfs: assert tree is locked when clearing extent map from logging
btrfs: remove unnecessary extent map initializations
btrfs: remove the refcount warning/check at free_extent_map()
btrfs: add helper to replace extent map range with a new extent map
btrfs: move open coded extent map tree deletion out of inode eviction
btrfs: use cond_resched_rwlock_write() during inode eviction
btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
btrfs: move btrfs_drop_extent_cache() to extent_map.c
btrfs: fix missed extent on fsync after dropping extent maps
btrfs: remove stale prototype of btrfs_write_inode
btrfs: enable nowait async buffered writes
btrfs: assert nowait mode is not used for some btree search functions
btrfs: make btrfs_buffered_write nowait compatible
btrfs: plumb NOWAIT through the write path
btrfs: make lock_and_cleanup_extent_if_need nowait compatible
...
Since 2d1c498072 ("mm: memcontrol: make swap tracking an integral part
of memory control"), CONFIG_MEMCG_SWAP hasn't been a user-visible config
option anymore; it just means CONFIG_MEMCG && CONFIG_SWAP.
Update the sites accordingly and drop the symbol.
[ While touching the docs, remove two references to CONFIG_MEMCG_KMEM,
which hasn't been a user-visible symbol for over half a decade. ]
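As a minimal illustration of what this substitution looks like at a usage
site (the declaration shown here is hypothetical, not a hunk from this patch):

  /* before */
  #ifdef CONFIG_MEMCG_SWAP
  extern void memcg_swap_example(void);   /* hypothetical declaration */
  #endif

  /* after */
  #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
  extern void memcg_swap_example(void);   /* hypothetical declaration */
  #endif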
Link: https://lkml.kernel.org/r/20220926135704.400818-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It's slightly more descriptive and consistent with other places that
distinguish cgroup1's combined memory+swap accounting scheme from
cgroup2's dedicated swap accounting.
Link: https://lkml.kernel.org/r/20220926135704.400818-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The swapaccounting= commandline option already does very little today. To
close a trivial containment failure case, the swap ownership tracking part
of the swap controller has recently become mandatory (see commit
2d1c498072 ("mm: memcontrol: make swap tracking an integral part of
memory control") for details), which makes up the majority of the work
during swapout, swapin, and the swap slot map.
The only thing left under this flag is the page_counter operations and the
visibility of the swap control files in the first place, which are rather
meager savings. There also aren't many scenarios, if any, where
controlling the memory of a cgroup while allowing it unlimited access to a
global swap space is a workable resource isolation strategy.
On the other hand, there have been several bugs and confusion around the
many possible swap controller states (cgroup1 vs cgroup2 behavior, memory
accounting without swap accounting, memcg runtime disabled).
This puts the maintenance overhead of retaining the toggle above its
practical benefits. Deprecate it.
Link: https://lkml.kernel.org/r/20220926135704.400818-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "memcg swap fix & cleanups".
This patch (of 4):
Since commit 2d1c498072 ("mm: memcontrol: make swap tracking an integral
part of memory control"), the cgroup swap arrays are used to track memory
ownership at the time of swap readahead and swapoff, even if swap space
*accounting* has been turned off by the user via swapaccount=0 (which sets
cgroup_memory_noswap).
However, the patch was overzealous: by simply dropping the
cgroup_memory_noswap conditionals in the swapon, swapoff and uncharge
path, it caused the cgroup arrays being allocated even when the memory
controller as a whole is disabled. This is a waste of that memory.
Restore mem_cgroup_disabled() checks, implied previously by
cgroup_memory_noswap, in the swapon, swapoff, and swap_entry_free
callbacks.
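A minimal sketch of the restored pattern, using the swapon callback as the
example (the other callbacks follow the same shape; the exact bodies in the
tree differ):

  int swap_cgroup_swapon(int type, unsigned long max_pages)
  {
          /* With memcg disabled there is nothing to track; skip the allocation. */
          if (mem_cgroup_disabled())
                  return 0;
          /* ... allocate and initialize the per-device swap_cgroup array ... */
          return 0;
  }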
Link: https://lkml.kernel.org/r/20220926135704.400818-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20220926135704.400818-2-hannes@cmpxchg.org
Fixes: 2d1c498072 ("mm: memcontrol: make swap tracking an integral part of memory control")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In hugetlb.c there are several places which compare the values of
'h->free_huge_pages' and 'h->resv_huge_pages'; it looks a bit messy, so
add a new available_huge_pages() function to do this.
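A plausible shape for the new helper, given the comparison it replaces (a
sketch; the in-tree version may differ in detail):

  static inline bool available_huge_pages(struct hstate *h)
  {
          return h->free_huge_pages - h->resv_huge_pages > 0;
  }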
Link: https://lkml.kernel.org/r/20220922021929.98961-1-xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to
hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd().
While this change is targeted at debugging MADV_COLLAPSE pathway, the
"mm_khugepaged" prefix is retained for symmetry with
huge_memory:trace_mm_khugepaged_scan_pmd, which retains its legacy name
to prevent changing kernel ABI as much as possible.
Link: https://lkml.kernel.org/r/20220907144521.3115321-5-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220922224046.1143204-5-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
On success, the backing memory will be a hugepage. For the memory range
and process provided, the page tables will synchronously have a huge pmd
installed, mapping the THP. Other mappings of the file extent mapped by
the memory range may be added to a set of entries that khugepaged will
later process and attempt to update their page tables to map the THP by a
pmd.
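From userspace, the request is an ordinary madvise(2) call; a small
self-contained example (the fallback #define of MADV_COLLAPSE is an
assumption for uapi headers that predate the feature):

  #include <stdio.h>
  #include <sys/mman.h>

  #ifndef MADV_COLLAPSE
  #define MADV_COLLAPSE 25        /* assumed value from newer uapi headers */
  #endif

  /* Ask the kernel to synchronously back [addr, addr + len) with pmd-mapped THPs. */
  static int collapse_range(void *addr, size_t len)
  {
          if (madvise(addr, len, MADV_COLLAPSE)) {
                  perror("madvise(MADV_COLLAPSE)");
                  return -1;
          }
          return 0;
  }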
This functionality unlocks two important uses:
(1) Immediately back executable text by THPs. Current support provided
by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
system, which might prevent services from serving at their full rated
load after (re)starting. Tricks like mremap(2)'ing text onto
anonymous memory to immediately realize iTLB performance prevent
page sharing and demand paging, both of which increase steady state
memory footprint. Now, we can have the best of both worlds: Peak
upfront performance and lower RAM footprints.
(2) userfaultfd-based live migration of virtual machines satisfies UFFD
faults by fetching native-sized pages over the network (to avoid
latency of transferring an entire hugepage). However, after guest
memory has been fully copied to the new host, MADV_COLLAPSE can
be used to immediately increase guest performance.
Since khugepaged is single threaded, this change now introduces the
possibility of collapse contexts racing in the file collapse path. There
are a few important places to consider:
(1) hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
We could have the memory collapsed out from under us, but
the next xas_for_each() iteration will correctly pick up the
hugepage. The hugepage might not be up to date (insofar as
copying of small page contents might not have completed - the
page still may be locked), but regardless what small page index
we were iterating over, we'll find the hugepage and identify it
as a suitably aligned compound page of order HPAGE_PMD_ORDER.
In the khugepaged path, we locklessly check the value of the pmd,
and only add it to the deferred collapse array if we find a pmd
mapping a pte table. This is fine, since other values that could
have raced in right afterwards denote failure, or that the
memory was successfully collapsed, so we don't need further
processing.
In the madvise path, we'll take mmap_lock in write mode to serialize
against page table updates and will know what to do based on the
true value of the pmd: recheck all ptes if we point to a pte table,
directly install the pmd if the pmd has been cleared but the memory
has not yet been faulted, or do nothing at all if we find a huge pmd.
It's worth putting emphasis here on how we treat the none pmd.
If khugepaged has processed this mm's page tables
already, it will have left the pmd cleared (ready for refault by
the process). Depending on the VMA flags and sysfs settings, the
amount of RAM on the machine, and the current load, this could be a
relatively common occurrence - and as such is one we'd like to
handle successfully in MADV_COLLAPSE. When we see the none pmd
in collapse_pte_mapped_thp(), we've locked mmap_lock in write
and checked (a) hugepage_vma_check() to see if the backing
memory is appropriate still, along with VMA sizing and
appropriate hugepage alignment within the file, and (b) we've
found a hugepage head of order HPAGE_PMD_ORDER at the offset
in the file mapped by our hugepage-aligned virtual address.
Even though the common case is likely a race with khugepaged,
given these checks (regardless how we got here - we could be
operating on a completely different file than originally checked
in hpage_collapse_scan_file() for all we know) it should be safe
to directly make the pmd a huge pmd pointing to this hugepage.
(2) collapse_file() is mostly serialized on the same file extent by
lock sequence:
| lock hugepage
| lock mapping->i_pages
| lock 1st page
| unlock mapping->i_pages
| <page checks>
| lock mapping->i_pages
| page_ref_freeze(3)
| xas_store(hugepage)
| unlock mapping->i_pages
| page_ref_unfreeze(1)
| unlock 1st page
V unlock hugepage
Once a context (who already has their fresh hugepage locked)
locks mapping->i_pages exclusively, it will hold said lock
until it locks the first page, and it will hold that lock until
after the hugepage has been added to the page cache (and
will unlock the hugepage after page table update, though that
isn't important here).
A racing context that loses the race for mapping->i_pages will
then lose the race to locking the first page. Here - depending
on how far the other racing context has gotten - we might find
the new hugepage (in which case we'll exit cleanly when we
check PageTransCompound()), or we'll find the "old" 1st small
page (in which case we'll exit cleanly when we discover an unexpected
refcount of 2 after isolate_lru_page()). This is assuming we
are able to successfully lock the page we find - in shmem path,
we could just fail the trylock and exit cleanly anyways.
Failure path in collapse_file() is similar: once we hold lock
on 1st small page, we are serialized against other collapse
contexts. Before the 1st small page is unlocked, we add it
back to the pagecache and unfreeze the refcount appropriately.
Contexts who lost the race to the 1st small page will then find
the same 1st small page with the correct refcount and will be
able to proceed.
[zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()]
Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com
[shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove
check for multi-add in khugepaged_add_pte_mapped_thp()]
Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The main benefit of THPs is that they can be mapped at the pmd level,
increasing the likelihood of TLB hit and spending less cycles in page
table walks. pte-mapped hugepages - that is - hugepage-aligned compound
pages of order HPAGE_PMD_ORDER mapped by ptes - although being contiguous
in physical memory, don't have this advantage. In fact, one could argue
they are detrimental to system performance overall since they occupy a
precious hugepage-aligned/sized region of physical memory that could
otherwise be used more effectively. Additionally, pte-mapped hugepages
can be the cheapest memory to collapse for khugepaged since no new
hugepage allocation or copying of memory contents is necessary - we only
need to update the mapping page tables.
In the anonymous collapse path, we are able to collapse pte-mapped
hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
effort when compound pages (of any order) are encountered.
Identify pte-mapped hugepages in the file/shmem collapse path. The
final step makes a racy check of the value of the pmd to ensure it
maps a pte table. This should be fine, since races that
result in false-positive (i.e. attempt collapse even though we
shouldn't) will fail later in collapse_pte_mapped_thp() once we
actually lock mmap_lock and reinspect the pmd value. Races that result
in false-negatives (i.e. where we decide to not attempt collapse, but
should have) shouldn't be an issue, since in the worst case, we do
nothing - which is what we've done up to this point. We make a similar
check in retract_page_tables(). If we do think we've found a
pte-mapped hugepage in khugepaged context, attempt to update page
tables mapping this hugepage.
Note that these collapses still count towards the
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter,
and if the pte-mapped hugepage was also mapped into multiple processes'
address spaces, the counter could be incremented for each page table update. Since we
increment the counter when a pte-mapped hugepage is successfully added to
the list of to-collapse pte-mapped THPs, it's possible that we never
actually update the page table either. This is different from how
file/shmem pages_collapsed accounting works today where only a successful
page cache update is counted (it's also possible here that no page tables
are actually changed). Though it incurs some slop, this is preferred to
either not accounting for the event at all, or plumbing through data in
struct mm_slot on whether to account for the collapse or not.
Also note that work still needs to be done to support arbitrary compound
pages, and that this should all be converted to using folios.
[shy828301@gmail.com: Spelling mistake, update comment, and add Documentation]
Link: https://lore.kernel.org/linux-mm/CAHbLzkpHwZxFzjfX9nxVoRhzup8WMjMfyL6Xiq8mZ9M-N3ombw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220907144521.3115321-3-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220922224046.1143204-3-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: add file/shmem support to MADV_COLLAPSE", v4.
This series builds on top of the previous "mm: userspace hugepage
collapse" series which introduced the MADV_COLLAPSE madvise mode and added
support for private, anonymous mappings[2], by adding support for file and
shmem backed memory to CONFIG_READ_ONLY_THP_FOR_FS=y kernels.
File and shmem support have been added with effort to align with existing
MADV_COLLAPSE semantics and policy decisions[3]. Collapse of shmem-backed
memory ignores kernel-guiding directives and heuristics including all
sysfs settings (transparent_hugepage/shmem_enabled), and tmpfs huge= mount
options (shmem always supports large folios). Like anonymous mappings, on
successful return of MADV_COLLAPSE on file/shmem memory, the contents of
memory mapped by the addresses provided will be synchronously pmd-mapped
THPs.
This functionality unlocks two important uses:
(1) Immediately back executable text by THPs. Current support provided
by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
system, which might prevent services from serving at their full rated
load after (re)starting. Tricks like mremap(2)'ing text onto
anonymous memory to immediately realize iTLB performance prevent
page sharing and demand paging, both of which increase steady state
memory footprint. Now, we can have the best of both worlds: Peak
upfront performance and lower RAM footprints.
(2) userfaultfd-based live migration of virtual machines satisfies UFFD
faults by fetching native-sized pages over the network (to avoid
latency of transferring an entire hugepage). However, after guest
memory has been fully copied to the new host, MADV_COLLAPSE can
be used to immediately increase guest performance.
khugepaged has received a small improvement by association and can now
detect and collapse pte-mapped THPs. However, there is still work to be
done along the file collapse path. Compound pages of arbitrary order
still need to be supported and THP collapse needs to be converted to
using folios in general. Eventually, we'd like to move away from the
read-only and executable-mapped constraints currently imposed on eligible
files and support any inode claiming huge folio support. That said, I
think the series as-is covers enough to claim that MADV_COLLAPSE supports
file/shmem memory.
Patches 1-3 Implement the guts of the series.
Patch 4 Is a tracepoint for debugging.
Patches 5-9 Refactor existing khugepaged selftests to work with new
memory types + new collapse tests.
Patch 10 Adds a userfaultfd selftest mode to mimic a functional test
of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration.
(v4 note: "userfaultfd shmem" selftest is failing as of
Sep 22 mm-unstable)
[1] https://lore.kernel.org/linux-mm/YyiK8YvVcrtZo0z3@google.com/
[2] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/
[3] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@google.com/
[4] https://lore.kernel.org/linux-mm/20220922222731.1124481-1-zokeefe@google.com/
[5] https://lore.kernel.org/linux-mm/20220922184651.1016461-1-zokeefe@google.com/
This patch (of 10):
Extend 'mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()' to
shmem, allowing callers to ignore
/sys/kernel/mm/transparent_hugepage/shmem_enabled and the tmpfs huge= mount
option.
This is intended to be used by MADV_COLLAPSE, and the rationale is
analogous to the anon/file case: MADV_COLLAPSE is not coupled to
directives that advise the kernel's decisions on when THPs should be
considered eligible. shmem/tmpfs always claims large folio support,
regardless of sysfs or mount options.
[shy828301@gmail.com: test shmem_huge_force explicitly]
Link: https://lore.kernel.org/linux-mm/CAHbLzko3A5-TpS0BgBeKkx5cuOkWgLvWXQH=TdgW-baO4rPtdg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220922224046.1143204-1-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220907144521.3115321-2-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220922224046.1143204-2-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MADV_COLLAPSE is a best-effort request that attempts to set an actionable
errno value if the request cannot be fulfilled at the time. EAGAIN should
be used to communicate that a resource was temporarily unavailable, but
that the user may try again immediately.
SCAN_DEL_PAGE_LRU is an internal result code used when a page cannot be
isolated from its LRU list. Since this, like SCAN_PAGE_LRU, is likely a
transitory state, make MADV_COLLAPSE return EAGAIN so that users know they
may reattempt the operation.
Another important scenario to consider is race with khugepaged.
khugepaged might isolate a page while MADV_COLLAPSE is interested in it.
Even though racing with khugepaged might mean that the memory has already
been collapsed, signalling an errno that is non-intrinsic to that memory
or arguments provided to madvise(2) lets the user know that future
attempts might (and in this case likely would) succeed, and avoids
false-negative assumptions by the user.
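Callers can thus treat EAGAIN as "retry immediately"; a hedged sketch of such
a loop (the MADV_COLLAPSE value is an assumption for older uapi headers):

  #include <errno.h>
  #include <sys/mman.h>

  #ifndef MADV_COLLAPSE
  #define MADV_COLLAPSE 25        /* assumed value for older uapi headers */
  #endif

  /* Retry MADV_COLLAPSE while the failure looks transient. */
  static int collapse_with_retry(void *addr, size_t len, int max_tries)
  {
          while (max_tries-- > 0) {
                  if (!madvise(addr, len, MADV_COLLAPSE))
                          return 0;
                  if (errno != EAGAIN)
                          break;          /* non-transient failure; give up */
          }
          return -1;
  }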
Link: https://lkml.kernel.org/r/20220922184651.1016461-1-zokeefe@google.com
Fixes: 7d8faaf155 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
By the time we lock a page in collapse_pte_mapped_thp(), the page mapped
by the address pushed onto the slot's .pte_mapped_thp[] array might have
changed arbitrarily since we last looked at it. We revalidate that the
page is still the head of a compound page, but we don't revalidate if the
compound page is of order HPAGE_PMD_ORDER before applying rmap and page
table updates.
Since the kernel now supports large folios of arbitrary order, and since
replacing a page's pte mappings with a pmd mapping only makes sense for
compound pages of order HPAGE_PMD_ORDER, revalidate that the compound
page is indeed of order HPAGE_PMD_ORDER before proceeding.
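A sketch of the added revalidation, assuming its placement inside
collapse_pte_mapped_thp() (error handling simplified):

  page = compound_head(page);
  if (compound_order(page) != HPAGE_PMD_ORDER)
          goto drop_hpage;        /* not a pmd-sized compound page; bail out */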
Link: https://lore.kernel.org/linux-mm/CAHbLzkon+2ky8v9ywGcsTUgXM_B35jt5NThYqQKXW2YV_GUacw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220922222731.1124481-1-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The vma_lock and hugetlb_fault_mutex are dropped before handling the
userfault and reacquired again after handle_userfault(), but reacquiring
the vma_lock could lead to a UAF [1,2] due to the following race:
hugetlb_fault
hugetlb_no_page
/*unlock vma_lock */
hugetlb_handle_userfault
handle_userfault
/* unlock mm->mmap_lock*/
vm_mmap_pgoff
do_mmap
mmap_region
munmap_vma_range
/* clean old vma */
/* lock vma_lock again <--- UAF */
/* unlock vma_lock */
Since the vma_lock will be unlocked immediately after
hugetlb_handle_userfault() anyway, drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/
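A rough sketch of the resulting tail of hugetlb_handle_userfault(), with the
locking helper names assumed from the 6.1-era hugetlb code:

          /* Drop both locks for good before handing the fault to userfaultfd. */
          hugetlb_vma_unlock_read(vma);
          mutex_unlock(&hugetlb_fault_mutex_table[hash]);
          return handle_userfault(&vmf, reason);
          /* No re-acquisition of vma_lock afterwards, which is what caused the UAF. */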
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a9b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The bodies of damon_{reclaim,lru_sort}_apply_parameters() contain
duplicated code. This commit adds a common function,
damon_set_region_biggest_system_ram_default(), to remove the duplication.
Link: https://lkml.kernel.org/r/6329f00d.a70a0220.9bb29.3678SMTPIN_ADDED_BROKEN@mx.google.com
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Return the 'err' value when kstrtoul() fails, so the user will know why
it really failed. This is a small change: simply return the 'err' value
on failure.
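A sketch of the pattern being described (the helper and its surrounding DAMON
parameter-parsing context are hypothetical):

  static int parse_ulong(const char *buf, unsigned long *val)   /* hypothetical helper */
  {
          int err = kstrtoul(buf, 0, val);

          if (err)
                  return err;     /* propagate the real reason instead of a generic error */
          return 0;
  }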
Link: https://lkml.kernel.org/r/6329ebe0.050a0220.ec4bd.297cSMTPIN_ADDED_BROKEN@mx.google.com
Suggested-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 44042b4498 ("mm/page_alloc: allow high-order pages to be
stored on the per-cpu lists"), the per-cpu page allocator (PCP) is no
longer only for order-0 pages. Update the comments.
Link: https://lkml.kernel.org/r/20220918025640.208586-1-ran.xiaokai@zte.com.cn
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the beginning there was only one damos_action, 'DAMOS_PAGEOUT', that
needed the coldness score of a region for a scheme, and it used
damon_pageout_score() for that. But now other damos_action actions also
need the coldness score, so rename the function to damon_cold_score() to
make more sense.
Link: https://lkml.kernel.org/r/1663423014-28907-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When creating hugetlb pages, the hugetlb code must first allocate
contiguous pages from a low level allocator such as buddy, cma or
memblock. The pages returned from these low level allocators are ref
counted. This creates potential issues with other code taking speculative
references on these pages before they can be transformed to a hugetlb
page. This issue has been addressed with methods and code such as that
provided in [1].
Recent discussions about vmemmap freeing [2] have indicated that it would
be beneficial to freeze all sub pages, including the head page of pages
returned from low level allocators before converting to a hugetlb page.
This helps avoid races if we want to replace the page containing vmemmap
for the head page.
There have been proposals to change at least the buddy allocator to return
frozen pages as described at [3]. If such a change is made, it can be
employed by the hugetlb code. However, as mentioned above hugetlb uses
several low level allocators so each would need to be modified to return
frozen pages. For now, we can manually freeze the returned pages. This
is done in two places:
1) alloc_buddy_huge_page, only the returned head page is ref counted.
We freeze the head page, retrying once in the VERY rare case where
there may be an inflated ref count.
2) prep_compound_gigantic_page, for gigantic pages the current code
freezes all pages except the head page. New code will simply freeze
the head page as well.
In a few other places, code checks for inflated ref counts on newly
allocated hugetlb pages. With the modifications to freeze after
allocating, this code can be removed.
After hugetlb pages are freshly allocated, they are often added to the
hugetlb free lists. Since these pages were previously ref counted, this
was done via put_page() which would end up calling the hugetlb destructor:
free_huge_page. With changes to freeze pages, we simply call
free_huge_page directly to add the pages to the free list.
In a few other places, freshly allocated hugetlb pages were immediately
put into use, and the expectation was they were already ref counted. In
these cases, we must manually ref count the page.
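A minimal sketch of the freeze-with-retry idea for case 1 (buddy
allocations); names and error handling are simplified relative to the actual
hugetlb code:

  struct page *page = alloc_pages(gfp_mask, order);

  if (page && !page_ref_freeze(page, 1)) {
          /* Very rare: a speculative reference was taken; drop the page and retry once. */
          __free_pages(page, order);
          page = alloc_pages(gfp_mask, order);
          if (page && !page_ref_freeze(page, 1)) {
                  __free_pages(page, order);
                  page = NULL;
          }
  }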
[1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/
[2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
[3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/
[mike.kravetz@oracle.com: fix NULL pointer dereference]
Link: https://lkml.kernel.org/r/20220921202702.106069-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20220916214638.155744-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are no architectures that can have holes in the memory map within a
pageblock since commit 859a85ddf9 ("mm: remove pfn_valid_within() and
CONFIG_HOLES_IN_ZONE"). Update the corresponding comment.
Link: https://lkml.kernel.org/r/20220916072257.9639-17-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There's no need to check whether order > PAGE_ALLOC_COSTLY_ORDER again.
Minor readability improvement.
Link: https://lkml.kernel.org/r/20220916072257.9639-15-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The local variable buddy_pfn could be passed to buddy_merge_likely()
without initialization if the passed in order is MAX_ORDER - 1. This
looks buggy, but buddy_pfn won't be used in this case as there's an order >=
MAX_ORDER - 2 check. Init buddy_pfn to 0 anyway to avoid possible future
misuse.
Link: https://lkml.kernel.org/r/20220916072257.9639-14-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use helper macro SZ_1K and SZ_1M to do the size conversion. Minor
readability improvement.
Link: https://lkml.kernel.org/r/20220916072257.9639-13-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It's only used in mm/page_alloc.c now. Make it static.
Link: https://lkml.kernel.org/r/20220916072257.9639-12-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use local variable zone_idx directly since it holds the exact value of
zone_idx(). No functional change intended.
Link: https://lkml.kernel.org/r/20220916072257.9639-10-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In MIGRATE_ISOLATE case, zone freepage state shouldn't be modified as
caller will take care of it. Add missing is_migrate_isolate() here to
avoid possible unbalanced freepage state. This would happen if someone
isolates the block, and then we face an MCE failure/soft-offline on a page
within that block. __mod_zone_freepage_state() will be triggered via
the call trace below, which had already been triggered back when the
block was isolated:
take_page_off_buddy
break_down_buddy_pages
set_page_guard
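The shape of the fix follows directly from the description; a sketch of the
guarded accounting (its exact placement in page_alloc.c is simplified here):

  if (!is_migrate_isolate(migratetype))
          __mod_zone_freepage_state(zone, -(1 << order), migratetype);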
Link: https://lkml.kernel.org/r/20220916072257.9639-9-linmiaohe@huawei.com
Fixes: 06be6ff3d2 ("mm,hwpoison: rework soft offline for free pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The size of struct per_cpu_zonestat can be 0 on !SMP && !NUMA. In that
case, zone->per_cpu_zonestats will always be equal to boot_zonestats. But in
zone_pcp_reset(), zone->per_cpu_zonestats is freed via free_percpu()
directly without checking against boot_zonestats first. boot_zonestats
will be released by free_percpu() unexpectedly.
Link: https://lkml.kernel.org/r/20220916072257.9639-7-linmiaohe@huawei.com
Fixes: 28f836b677 ("mm/page_alloc: split per cpu page lists and zone stats")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It's only called by mm_init(). Add an __init annotation to it.
Link: https://lkml.kernel.org/r/20220916072257.9639-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 43c95bcc51 ("mm/page_alloc: reduce duration that IRQs are
disabled for VM counters"), zone_statistics() is not called with
interrupts disabled. Update the corresponding comment.
Link: https://lkml.kernel.org/r/20220916072257.9639-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 8b10b465d0 ("mm/page_alloc: free pages in a single pass
during bulk free"), they're not used anymore. Remove them.
Link: https://lkml.kernel.org/r/20220916072257.9639-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit b92ca18e8c ("mm/page_alloc: disassociate the pcp->high from
pcp->batch"), zone_pcp_update() is only used in mm/page_alloc.c. Move
zone_pcp_update() up to avoid forward declaration and then make it static.
No functional change intended.
Link: https://lkml.kernel.org/r/20220916072257.9639-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "A few cleanup patches for mm", v2.
This series contains a few cleanup patches to remove the obsolete comments
and functions, use helper macro to improve readability and so on. More
details can be found in the respective changelogs.
This patch (of 16):
If ALLOC_KSWAPD is set, wake_all_kswapds() will be called to ensure kswapd
doesn't accidentally go to sleep. But when reserve_flags is set,
alloc_flags will be overwritten and ALLOC_KSWAPD is thus lost. Preserve
the ALLOC_KSWAPD flag in alloc_flags to ensure kswapd won't go to sleep
accidentally.
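A sketch of the described fix in the slowpath's reserve_flags handling (the
helper name is assumed from that era of page_alloc.c):

  if (reserve_flags)
          alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags) |
                        (alloc_flags & ALLOC_KSWAPD);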
Link: https://lkml.kernel.org/r/20220916072257.9639-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220916072257.9639-2-linmiaohe@huawei.com
Fixes: 0a79cdad5e ("mm: use alloc_flags to record if kswapd can wake")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is no point in returning an int from damon_set_schemes(). It always
returns 0 which is meaningless for the caller, so change it to return void
directly.
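The visible change is confined to the prototype; a sketch (parameter list
assumed from include/linux/damon.h of the time):

  /* before */
  int damon_set_schemes(struct damon_ctx *ctx, struct damos **schemes,
                        ssize_t nr_schemes);
  /* after */
  void damon_set_schemes(struct damon_ctx *ctx, struct damos **schemes,
                         ssize_t nr_schemes);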
Link: https://lkml.kernel.org/r/1663341635-12675-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_lru_sort_wmarks is only used in lru_sort.c now, change it to static.
Link: https://lkml.kernel.org/r/20220915021024.4177940-2-yangyingliang@huawei.com
Fixes: 189aa3d58206 ("mm/damon/lru_sort: use watermarks parameters generator macro")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_reclaim_wmarks is only used in reclaim.c now, change it to static.
Link: https://lkml.kernel.org/r/20220915021024.4177940-1-yangyingliang@huawei.com
Fixes: 89dd02d8abd1 ("mm/damon/reclaim: use watermarks parameters generator macro")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We could use 'struct damon_target *' directly instead of 'void *' in the
target_valid() operation to make the code simpler.
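A sketch of what this means for the operations callback (the surrounding
damon_operations declaration in include/linux/damon.h is assumed):

  /* before */
  bool (*target_valid)(void *target);
  /* after */
  bool (*target_valid)(struct damon_target *t);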
Link: https://lkml.kernel.org/r/1663241621-13293-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_lru_sort_new_hot_scheme() and damon_lru_sort_new_cold_scheme() have
a lot in common, so we can combine them into a single function that only
needs to handle their differences.
[yangyingliang@huawei.com: change damon_lru_sort_stub_pattern to static]
Link: https://lkml.kernel.org/r/20220917121228.1889699-1-yangyingliang@huawei.com
Link: https://lkml.kernel.org/r/20220915133041.71819-1-sj@kernel.org
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In damon_sysfs_destroy_targets(), we call damon_target_has_pid() to check
whether the 'ctx' includes a valid pid, but there is no need to call
damon_target_has_pid() repeatedly; calling it once is enough.
[xhao@linux.alibaba.com: more simplified code calls damon_target_has_pid()]
Link: https://lkml.kernel.org/r/20220916133535.7428-1-xhao@linux.alibaba.com
Link: https://lkml.kernel.org/r/20220915142237.92529-1-xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Among other data, the CPU entry area holds exception stacks, so addresses from
this area can be passed to kmsan_get_metadata().
This previously led to kmsan_get_metadata() returning NULL, which in turn
resulted in a warning that triggered further attempts to call
kmsan_get_metadata() in the exception context, which quickly exhausted the
exception stack.
This patch allocates shadow and origin for the CPU entry area on x86 and
introduces arch_kmsan_get_meta_or_null(), which performs arch-specific
metadata mapping.
Link: https://lkml.kernel.org/r/20220928123219.1101883-1-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Fixes: 21d723a7c1409 ("kmsan: add KMSAN runtime core")
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Functions implementing the a_ops->write_end() interface accept the `void
*fsdata` parameter that is supposed to be initialized by the corresponding
a_ops->write_begin() (which accepts `void **fsdata`).
However not all a_ops->write_begin() implementations initialize `fsdata`
unconditionally, so it may get passed uninitialized to a_ops->write_end(),
resulting in undefined behavior.
Fix this by initializing fsdata with NULL before the call to
write_begin(), rather than doing so in all possible a_ops implementations.
This patch covers only the following cases found by running x86 KMSAN
under syzkaller:
- generic_perform_write()
- cont_expand_zero() and generic_cont_expand_simple()
- page_symlink()
Other cases of passing uninitialized fsdata may persist in the codebase.
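For example, in generic_perform_write() the change amounts to initializing
the local before handing it to ->write_begin(); a simplified sketch (loop and
error handling elided):

  struct page *page;
  void *fsdata = NULL;    /* previously left uninitialized */

  status = a_ops->write_begin(file, mapping, pos, bytes, &page, &fsdata);
  /* ... copy data into the page ... */
  status = a_ops->write_end(file, mapping, pos, bytes, copied, page, fsdata);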
Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
struct pt_regs passed into IRQ entry code is set up by uninstrumented asm
functions, therefore KMSAN may not notice the registers are initialized.
kmsan_unpoison_entry_regs() unpoisons the contents of struct pt_regs,
preventing potential false positives. Unlike kmsan_unpoison_memory(), it
can be called under kmsan_in_runtime(), which is often the case in IRQ
entry code.
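As a rough sketch (the handler below is illustrative, not the exact upstream
call site), the hook is meant to be invoked right at the start of an entry
handler, before any instrumented code reads the registers:

  #include <linux/kmsan.h>

  static void example_irq_entry(struct pt_regs *regs)
  {
      /* regs were filled in by uninstrumented asm; tell KMSAN they are
       * initialized.  Unlike kmsan_unpoison_memory(), this is safe to
       * call under kmsan_in_runtime(). */
      kmsan_unpoison_entry_regs(regs);

      /* ... regular IRQ entry work ... */
  }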
Link: https://lkml.kernel.org/r/20220915150417.722975-41-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heap and stack initialization is great, but not when we are trying to catch
uses of uninitialized memory. When the kernel is built with KMSAN, having
kernel memory initialization enabled may introduce false negatives.
We disable CONFIG_INIT_STACK_ALL_PATTERN and CONFIG_INIT_STACK_ALL_ZERO
under CONFIG_KMSAN, making it impossible to auto-initialize stack
variables in KMSAN builds. We also disable
CONFIG_INIT_ON_ALLOC_DEFAULT_ON and CONFIG_INIT_ON_FREE_DEFAULT_ON to
prevent accidental use of heap auto-initialization.
However, we still let users enable heap auto-initialization at boot time
(by setting init_on_alloc=1 or init_on_free=1), in which case a warning is
printed.
Link: https://lkml.kernel.org/r/20220915150417.722975-31-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Depending on the value of is_out, kmsan_handle_urb() either marks the
data copied to the kernel from a USB device as initialized, or checks
that the data sent to the device is initialized.
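As a hedged sketch of how the hook is used (the wrapper below is
hypothetical; only kmsan_handle_urb() itself comes from this patch):

  #include <linux/kmsan.h>
  #include <linux/usb.h>

  static void example_handle_urb(struct urb *urb, bool is_out)
  {
      /* is_out == true:  urb->transfer_buffer is about to be sent to the
       *                  device, so check that it is fully initialized.
       * is_out == false: the device filled urb->transfer_buffer, so mark
       *                  it as initialized. */
      kmsan_handle_urb(urb, is_out);
  }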
Link: https://lkml.kernel.org/r/20220915150417.722975-24-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kmsan_init_shadow() scans the mappings created at boot time and creates
metadata pages for those mappings.
When the memblock allocator returns pages to pagealloc, we reserve 2/3 of
those pages and use them as metadata for the remaining 1/3. Once KMSAN
starts, every page allocated by pagealloc has its associated shadow and
origin pages.
kmsan_initialize() initializes the bookkeeping for init_task and enables
KMSAN.
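A rough sketch of the boot-time reservation described above, simplified to
a single page order (helper names such as example_setup_meta() are
illustrative, not the actual mm/kmsan/init.c code):

  static struct page *held_shadow, *held_origin;

  /* returns true if @page may be released to the page allocator */
  static bool example_memblock_free_pages(struct page *page, int order)
  {
      if (!held_shadow) {
          held_shadow = page;     /* keep as future shadow */
          return false;
      }
      if (!held_origin) {
          held_origin = page;     /* keep as future origin */
          return false;
      }
      /* wire the two held pages up as metadata for @page */
      example_setup_meta(page, held_shadow, held_origin, order);
      held_shadow = held_origin = NULL;
      return true;
  }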
Link: https://lkml.kernel.org/r/20220915150417.722975-18-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In order to report uninitialized memory coming from heap allocations, KMSAN
has to poison them unless they're created with __GFP_ZERO.
Conveniently, the KMSAN hooks are needed in the same places where
init_on_alloc/init_on_free initialization is performed.
In addition, we apply __no_kmsan_checks to get_freepointer_safe() to
suppress reports when accessing freelist pointers that reside in freed
objects.
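For illustration, this is roughly how the annotation is used (a simplified
stand-in for get_freepointer_safe(), not the real SLUB code):

  /* Reads a freelist pointer stored inside a freed (poisoned) object;
   * __no_kmsan_checks tells KMSAN not to report these accesses. */
  __no_kmsan_checks
  static inline void *example_get_freepointer(void *object, size_t offset)
  {
      return *(void **)((char *)object + offset);
  }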
Link: https://lkml.kernel.org/r/20220915150417.722975-16-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For each memory location KernelMemorySanitizer maintains two types of
metadata:
1. The so-called shadow of that location - a byte:byte mapping describing
whether individual bits of memory are initialized (shadow is 0) or not
(shadow is 1).
2. The origins of that location - a 4-byte:4-byte mapping containing
4-byte IDs of the stack traces where uninitialized values were
created.
Each struct page now contains pointers to two struct pages holding KMSAN
metadata (shadow and origins) for the original struct page. Utility
routines in mm/kmsan/core.c and mm/kmsan/shadow.c handle the metadata
creation, addressing, copying and checking. mm/kmsan/report.c performs
error reporting in the cases when an uninitialized value is used in a way
that leads to undefined behavior.
KMSAN compiler instrumentation is responsible for tracking the metadata
along with the kernel memory. mm/kmsan/instrumentation.c provides the
implementation for instrumentation hooks that are called from files
compiled with -fsanitize=kernel-memory.
To aid parameter passing (also done at instrumentation level), each
task_struct now contains a struct kmsan_task_state used to track the
metadata of function parameters and return values for that task.
Finally, this patch provides CONFIG_KMSAN that enables KMSAN, and declares
CFLAGS_KMSAN, which are applied to files compiled with KMSAN. The
KMSAN_SANITIZE:=n Makefile directive can be used to completely disable
KMSAN instrumentation for certain files.
Similarly, KMSAN_ENABLE_CHECKS:=n disables KMSAN checks and makes newly
created stack memory initialized.
Users can also use functions from include/linux/kmsan-checks.h to mark
certain memory regions as uninitialized or initialized (this is called
"poisoning" and "unpoisoning") or check that a particular region is
initialized.
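A small usage sketch of the manual API (the prototypes shown are assumed
from the description above; the function is purely illustrative):

  #include <linux/kmsan-checks.h>

  static void example(void *buf, size_t len)
  {
      /* pretend @buf is uninitialized again, e.g. before reuse */
      kmsan_poison_memory(buf, len, GFP_KERNEL);

      /* ... uninstrumented code or a device fills @buf ... */

      /* declare @buf initialized so later reads are not reported */
      kmsan_unpoison_memory(buf, len);

      /* or: report immediately if any byte of @buf is uninitialized */
      kmsan_check_memory(buf, len);
  }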
Link: https://lkml.kernel.org/r/20220915150417.722975-12-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Some users (currently only KMSAN) may want to use spare bits in
depot_stack_handle_t. Let them do so by adding @extra_bits to
__stack_depot_save() to store arbitrary flags, and providing
stack_depot_get_extra_bits() to retrieve those flags.
Also adapt KASAN to the new prototype by passing extra_bits=0, as KASAN
does not intend to store additional information in the stack handle.
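A hedged sketch of how a user could stash a flag in the spare bits (the
exact parameter order of __stack_depot_save() is assumed from the
description above):

  #include <linux/stackdepot.h>
  #include <linux/stacktrace.h>

  static depot_stack_handle_t save_with_extra(unsigned int extra)
  {
      unsigned long entries[16];
      unsigned int nr = stack_trace_save(entries, ARRAY_SIZE(entries), 0);

      /* the extra bits ride along inside the returned handle */
      return __stack_depot_save(entries, nr, extra, GFP_NOWAIT, true);
  }

  static unsigned int read_extra(depot_stack_handle_t handle)
  {
      return stack_depot_get_extra_bits(handle);
  }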
Link: https://lkml.kernel.org/r/20220915150417.722975-3-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With the new hugetlb vma lock in place, it can also be used to handle page
fault races with file truncation. The lock is taken at the beginning of
the fault path in read mode. During truncation, it is taken in write
mode for each vma which has the file mapped. The file's size (i_size) is
modified before taking the vma lock to unmap.
How are races handled?
The page fault code checks i_size early in processing after taking the vma
lock. If the fault is beyond i_size, the fault is aborted. If the fault
is not beyond i_size the fault will continue and a new page will be added
to the file. It could be that truncation code modifies i_size after the
check in fault code. That is OK, as truncation code will soon remove the
page. The truncation code will wait until the fault is finished, as it
must obtain the vma lock in write mode.
This patch cleans up/removes late checks in the fault paths that try to
back out pages racing with truncation. As noted above, we just let the
truncation code remove the pages.
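As a hedged sketch of the ordering on the fault side (variable names are
illustrative and error handling is trimmed):

  /* fault path, simplified */
  hugetlb_vma_lock_read(vma);

  size = i_size_read(mapping->host) >> huge_page_shift(h);
  if (idx >= size) {
      /* fault is beyond i_size: truncation won the race, abort */
      hugetlb_vma_unlock_read(vma);
      return VM_FAULT_SIGBUS;
  }

  /* otherwise add the new page; a racing truncation must wait for the
   * vma lock in write mode and will remove the page afterwards */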
[mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The new hugetlb vma lock is used to address this race:
Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
The vma_lock is used as follows:
- During fault processing. The lock is acquired in read mode before
doing a page table lock and allocation (huge_pte_alloc). The lock is
held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
called.
Lock ordering issues come into play when unmapping a page from all
vmas mapping the page. The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare. This is done today in:
- try_to_migrate_one and try_to_unmap_ for page migration and memory
error handling. In these routines we 'try' to obtain the vma lock and
fail to unmap if unsuccessful. Calling routines already deal with the
failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch. This routine
also tries to acquire the vma lock. If it fails, it skips the
unmapping. However, we cannot have file truncation or hole punch
fail because of contention. After hugetlb_vmdelete_list, truncation
and hole punch call remove_inode_hugepages. remove_inode_hugepages
checks for mapped pages and calls hugetlb_unmap_file_page to unmap them.
hugetlb_unmap_file_page is designed to drop locks and reacquire in the
correct order to guarantee unmap success.
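A hedged sketch of the 'try' pattern used by the unmap paths described
above (simplified; the unshare step is elided):

  /* try_to_unmap-style caller, simplified */
  if (!hugetlb_vma_trylock_write(vma))
      return false;           /* give up; the caller handles the failure */

  /* safe to unshare PMDs for this vma now
   * (huge_pmd_unshare() and TLB flushing happen here) */

  hugetlb_vma_unlock_write(vma);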
Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for
synchronization use by vmas that could be involved in pmd sharing. This
data structure contains a rw semaphore that is the primary tool used for
synchronization.
This new structure is ref counted, so that it can exist when NOT attached
to a vma. This is only helpful in resolving lock ordering issues where
code may need to obtain the vma_lock while there is no guarantee that the
vma will not go away. By obtaining a ref on the structure, it can be guaranteed
that at least the rw semaphore will not go away.
Only add infrastructure for the new lock here. Actual use will be added
in subsequent patches.
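The shape of the new structure, as a sketch (the exact field layout is
illustrative; the ref count is what lets it outlive the vma):

  struct hugetlb_vma_lock {
      struct kref refs;               /* structure may outlive the vma */
      struct rw_semaphore rw_sema;    /* read: faults, write: unshare/unmap */
      struct vm_area_struct *vma;     /* owner, hung off vm_private_data */
  };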
[mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release]
Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey
Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rename the routine vma_shareable to vma_addr_pmd_shareable as it is
checking a specific address within the vma. Refactor code to check if an
aligned range is shareable as this will be needed in a subsequent patch.
Link: https://lkml.kernel.org/r/20220914221810.95771-6-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
remove_huge_page removes a hugetlb page from the page cache. Change to
hugetlb_delete_from_page_cache as it is a more descriptive name.
huge_add_to_page_cache is global in scope, but only deals with hugetlb
pages. For consistency and clarity, rename to hugetlb_add_to_page_cache.
Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing. However, this has been shown to cause
performance/scaling issues. Revert the code and go back to only taking
the semaphore in huge_pmd_share during the fault path.
Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.
NOTE: Reverting this code does expose the following race condition.
Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)
It is unknown if the above race was ever experienced by a user. It was
discovered via code inspection when initially addressed.
In subsequent patches, a new synchronization mechanism will be added to
coordinate pmd sharing and eliminate this race.
Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "hugetlb: Use new vma lock for huge pmd sharing
synchronization", v2.
hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7. At that time, a proposal to
address the regression was suggested [3] but went nowhere.
The regression and the benefit of this patch series are not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running,
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"
                               48 sample Avg
next-20220913             next-20220913                 next-20220913
unmodified                 revert i_mmap_sema locking    vma sema locking, this series
-----------------------------------------------------------------------------
498150 KB/s                501934 KB/s                   504793 KB/s
The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings. To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory. The shared
mapping size was 250GB. For comparison, a single instance of the program
was run. Then, multiple instances were run in parallel to introduce
lock contention. Changing the locking scheme results in a significant
performance benefit.
test              instances   unmodified     revert      vma
--------------------------------------------------------------------------
faults per sec        1         393043       395680     389932
faults per sec       24          71405        81191      79048
forks per sec         1           2802         2747       2725
forks per sec        24            439          536        500
Combined faults      24           1621        68070      53662
Combined forks       24            358           67        142
The combined test runs both the faulting program and the forking program
simultaneously.
Patches 1 and 2 of this series revert c0d0381ade and 87bf91d39b which
depends on c0d0381ade. Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share. With c0d0381ade reverted, this race is exposed:
Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)
Reverting 87bf91d39b exposes races in page fault/file truncation. When
the new vma lock is put to use in patch 8, this will handle the fault/file
truncation races. This is explained in patch 9 where code associated with
these races is cleaned up.
Patches 3 - 5 restructure existing code in preparation for using the new
vma lock (rw semaphore) for pmd sharing synchronization. The idea is that
this semaphore will be held in read mode for the duration of fault
processing, and held in write mode for unmap operations which may call
huge_pmd_unshare. Acquiring i_mmap_rwsem is also still required to
synchronize huge pmd sharing. However it is only required in the fault
path when setting up sharing, and will be acquired in huge_pmd_share().
Patch 6 adds the new vma lock and all supporting routines, but does not
actually change code to use the new lock.
Patch 7 refactors code in preparation for using the new lock. And, patch
8 finally adds code to make use of this new vma lock. Unfortunately, the
fault code and truncate/hole punch code would naturally take locks in the
opposite order which could lead to deadlock. Since the performance of
page faults is more important, the truncation/hole punch code is modified
to back out and take locks in the correct order if necessary.
[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/
This patch (of 9):
Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing. The use of i_mmap_rwsem to prevent
fault/truncate races depends on this. However, this has been shown to
cause performance/scaling issues. As a result, that code will be
reverted. Since the use of i_mmap_rwsem to address page fault/truncate races
depends on this, it must also be reverted.
In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.
Link: https://lkml.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20220914221810.95771-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The pointer variables are assigned from the allocation first and only then
checked, so there is no need to initialize them at declaration.
Link: https://lkml.kernel.org/r/20220914012113.6271-1-xupengfei@nfschina.com
Signed-off-by: XU pengfei <xupengfei@nfschina.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It's only used in mm/filemap.c, since commit ffa65753c431
("mm/migrate.c: rework migration_entry_wait() to not take a pageref").
Make it static.
Link: https://lkml.kernel.org/r/20220914021738.3228011-1-sunke@kylinos.cn
Signed-off-by: Ke Sun <sunke@kylinos.cn>
Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The memory-notify-based approach aims to handle memory-less nodes; however,
it just adds complexity to the code, as pointed out by David in thread [1].
The handling of memory-less nodes was introduced by commit 4faf8d950e
("hugetlb: handle memory hot-plug events"). From its commit message, we
cannot find any necessity for handling this case. So, we can simply
register/unregister sysfs entries in register_node/unregister_node to
simplify the code.
BTW, the hotplug callback was added because hugetlb_register_all_nodes()
registers sysfs nodes only for N_MEMORY nodes; see commit 9b5e5d0fdc,
which said it was a preparation for handling memory-less nodes via memory
hotplug. Since we want to remove the memory hotplug handling, make sure we
only register per-node sysfs for online (N_ONLINE) nodes in
hugetlb_register_all_nodes().
https://lore.kernel.org/linux-mm/60933ffc-b850-976c-78a0-0ee6e0ea9ef0@redhat.com/ [1]
Link: https://lkml.kernel.org/r/20220914072603.60293-3-songmuchun@bytedance.com
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "simplify handling of per-node sysfs creation and removal",
v4.
This patch (of 2):
The following commit offloaded per-node sysfs creation and removal to a
kworker but did not say why that is needed. It also said "I don't know
that this is absolutely required", so it seems the author was not sure
either. Since it only complicates the code, this patch reverts the
changes to simplify the code.
39da08cb07 ("hugetlb: offload per node attribute registrations")
We could use the memory hotplug notifier to do per-node sysfs creation and
removal instead of inserting those operations into node registration and
unregistration. That would reduce the code coupling between node.c and
hugetlb.c and also simplify the code.
Link: https://lkml.kernel.org/r/20220914072603.60293-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20220914072603.60293-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace the simple calculation with PAGE_ALIGN.
Link: https://lkml.kernel.org/r/20220913015505.1998958-1-zuoze1@huawei.com
Signed-off-by: ze zuo <zuoze1@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The name "check_free_page()" provides no information regarding its return
value when the page is indeed found to be bad.
Renaming it to "free_page_is_bad()" makes it clear that a `true' return
value means the page was bad.
And make it return a bool, not an int.
[akpm@linux-foundation.org: don't use bool as int]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: ke.wang <ke.wang@unisoc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Zhaoyang Huang <huangzhaoyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use kstrtobool(), which is more flexible and can handle all kinds of
parameters like 'Yy1Nn0' or [oO][NnFf] for "on" and "off".
Link: https://lkml.kernel.org/r/20220913071358.1812206-1-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the 'kdamond_wait_activation()' function or the 'after_sampling()' or
'after_aggregation()' DAMON callbacks return an error, it is unnecessary
to use the bool 'done' to check if kdamond should be finished. This commit
simplifies the kdamond stop mechanism by removing 'done' and breaking out
of the while loop directly in those cases.
Link: https://lkml.kernel.org/r/1663060287-30201-4-git-send-email-kaixuxia@tencent.com
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We can initialize the variable 'pid' with '-1' in pid_show() to simplify
the variable assignment operation and make the code more readable.
Link: https://lkml.kernel.org/r/1663060287-30201-3-git-send-email-kaixuxia@tencent.com
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_lru_sort_new_{hot,cold}_scheme() contain quite a lot of duplicated
code. This commit factors out the common part into a separate function
and uses it to reduce the duplication.
Link: https://lkml.kernel.org/r/20220913174449.50645-23-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit makes DAMON_LRU_SORT to generate the module parameters for
DAMOS watermarks using the generator macro to simplify the code and reduce
duplicates.
Link: https://lkml.kernel.org/r/20220913174449.50645-22-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit makes DAMON_RECLAIM to generate the module parameters for
DAMOS quotas using the generator macro to simplify the code and reduce
duplicates.
Link: https://lkml.kernel.org/r/20220913174449.50645-21-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_LRU_SORT has module parameters for the DAMOS time quota but not for
the size quota. This commit implements a macro for generating the module
parameters so that it can be reused later.
Link: https://lkml.kernel.org/r/20220913174449.50645-20-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_RECLAIM and DAMON_LRU_SORT have module parameters for DAMOS quotas
that have the same names. This commit implements a macro for generating
such module parameters so that it can be reused later.
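A hedged sketch of the generator idea (the macro name and the quota field
names below are assumptions, not the exact upstream macro):

  /* generate the identically named quota parameters for a module */
  #define EXAMPLE_DAMON_QUOTA_PARAMS(quota)                             \
          module_param_named(quota_ms, quota.ms, ulong, 0600);          \
          module_param_named(quota_sz, quota.sz, ulong, 0600);          \
          module_param_named(quota_reset_interval_ms,                   \
                             quota.reset_interval, ulong, 0600)

  static struct damos_quota damon_reclaim_quota;
  EXAMPLE_DAMON_QUOTA_PARAMS(damon_reclaim_quota);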
Link: https://lkml.kernel.org/r/20220913174449.50645-19-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit makes DAMON_LRU_SORT to generate the module parameters for
DAMOS statistics using the generator macro to simplify the code and reduce
duplicates.
Link: https://lkml.kernel.org/r/20220913174449.50645-18-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit makes DAMON_RECLAIM to generate the module parameters for
DAMOS statistics using the generator macro to simplify the code and
reduce duplicates.
Link: https://lkml.kernel.org/r/20220913174449.50645-17-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_RECLAIM and DAMON_LRU_SORT have module parameters for DAMOS
statistics that have the same names. This commit implements a macro for
generating such module parameters so that it can be reused later.
Link: https://lkml.kernel.org/r/20220913174449.50645-16-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit makes DAMON_RECLAIM to generate the module parameters for
DAMOS watermarks using the generator macro to simplify the code and reduce
duplicates.
Link: https://lkml.kernel.org/r/20220913174449.50645-15-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit makes DAMON_LRU_SORT to generate the module parameters for
DAMOS watermarks using the generator macro to simplify the code and reduce
duplicates.
Link: https://lkml.kernel.org/r/20220913174449.50645-14-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_RECLAIM and DAMON_LRU_SORT have module parameters for watermarks
that have the same names. This commit implements a macro for generating
such module parameters so that it can be reused later.
Link: https://lkml.kernel.org/r/20220913174449.50645-13-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit makes DAMON_RECLAIM to generate the module parameters for
DAMON monitoring attributes using the generator macro to simplify the code
and reduce duplicates.
Link: https://lkml.kernel.org/r/20220913174449.50645-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit makes DAMON_LRU_SORT to generate the module parameters for
DAMON monitoring attributes using the generator macro to simplify the code
and reduce duplicates.
Link: https://lkml.kernel.org/r/20220913174449.50645-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_RECLAIM and DAMON_LRU_SORT have module parameters for monitoring
attributes that have the same names. This commit implements a macro for
generating such module parameters so that it can be reused later.
Link: https://lkml.kernel.org/r/20220913174449.50645-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_LRU_SORT receives the monitoring attributes via separate module
parameters stored in individual variables, and then combines those into
'struct damon_attrs'. This commit makes the module store the parameter
values directly in a static 'struct damon_attrs' variable and use it, to
simplify the code.
Link: https://lkml.kernel.org/r/20220913174449.50645-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_RECLAIM receives the monitoring attributes via separate module
parameters stored in individual variables, and then combines those into
'struct damon_attrs'. This commit makes the module store the parameter
values directly in a static 'struct damon_attrs' variable and use it, to
simplify the code.
Link: https://lkml.kernel.org/r/20220913174449.50645-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The number of parameters for 'damon_set_attrs()' is six. As that can be
confusing and verbose, this commit reduces the number by receiving a
single pointer to a 'struct damon_attrs'.
Link: https://lkml.kernel.org/r/20220913174449.50645-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON monitoring attributes are directly defined as fields of 'struct
damon_ctx'. This makes 'struct damon_ctx' a little long and complicated.
This commit defines and uses a struct, 'struct damon_attrs', which is
dedicated to the monitoring attributes only, to make the purpose of the
five values clearer and to simplify 'struct damon_ctx'.
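Roughly, the dedicated struct groups the five monitoring attributes (the
layout below is illustrative):

  struct damon_attrs {
      unsigned long sample_interval;      /* sampling interval (us) */
      unsigned long aggr_interval;        /* aggregation interval (us) */
      unsigned long ops_update_interval;  /* ops update interval (us) */
      unsigned long min_nr_regions;
      unsigned long max_nr_regions;
  };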
Link: https://lkml.kernel.org/r/20220913174449.50645-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The 'struct damos' creation function, 'damon_new_scheme()', initializes
the private fields of 'struct damos_quota' inline. As this is verbose and
makes the function unnecessarily long, this commit factors it out into a
separate function.
Link: https://lkml.kernel.org/r/20220913174449.50645-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function for new 'struct damos' creation, 'damon_new_scheme()', copies
each field of the struct one by one, though it could simply be copied
struct to struct. This commit replaces the unnecessarily verbose
field-to-field copies with struct-to-struct copies to make the code simple
and short.
Link: https://lkml.kernel.org/r/20220913174449.50645-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The bodies of damon_pa_{mark_accessed,deactivate_pages}() contain
duplicated code. This commit factors out the common part into a separate
function and removes the duplication.
Link: https://lkml.kernel.org/r/20220913174449.50645-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: cleanup code".
DAMON code was not so clean from the beginning, but it has become too
messy nowadays, especially due to the duplicates in DAMON_RECLAIM and
DAMON_LRU_SORT. This patchset cleans up some of the mess.
This patch (of 22):
The 'switch-case' statement in the 'damon_va_apply_scheme()' function
provides a 'case' for every supported DAMOS action, lets all
not-yet-supported DAMOS actions fall through to the 'default' case, and
comments it so that people can easily know which actions are supported.
Its counterpart in 'paddr', 'damon_pa_apply_scheme()', however, doesn't.
This commit makes the 'paddr' side function follow the pattern of 'vaddr'
for better readability and consistency.
Link: https://lkml.kernel.org/r/20220913174449.50645-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20220913174449.50645-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In damon_lru_sort_apply_parameters(), we can use damon_set_schemes()
instead of the original way of creating the first 'scheme', which makes
the code look cleaner.
Link: https://lkml.kernel.org/r/20220911005917.835-1-xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kdamond is implemented as a periodical split-merge pattern, which will
create and destroy regions possibly at high frequency (hundreds or even
thousands per second), depending on the number of regions and the
aggregation period. In that case, kmalloc and kfree could bring speed and
space overheads, which can be reduced by using a private kmem cache.
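A minimal sketch of the private-cache approach using KMEM_CACHE(), as
mentioned in the fix-up note below (names and flags are illustrative):

  static struct kmem_cache *damon_region_cache __ro_after_init;

  static int __init example_damon_init(void)
  {
      damon_region_cache = KMEM_CACHE(damon_region, 0);
      return damon_region_cache ? 0 : -ENOMEM;
  }

  /* kdamond's frequent region churn then goes through the cache: */
  struct damon_region *r = kmem_cache_alloc(damon_region_cache, GFP_KERNEL);
  /* ... */
  kmem_cache_free(damon_region_cache, r);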
[set_pte_at@outlook.com: creating kmem cache for damon regions by KMEM_CACHE()]
Link: https://lkml.kernel.org/r/Message-ID:
Link: https://lkml.kernel.org/r/TYCP286MB2323DA1894FA55BB9CF90978CA449@TYCP286MB2323.JPNP286.PROD.OUTLOOK.COM
Signed-off-by: Dawei Li <set_pte_at@outlook.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We can use the 'damon_sysfs_kdamond_running()' wrapper directly to check
if the kdamond is running in 'damon_sysfs_turn_damon_on()'.
Link: https://lkml.kernel.org/r/1662995513-24489-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>