WSL2-Linux-Kernel/drivers/base
Oscar Salvador a08a2ae346 mm,memory_hotplug: allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section.  Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) existing memory is consumed for that purpose
    (e.g. ~2MB per 128MB memory section on x86_64).
    This can even lead to extreme cases where the system goes OOM because
    the physically hotplugged memory depletes the available memory before
    it is onlined.
 b) if the whole node is movable then we have off-node struct pages,
    which have performance drawbacks.
 c) there might be no PMD_ALIGNED chunks, so the memmap array gets
    populated with base pages.
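
The "~2MB per 128MB" figure in a) can be checked with a little arithmetic.
The sketch below is illustrative only: the 4 KiB page size and the 64-byte
struct page are assumptions for a typical x86_64 configuration, and the
EX_* names and ex_memmap_bytes() are hypothetical, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for a typical x86_64 build; not taken from the patch. */
#define EX_PAGE_SIZE        4096ULL         /* 4 KiB base page */
#define EX_STRUCT_PAGE_SIZE 64ULL           /* assumed sizeof(struct page) */
#define EX_SECTION_SIZE     (128ULL << 20)  /* 128 MiB memory section */

/* Bytes of memmap (struct page array) needed to describe `bytes` of memory. */
static uint64_t ex_memmap_bytes(uint64_t bytes)
{
	/* The range is expected to be a whole number of pages. */
	assert(bytes % EX_PAGE_SIZE == 0);
	return (bytes / EX_PAGE_SIZE) * EX_STRUCT_PAGE_SIZE;
}
```

With these assumptions a 128 MiB section has 32768 pages, so its memmap
takes 32768 * 64 bytes = 2 MiB.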

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory.  That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables.  This implementation uses the beginning of the hotplugged memory
for that purpose.
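
As a rough illustration of that layout, the sketch below splits a hotadded
block into the leading pages that back the memmap and the remainder that is
left for regular use.  struct ex_block_layout and ex_layout() are
hypothetical helpers written for this example, and the page and struct page
sizes are passed in as parameters because they are assumptions here.

```c
#include <assert.h>
#include <stdint.h>

struct ex_block_layout {
	uint64_t vmemmap_start_pfn; /* first pfn backing the memmap */
	uint64_t nr_vmemmap_pages;  /* pages reserved for the memmap */
	uint64_t usable_start_pfn;  /* first pfn left for regular use */
	uint64_t nr_usable_pages;   /* pages left for regular use */
};

/* Reserve the beginning of the block for its own struct page array. */
static struct ex_block_layout ex_layout(uint64_t start_pfn, uint64_t nr_pages,
					uint64_t page_size,
					uint64_t struct_page_size)
{
	struct ex_block_layout l;
	uint64_t memmap_bytes = nr_pages * struct_page_size;

	assert(page_size != 0);

	l.vmemmap_start_pfn = start_pfn;
	/* Round the memmap size up to whole pages. */
	l.nr_vmemmap_pages = (memmap_bytes + page_size - 1) / page_size;
	assert(l.nr_vmemmap_pages < nr_pages);

	l.usable_start_pfn = start_pfn + l.nr_vmemmap_pages;
	l.nr_usable_pages = nr_pages - l.nr_vmemmap_pages;
	return l;
}
```

For a 128 MiB block (32768 pages of 4 KiB) and an assumed 64-byte struct
page, the first 512 pages hold the memmap and 32256 pages remain usable.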

There are some non-obvious things to consider though.

Vmemmap pages are allocated/freed during the memory hotplug events
(add_memory_resource(), try_remove_memory()) when the memory is
added/removed.  This means that the reserved physical range is not
online although it is used.  The most obvious side effect is that
pfn_to_online_page() returns NULL for those pfns.  The current design
expects that this should be OK as the hotplugged memory is considered
garbage until it is onlined.  For example, hibernation wouldn't save the
content of those vmemmaps into the image so it wouldn't be restored on
resume, but this should be OK as there is no real content to recover anyway
while metadata is reachable from other data structures (e.g.  vmemmap
page tables).

The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory).  That is done by extracting page
allocator independent initialization from the regular onlining path.
The primary reason to handle the reserved space outside of
{on,off}line_pages is to make each initialization specific to the
purpose rather than special-casing them in a single function.

As per above, the functions that are introduced are:

 - mhp_init_memmap_on_memory:
   Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
   kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
   fully span.

 - mhp_deinit_memmap_on_memory:
   Offlines as many sections as vmemmap pages fully span, removes the
   range from the zone by remove_pfn_range_from_zone(), and calls
   kasan_remove_zero_shadow() for the range.

The new function memory_block_online() calls mhp_init_memmap_on_memory()
before doing the actual online_pages().  Should online_pages() fail, we
clean up by calling mhp_deinit_memmap_on_memory().  present_pages is
adjusted at the end, once we know that online_pages() succeeded.
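
The ordering of the online path can be sketched in userspace as follows.
This is a toy model, not the kernel code: struct ex_zone and the ex_*
functions are hypothetical stand-ins for the zone accounting,
mhp_init_memmap_on_memory(), mhp_deinit_memmap_on_memory() and
online_pages(), and the fail_online knob exists only to exercise the error
path.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for the accounting touched while onlining. */
struct ex_zone {
	long present_pages;      /* pages accounted as present */
	int  memmap_initialized; /* 1 while the vmemmap range is set up */
	int  fail_online;        /* test knob: make ex_online_pages() fail */
};

/* Stand-in for mhp_init_memmap_on_memory(). */
static int ex_init_memmap(struct ex_zone *z)
{
	z->memmap_initialized = 1;
	return 0;
}

/* Stand-in for mhp_deinit_memmap_on_memory(). */
static void ex_deinit_memmap(struct ex_zone *z)
{
	z->memmap_initialized = 0;
}

/* Stand-in for online_pages(): accounts only the pages it onlines. */
static int ex_online_pages(struct ex_zone *z, long nr_pages)
{
	if (z->fail_online)
		return -EBUSY;
	z->present_pages += nr_pages;
	return 0;
}

static int ex_memory_block_online(struct ex_zone *z, long nr_pages,
				  long nr_vmemmap_pages)
{
	int ret;

	assert(nr_vmemmap_pages >= 0 && nr_vmemmap_pages <= nr_pages);

	/* 1. Initialize the reserved vmemmap range first. */
	if (nr_vmemmap_pages) {
		ret = ex_init_memmap(z);
		if (ret)
			return ret;
	}

	/* 2. Online the rest of the block; undo step 1 on failure. */
	ret = ex_online_pages(z, nr_pages - nr_vmemmap_pages);
	if (ret) {
		if (nr_vmemmap_pages)
			ex_deinit_memmap(z);
		return ret;
	}

	/* 3. Only now account the vmemmap pages as present. */
	z->present_pages += nr_vmemmap_pages;
	return 0;
}
```

Deferring step 3 until online_pages() has succeeded means the failure path
has no accounting to undo.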

On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages before calling offline_pages().  This is necessary because
offline_pages() tears down some structures depending on whether the
node or the zone becomes empty.  If offline_pages() fails, we account the
vmemmap pages back.  If it succeeds, we call mhp_deinit_memmap_on_memory().
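
The offline ordering can be modelled the same way.  Again this is only a
sketch under stated assumptions: struct ex_zone and the ex_* functions are
hypothetical stand-ins, and the saw_empty flag merely imitates the
"did the zone become empty" decision that offline_pages() makes from
present_pages.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for the accounting touched while offlining. */
struct ex_zone {
	long present_pages;      /* pages accounted as present */
	int  memmap_initialized; /* 1 while the vmemmap range is set up */
	int  fail_offline;       /* test knob: make ex_offline_pages() fail */
	int  saw_empty;          /* 1 if offlining saw the zone become empty */
};

/* Stand-in for mhp_deinit_memmap_on_memory(). */
static void ex_deinit_memmap(struct ex_zone *z)
{
	z->memmap_initialized = 0;
}

/* Stand-in for offline_pages(): decides emptiness from present_pages. */
static int ex_offline_pages(struct ex_zone *z, long nr_pages)
{
	if (z->fail_offline)
		return -EBUSY;
	z->present_pages -= nr_pages;
	z->saw_empty = (z->present_pages == 0);
	return 0;
}

static int ex_memory_block_offline(struct ex_zone *z, long nr_pages,
				   long nr_vmemmap_pages)
{
	int ret;

	assert(nr_vmemmap_pages >= 0 && nr_vmemmap_pages <= nr_pages);

	/* 1. Unaccount vmemmap pages so the emptiness check is accurate. */
	z->present_pages -= nr_vmemmap_pages;

	/* 2. Offline the rest; on failure, account the pages back. */
	ret = ex_offline_pages(z, nr_pages - nr_vmemmap_pages);
	if (ret) {
		z->present_pages += nr_vmemmap_pages;
		return ret;
	}

	/* 3. Tear down the vmemmap range only after a successful offline. */
	if (nr_vmemmap_pages)
		ex_deinit_memmap(z);
	return 0;
}
```

If step 1 were skipped in this model, present_pages would still be 512
after offlining the last block and the zone would never be seen as empty.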

Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.
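
 The granularity check described above can be sketched as follows.  This is
 a simplified model, not the kernel implementation: struct ex_mem_block and
 ex_check_remove() are hypothetical, the range is expressed in whole memory
 blocks rather than bytes, and -EINVAL is used as the refusal code for
 illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy memory block: only records how many vmemmap pages it carries. */
struct ex_mem_block {
	uint64_t nr_vmemmap_pages;
};

/*
 * Check whether blocks [first, first + count) may be removed together.
 * Returns 0 if the removal may proceed, -EINVAL if the range mixes
 * memmap-on-memory blocks with a multi-block removal.
 */
static int ex_check_remove(const struct ex_mem_block *blocks,
			   size_t first, size_t count)
{
	int has_vmemmap = 0;
	size_t i;

	assert(blocks != NULL);

	/* a) does any block in the range have vmemmap pages? */
	for (i = first; i < first + count; i++) {
		if (blocks[i].nr_vmemmap_pages)
			has_vmemmap = 1;
	}

	/* b) such memory must be removed one memory block at a time. */
	if (has_vmemmap && count > 1)
		return -EINVAL;

	return 0;
}
```

 A single block with vmemmap pages passes, as does a multi-block range
 without any; only the combination of both is refused.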

 If all is good and the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does the right
 thing and calls vmem_altmap_free instead of free_pagetable.

Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:26 -07:00
firmware_loader drivers: base: fix some kernel-doc markups 2020-11-09 18:56:49 +01:00
power Merge branches 'pm-docs' and 'pm-tools' 2021-04-26 17:00:14 +02:00
regmap regmap-irq: Fix dereference of a potentially null d->virt_buf 2021-04-07 16:58:33 +01:00
test kunit: software node: adhear to KUNIT formatting standard 2021-04-15 08:56:27 +02:00
Kconfig RISC-V Patches for the 5.12 Merge Window 2021-02-26 10:28:35 -08:00
Makefile numa: Move numa implementation to common code 2021-01-14 15:08:55 -08:00
arch_numa.c arch_numa: fix common code printing of phys_addr_t 2021-02-18 23:18:04 -08:00
arch_topology.c arch_topology: Export arch_freq_scale and helpers 2021-03-12 10:35:57 +05:30
attribute_container.c driver core: attribute_container: remove kernel-doc warnings 2021-04-02 16:40:07 +02:00
auxiliary.c driver core: auxiliary bus: Remove unneeded module bits 2021-03-23 10:47:55 +01:00
base.h driver core: Improve fw_devlink & deferred_probe_timeout interaction 2021-04-05 09:17:56 +02:00
bus.c drivers: base: change 'driver_create_groups' to 'driver_add_groups' in printk 2021-01-27 14:35:09 +01:00
cacheinfo.c
class.c drivers: base: fix some kernel-doc markups 2020-11-09 18:56:49 +01:00
component.c driver core: component: remove dentry pointer in "struct master" 2021-03-23 10:49:02 +01:00
container.c
core.c driver core: Improve fw_devlink & deferred_probe_timeout interaction 2021-04-05 09:17:56 +02:00
cpu.c drivers/base/cpu: remove redundant assignment of variable retval 2021-03-23 14:56:50 +01:00
dd.c Linux 5.12-rc7 2021-04-14 19:53:39 +02:00
devcoredump.c devcoredump: fix kernel-doc warning 2021-04-02 16:40:08 +02:00
devres.c driver core: Replace printf() specifier and drop unneeded casting 2021-04-02 17:02:45 +02:00
devtmpfs.c devtmpfs: actually reclaim some init memory 2021-03-23 14:57:35 +01:00
driver.c
firmware.c
hypervisor.c
init.c driver core: auxiliary bus: Fix calling stage for auxiliary bus init 2021-02-11 08:43:03 +01:00
isa.c isa: Make the remove callback for isa drivers return void 2021-01-26 07:42:27 +01:00
map.c
memory.c mm,memory_hotplug: allocate memmap from the added memory range 2021-05-05 11:27:26 -07:00
module.c
node.c node: fix device cleanups in error handling code 2021-04-10 11:10:21 +02:00
pinctrl.c
platform-msi.c platform-msi: fix kernel-doc warnings 2021-04-02 16:40:08 +02:00
platform.c Revert "driver core: platform: Make platform_get_irq_optional() optional" 2021-04-07 11:47:56 +02:00
property.c media: device property: Call fwnode_graph_get_endpoint_by_id() for fwnode->secondary 2021-01-26 19:24:18 +01:00
soc.c soc: fix comment for freeing soc_dev_attr 2020-12-09 19:46:31 +01:00
swnode.c software node: Allow node addition to already existing device 2021-04-15 10:36:07 +02:00
syscore.c
topology.c
transport_class.c