WSL2-Linux-Kernel/mm
KOSAKI Motohiro 31f1de46b9 mempolicy: silently restrict nodemask to allowed nodes
Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that problem.

Some background:

numactl --interleave=all calls set_mempolicy(2) with a fully populated
[out to MAXNUMNODES] nodemask.  set_mempolicy() [in do_set_mempolicy()]
calls contextualize_policy() which requires that the nodemask be a
subset of the current task's mems_allowed; else EINVAL will be returned.

A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
i.e., nodes with memory.  So, a fully populated nodemask will be
declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
         with restricted mem_allowed--for the same reason:
         node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the nodemask
that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy() will
complain [again, resulting in EINVAL] if the nodemask contains any
memoryless nodes.  This is somewhat redundant as mpol_new() will remove
memoryless nodes for interleave policy, as will bind_zonelist()--called
by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy logic to:
   a) remember whether the incoming node mask is empty.
   b) if not, restrict the nodemask to allowed nodes, as is
      currently done in-line for mbind().  This guarantees
      that the resulting mask includes only nodes with memory.

      NOTE:  this is a [benign, IMO] change in behavior for
             set_mempolicy().  Dis-allowed nodes will be
             silently ignored, rather than returning an error.

   c) fold this code into mpol_check_policy(), replace 2 calls to
      contextualize_policy() to call mpol_check_policy() directly
      and remove contextualize_policy().

2) In existing mpol_check_policy() logic, after "contextualization":
   a) MPOL_DEFAULT:  require that in coming mask "was_empty"
   b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
      contains at least one node.
   c) add a case for MPOL_PREFERRED:  if in coming was not empty
      and resulting mask IS empty, user specified invalid nodes.
      Return EINVAL.
   c) remove the now redundant check for memoryless nodes

3) remove the now redundant masking of policy nodes for interleave
   policy from mpol_new().

4) Now that mpol_check_policy() contextualizes the nodemask, remove
   the in-line nodes_and() from sys_mbind().  I believe that this
   restores mbind() to the behavior before the memoryless-nodes
   patch series.  E.g., we'll no longer treat an invalid nodemask
   with MPOL_PREFERRED as local allocation.

[ Patch history:

  v1 -> v2:
   - Communicate whether or not incoming node mask was empty to
     mpol_check_policy() for better error checking.
   - As suggested by David Rientjes, remove the now unused
     cpuset_nodes_subset_current_mems_allowed() from cpuset.h

  v2 -> v3:
   - As suggested by Kosaki Motohito, fold the "contextualization"
     of policy nodemask into mpol_check_policy().  Looks a little
     cleaner. ]

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by:      KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by:       David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-11 20:48:29 -08:00
..
Kconfig
Makefile Memory controller: cgroups setup 2008-02-07 08:42:18 -08:00
allocpercpu.c PERCPU : __percpu_alloc_mask() can dynamically size percpu_data storage 2008-02-06 10:41:04 -08:00
backing-dev.c
bootmem.c Introduce flags for reserve_bootmem() 2008-02-07 08:42:25 -08:00
bounce.c
dmapool.c
fadvise.c check ADVICE of fadvise64_64 even if get_xip_page is given 2008-02-05 09:44:19 -08:00
filemap.c kill do_generic_mapping_read 2008-02-08 09:22:39 -08:00
filemap_xip.c Use pgoff_t instead of unsigned long 2008-02-08 09:22:32 -08:00
fremap.c sys_remap_file_pages: fix ->vm_file accounting 2008-02-05 09:44:07 -08:00
highmem.c mm: remove fastcall from mm/ 2008-02-05 09:44:18 -08:00
hugetlb.c hugetlb: add locking for overcommit sysctl 2008-02-08 09:22:23 -08:00
internal.h set_page_refcounted() VM_BUG_ON fix 2008-02-05 09:44:19 -08:00
madvise.c
memcontrol.c memcontrol: add vm_match_cgroup() 2008-02-09 11:08:33 -08:00
memory.c Be more robust about bad arguments in get_user_pages() 2008-02-11 20:44:44 -08:00
memory_hotplug.c Page allocator: clean up pcp draining functions 2008-02-05 09:44:17 -08:00
mempolicy.c mempolicy: silently restrict nodemask to allowed nodes 2008-02-11 20:48:29 -08:00
mempool.c
migrate.c bugfix for memory cgroup controller: migration under memory controller fix 2008-02-07 08:42:19 -08:00
mincore.c
mlock.c
mmap.c mm: special mapping nopage 2008-02-08 18:57:39 -08:00
mmzone.c
mprotect.c
mremap.c
msync.c
nommu.c nommu: add new vmalloc_user() and remap_vmalloc_range() interfaces. 2008-02-05 09:44:21 -08:00
oom_kill.c oom: add sysctl to enable task memory dump 2008-02-07 08:42:19 -08:00
page-writeback.c writeback: speed up writeback of big dirty files 2008-02-05 09:44:19 -08:00
page_alloc.c misc: removal of final callers using fastcall 2008-02-08 09:22:31 -08:00
page_io.c mm: fix PageUptodate data race 2008-02-05 09:44:19 -08:00
page_isolation.c
pagewalk.c maps4: introduce a generic page walker 2008-02-05 09:44:16 -08:00
pdflush.c
prio_tree.c
quicklist.c
readahead.c
rmap.c memcontrol: add vm_match_cgroup() 2008-02-09 11:08:33 -08:00
shmem.c mount-options-fix-tmpfs-fix 2008-02-08 09:22:41 -08:00
shmem_acl.c
slab.c
slob.c slob: reduce external fragmentation by using three free lists 2008-02-05 09:44:19 -08:00
slub.c SLUB: fix checkpatch warnings 2008-02-07 17:52:39 -08:00
sparse-vmemmap.c
sparse.c mm: fix section mismatch warning in sparse.c 2008-02-05 09:44:19 -08:00
swap.c Memory controller: add per cgroup LRU and reclaim 2008-02-07 08:42:18 -08:00
swap_state.c memcgroup: revert swap_state mods 2008-02-07 08:42:20 -08:00
swapfile.c memcgroup: reinstate swapoff mod 2008-02-07 08:42:19 -08:00
thrash.c
tiny-shmem.c Remove unused code from mm/tiny-shmem.c 2008-02-05 09:44:17 -08:00
truncate.c page migraton: handle orphaned pages 2008-02-05 09:44:19 -08:00
util.c
vmalloc.c CONFIG_HIGHPTE vs. sub-page page tables. 2008-02-08 09:22:42 -08:00
vmscan.c per-zone and reclaim enhancements for memory controller: modifies vmscan.c for isolate globa/cgroup lru activity 2008-02-07 08:42:22 -08:00
vmstat.c vmstat: remove prefetch 2008-02-05 09:44:18 -08:00