WSL2-Linux-Kernel/Documentation
Colin Cross 9a10064f56 mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use.  At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.).  Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.

On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc.  This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages.  It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).

If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.

Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes.  It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace.  Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace.  It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.

This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas.  The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].

Userspace can set the name for a region of memory by calling

   prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)

Setting the name to NULL clears it.  The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.

Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps.  Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.

The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string.  Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name.  The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.

CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature.  It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.

The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names.  In that design, name
pointers could be shared between vmas.  However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.

One big concern is about fork() performance which would need to strdup
anonymous vma names.  Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4].  I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.

This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name.  Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.

[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/

Changes for prctl(2) manual page (in the options section):

PR_SET_VMA
	Sets an attribute specified in arg2 for virtual memory areas
	starting from the address specified in arg3 and spanning the
	size specified	in arg4. arg5 specifies the value of the attribute
	to be set. Note that assigning an attribute to a virtual memory
	area might prevent it from being merged with adjacent virtual
	memory areas due to the difference in that attribute's value.

	Currently, arg2 must be one of:

	PR_SET_VMA_ANON_NAME
		Set a name for anonymous virtual memory areas. arg5 should
		be a pointer to a null-terminated string containing the
		name. The name length including null byte cannot exceed
		80 bytes. If arg5 is NULL, the name of the appropriate
		anonymous virtual memory areas will be reset. The name
		can contain only printable ascii characters (including
                space), except '[',']','\','$' and '`'.

                This feature is available only if the kernel is built with
                the CONFIG_ANON_VMA_NAME option enabled.

[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
  Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
 added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
 work here was done by Colin Cross, therefore, with his permission, keeping
 him as the author]

Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:27 +02:00
..
ABI f2fs-for-5.16-rc1 2021-11-13 11:20:22 -08:00
PCI
RCU
accounting
admin-guide memcg: add per-memcg vmalloc stat 2022-01-15 16:30:27 +02:00
arm Documentation: arm: marvell: Fix link to armada_1000_pb.pdf document 2021-11-15 02:49:56 -07:00
arm64 arm64: update PAC description for kernel 2021-12-02 10:13:35 +00:00
block This is a relatively unexciting cycle for documentation. 2021-11-02 22:11:39 -07:00
bpf libbpf: update index.rst reference 2021-11-17 06:12:14 -07:00
cdrom
core-api Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
cpu-freq cpufreq: docs: Update core.rst 2021-12-01 20:02:11 +01:00
crypto crypto: engine - Add KPP Support to Crypto Engine 2021-10-29 21:04:03 +08:00
dev-tools Merge branch 'akpm' (patches from Andrew) 2021-11-09 10:11:53 -08:00
devicetree dt-bindings: spi: cadence-quadspi: document "intel,socfpga-qspi" 2021-12-27 04:20:05 -06:00
doc-guide docs: Update Sphinx requirements 2021-11-15 02:47:22 -07:00
driver-api cxl for v5.16 2021-11-08 11:49:48 -08:00
fault-injection
fb
features parisc: Move thread_info into task struct 2021-11-01 07:35:59 +01:00
filesystems mm: add a field to store names for private anonymous memory 2022-01-15 16:30:27 +02:00
firmware-guide Documentation: ACPI: Fix non-D0 probe _DSC object example 2021-11-10 13:59:12 +01:00
firmware_class
fpga
gpu drm-misc-next for 5.16: 2021-11-05 13:50:15 +10:00
hid
hwmon Driver core changes for 5.16-rc1 2021-11-04 08:32:38 -07:00
i2c Docs: Fixes link to I2C specification 2021-12-31 14:39:28 +01:00
ia64
ide
iio
infiniband
input
isdn
kbuild Kbuild updates for v5.16 2021-11-08 09:15:45 -08:00
kernel-hacking docs: futex: Fix kernel-doc references 2021-10-19 17:27:05 +02:00
leds leds: add new LED_FUNCTION_PLAYER for player LEDs for game controllers. 2021-10-27 09:49:29 +02:00
litmus-tests
livepatch
locking Documentation/locking/locktypes: Update migrate_disable() bits. 2021-11-30 15:40:31 +01:00
m68k
maintainer docs: use the lore redirector everywhere 2021-10-12 13:58:19 -06:00
mhi
mips
misc-devices
netlabel
networking Documentation: fix outdated interpretation of ip_no_pmtu_disc 2021-12-30 13:28:04 +00:00
nios2
nvdimm
openrisc
parisc
pcmcia
power Documentation: power: Describe 'advanced' and 'simple' EM models 2021-11-10 21:26:34 +01:00
powerpc
process Documentation: Add minimum pahole version 2021-11-29 14:48:00 -07:00
riscv
s390
scheduler
scsi
security net,lsm,selinux: revert the security_sctp_assoc_established() hook 2021-11-14 12:21:53 +00:00
sh
sound ALSA: hda/realtek: Add new alc285-hp-amp-init model 2021-12-14 10:44:26 +01:00
sparc
sphinx
sphinx-static
spi
staging
target
timers
trace docs: ftrace: fix the wrong path of tracefs 2021-11-15 02:50:39 -07:00
translations doc/zh_CN: fix a translation error in management-style 2021-11-15 02:53:30 -07:00
usb
userspace-api Char/Misc driver update for 5.16-rc1 2021-11-04 08:21:47 -07:00
virt Merge branch 'kvm-sev-move-context' into kvm-master 2021-11-11 11:02:58 -05:00
vm mm/debug_vm_pgtable: update comments regarding migration swap entries 2022-01-15 16:30:26 +02:00
w1
watchdog
x86 - Add the model number of a new, Raptor Lake CPU, to intel-family.h 2021-11-14 09:29:03 -08:00
xtensa
.gitignore
COPYING-logo
Changes
CodingStyle
Kconfig
Makefile
SubmittingPatches
arch.rst
asm-annotations.rst docs: use the lore redirector everywhere 2021-10-12 13:58:19 -06:00
atomic_bitops.txt
atomic_t.txt
conf.py docs: conf.py: fix support for Readthedocs v 1.0.0 2021-11-29 14:27:52 -07:00
docutils.conf
dontdiff
index.rst
logo.gif
memory-barriers.txt
watch_queue.rst