This commit changes truncate_inode_pages_range() so it can handle non
page aligned regions of the truncate. Currently we can hit BUG_ON when
the end of the range is not page aligned, but we can handle unaligned
start of the range.
Being able to handle non page aligned regions of the page can help file
system punch_hole implementations and save some work, because once we're
holding the page we might as well deal with it right away.
In previous commits we've changed ->invalidatepage() prototype to accept
'length' argument to be able to specify range to invalidate. No we can
use that new ability in truncate_inode_pages_range().
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.
Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).
This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.
We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.
Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Commit 8a965b3baa ("mm, slab_common: Fix bootstrap creation of kmalloc
caches") introduced a regression that caused us to crash early during
boot. The commit was introducing ordering of slab creation, making sure
two odd-sized slabs were created after specific powers of two sizes.
But, if any of the power of two slabs were created earlier during boot,
slabs at index 1 or 2 might not get created at all. This patch makes
sure none of the slabs get skipped.
Tony Lindgren bisected this down to the offending commit, which really
helped because bisect kept bringing me to almost but not quite this one.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Tony Lindgren <tony@atomide.com>
Acked-by: Soren Brinkmann <soren.brinkmann@xilinx.com>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
Bunch of performance improvements and cleanups Zach Brown and I have
been working on. The code should be pretty solid at this point, though
it could of course use more review and testing.
The results in my testing are pretty impressive, particularly when an
ioctx is being shared between multiple threads. In my crappy synthetic
benchmark, with 4 threads submitting and one thread reaping completions,
I saw overhead in the aio code go from ~50% (mostly ioctx lock
contention) to low single digits. Performance with ioctx per thread
improved too, but I'd have to rerun those benchmarks.
The reason I've been focused on performance when the ioctx is shared is
that for a fair number of real world completions, userspace needs the
completions aggregated somehow - in practice people just end up
implementing this aggregation in userspace today, but if it's done right
we can do it much more efficiently in the kernel.
Performance wise, the end result of this patch series is that submitting
a kiocb writes to _no_ shared cachelines - the penalty for sharing an
ioctx is gone there. There's still going to be some cacheline
contention when we deliver the completions to the aio ringbuffer (at
least if you have interrupts being delivered on multiple cores, which
for high end stuff you do) but I have a couple more patches not in this
series that implement coalescing for that (by taking advantage of
interrupt coalescing). With that, there's basically no bottlenecks or
performance issues to speak of in the aio code.
This patch:
use_mm() is used in more places than just aio. There's no need to mention
callers when describing the function.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current kernel returns -EINVAL unless a given mmap length is
"almost" hugepage aligned. This is because in sys_mmap_pgoff() the
given length is passed to vm_mmap_pgoff() as it is without being aligned
with hugepage boundary.
This is a regression introduced in commit 40716e2924 ("hugetlbfs: fix
alignment of huge page requests"), where alignment code is pushed into
hugetlb_file_setup() and the variable len in caller side is not changed.
To fix this, this patch partially reverts that commit, and adds
alignment code in caller side. And it also introduces hstate_sizelog()
in order to get proper hstate to specified hugepage size.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881
[akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: <iceman_dvd@yahoo.com>
Cc: Steven Truelove <steven.truelove@utoronto.ca>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This exports the amount of anonymous transparent hugepages for each
memcg via the new "rss_huge" stat in memory.stat. The units are in
bytes.
This is helpful to determine the hugepage utilization for individual
jobs on the system in comparison to rss and opportunities where
MADV_HUGEPAGE may be helpful.
The amount of anonymous transparent hugepages is also included in "rss"
for backwards compatibility.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull slab changes from Pekka Enberg:
"The bulk of the changes are more slab unification from Christoph.
There's also few fixes from Aaron, Glauber, and Joonsoo thrown into
the mix."
* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (24 commits)
mm, slab_common: Fix bootstrap creation of kmalloc caches
slab: Return NULL for oversized allocations
mm: slab: Verify the nodeid passed to ____cache_alloc_node
slub: tid must be retrieved from the percpu area of the current processor
slub: Do not dereference NULL pointer in node_match
slub: add 'likely' macro to inc_slabs_node()
slub: correct to calculate num of acquired objects in get_partial_node()
slub: correctly bootstrap boot caches
mm/sl[au]b: correct allocation type check in kmalloc_slab()
slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sections
slab: Handle ARCH_DMA_MINALIGN correctly
slab: Common definition for kmem_cache_node
slab: Rename list3/l3 to node
slab: Common Kmalloc cache determination
stat: Use size_t for sizes instead of unsigned
slab: Common function to create the kmalloc array
slab: Common definition for the array of kmalloc caches
slab: Common constants for kmalloc boundaries
slab: Rename nodelists to node
slab: Common name for the per node structures
...
For SLAB the kmalloc caches must be created in ascending sizes in order
for the OFF_SLAB sub-slab cache to work properly.
Create the non power of two caches immediately after the prior power of
two kmalloc cache. Do not create the non power of two caches before all
other caches.
Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Christoph Lamete <cl@linux.com>
Link: http://lkml.kernel.org/r/201305040348.CIF81716.OStQOHFJMFLOVF@I-love.SAKURA.ne.jp
Signed-off-by: Pekka Enberg <penberg@kernel.org>
The inline path seems to have changed the SLAB behavior for very large
kmalloc allocations with commit e3366016 ("slab: Use common
kmalloc_index/kmalloc_size functions"). This patch restores the old
behavior but also adds diagnostics so that we can figure where in the
code these large allocations occur.
Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Christoph Lameter <cl@linux.com>
Link: http://lkml.kernel.org/r/201305040348.CIF81716.OStQOHFJMFLOVF@I-love.SAKURA.ne.jp
[ penberg@kernel.org: use WARN_ON_ONCE ]
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions
If the nodeid is > num_online_nodes() this can cause an Oops and a
panic(). The purpose of this patch is to assert if this condition is
true to aid debugging efforts rather than some random NULL pointer
dereference or page fault.
This patch is in response to BZ#42967 [1]. Using VM_BUG_ON so it's used
only when CONFIG_DEBUG_VM is set, given that ____cache_alloc_node() is a
hot code path.
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=42967
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
cleancache_ops is used to decide whether backend is registered.
So now cleancache_enabled is always true if defined CONFIG_CLEANCACHE.
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andor Daam <andor.daam@googlemail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Florian Schmaus <fschmaus@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Stefan Hengelein <ilendir@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of using a backend_registered to determine whether a backend is
enabled. This allows us to remove the backend_register check and just
do 'if (cleancache_ops)'
[v1: Rebase on top of b97c4b430b0a (ramster->zcache move]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andor Daam <andor.daam@googlemail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Florian Schmaus <fschmaus@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Stefan Hengelein <ilendir@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
be built/loaded as modules rather than built-in and enabled by a boot
parameter, this patch provides "lazy initialization", allowing backends
to register to cleancache even after filesystems were mounted. Calls to
init_fs and init_shared_fs are remembered as fake poolids but no real
tmem_pools created. On backend registration the fake poolids are mapped
to real poolids and respective tmem_pools.
Signed-off-by: Stefan Hengelein <ilendir@googlemail.com>
Signed-off-by: Florian Schmaus <fschmaus@gmail.com>
Signed-off-by: Andor Daam <andor.daam@googlemail.com>
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Minor fixes: used #define for some values and bools]
[v2: Removed CLEANCACHE_HAS_LAZY_INIT]
[v3: Added more comments, added a lock for [shared_|]fs_poolid_map]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Frontswap initialization routine depends on swap_lock, which want to be
atomic about frontswap's first appearance. IOW, frontswap is not present
and will fail all calls OR frontswap is fully functional but if new
swap_info_struct isn't registered by enable_swap_info, swap subsystem
doesn't start I/O so there is no race between init procedure and page I/O
working on frontswap.
So let's remove unnecessary swap_lock dependency.
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
[v1: Rebased on my branch, reworked to work with backends loading late]
[v2: Added a check for !map]
[v3: Made the invalidate path follow the init path]
[v4: Address comments by Wanpeng Li <liwanp@linux.vnet.ibm.com>]
Signed-off-by: Konrad Rzeszutek Wilk <konrad@darnok.org>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andor Daam <andor.daam@googlemail.com>
Cc: Florian Schmaus <fschmaus@gmail.com>
Cc: Stefan Hengelein <ilendir@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After allowing tmem backends to build/run as modules, frontswap_enabled
always true if defined CONFIG_FRONTSWAP. But frontswap_test() depends on
whether backend is registered, mv it into frontswap.c using fronstswap_ops
to make the decision.
frontswap_set/clear are not used outside frontswap, so don't export them.
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andor Daam <andor.daam@googlemail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Florian Schmaus <fschmaus@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Stefan Hengelein <ilendir@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This simplifies the code in the frontswap - we can get rid of the
'backend_registered' test and instead check against frontswap_ops.
[v1: Rebase on top of 703ba7fe5e (ramster->zcache move]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andor Daam <andor.daam@googlemail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Florian Schmaus <fschmaus@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Stefan Hengelein <ilendir@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
be built/loaded as modules rather than built-in and enabled by a boot
parameter, this patch provides "lazy initialization", allowing backends
to register to frontswap even after swapon was run. Before a backend
registers all calls to init are recorded and the creation of tmem_pools
delayed until a backend registers or until a frontswap store is
attempted.
Signed-off-by: Stefan Hengelein <ilendir@googlemail.com>
Signed-off-by: Florian Schmaus <fschmaus@gmail.com>
Signed-off-by: Andor Daam <andor.daam@googlemail.com>
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Fixes per Seth Jennings suggestions]
[v2: Removed FRONTSWAP_HAS_.. ]
[v3: Fix up per Bob Liu <lliubbo@gmail.com> recommendations]
[v4: Fix up per Andrew's comments]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge second batch of fixes from Andrew Morton:
- various misc bits
- some printk updates
- a new "SRAM" driver.
- MAINTAINERS updates
- the backlight driver queue
- checkpatch updates
- a few init/ changes
- a huge number of drivers/rtc changes
- fatfs updates
- some lib/idr.c work
- some renaming of the random driver interfaces
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (285 commits)
net: rename random32 to prandom
net/core: remove duplicate statements by do-while loop
net/core: rename random32() to prandom_u32()
net/netfilter: rename random32() to prandom_u32()
net/sched: rename random32() to prandom_u32()
net/sunrpc: rename random32() to prandom_u32()
scsi: rename random32() to prandom_u32()
lguest: rename random32() to prandom_u32()
uwb: rename random32() to prandom_u32()
video/uvesafb: rename random32() to prandom_u32()
mmc: rename random32() to prandom_u32()
drbd: rename random32() to prandom_u32()
kernel/: rename random32() to prandom_u32()
mm/: rename random32() to prandom_u32()
lib/: rename random32() to prandom_u32()
x86: rename random32() to prandom_u32()
x86: pageattr-test: remove srandom32 call
uuid: use prandom_bytes()
raid6test: use prandom_bytes()
sctp: convert sctp_assoc_set_id() to use idr_alloc_cyclic()
...
Pull cgroup updates from Tejun Heo:
- Fixes and a lot of cleanups. Locking cleanup is finally complete.
cgroup_mutex is no longer exposed to individual controlelrs which
used to cause nasty deadlock issues. Li fixed and cleaned up quite a
bit including long standing ones like racy cgroup_path().
- device cgroup now supports proper hierarchy thanks to Aristeu.
- perf_event cgroup now supports proper hierarchy.
- A new mount option "__DEVEL__sane_behavior" is added. As indicated
by the name, this option is to be used for development only at this
point and generates a warning message when used. Unfortunately,
cgroup interface currently has too many brekages and inconsistencies
to implement a consistent and unified hierarchy on top. The new flag
is used to collect the behavior changes which are necessary to
implement consistent unified hierarchy. It's likely that this flag
won't be used verbatim when it becomes ready but will be enabled
implicitly along with unified hierarchy.
The option currently disables some of broken behaviors in cgroup core
and also .use_hierarchy switch in memcg (will be routed through -mm),
which can be used to make very unusual hierarchy where nesting is
partially honored. It will also be used to implement hierarchy
support for blk-throttle which would be impossible otherwise without
introducing a full separate set of control knobs.
This is essentially versioning of interface which isn't very nice but
at this point I can't see any other options which would allow keeping
the interface the same while moving towards hierarchy behavior which
is at least somewhat sane. The planned unified hierarchy is likely
to require some level of adaptation from userland anyway, so I think
it'd be best to take the chance and update the interface such that
it's supportable in the long term.
Maintaining the existing interface does complicate cgroup core but
shouldn't put too much strain on individual controllers and I think
it'd be manageable for the foreseeable future. Maybe we'll be able
to drop it in a decade.
Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by header the file changes) as per Tejun.
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
cpuset: fix compile warning when CONFIG_SMP=n
cpuset: fix cpu hotplug vs rebuild_sched_domains() race
cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
cgroup: restore the call to eventfd->poll()
cgroup: fix use-after-free when umounting cgroupfs
cgroup: fix broken file xattrs
devcg: remove parent_cgroup.
memcg: force use_hierarchy if sane_behavior
cgroup: remove cgrp->top_cgroup
cgroup: introduce sane_behavior mount option
move cgroupfs_root to include/linux/cgroup.h
cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
cgroup: make cgroup_path() not print double slashes
Revert "cgroup: remove bind() method from cgroup_subsys."
perf: make perf_event cgroup hierarchical
cgroup: implement cgroup_is_descendant()
cgroup: make sure parent won't be destroyed before its children
cgroup: remove bind() method from cgroup_subsys.
devcg: remove broken_hierarchy tag
cgroup: remove cgroup_lock_is_held()
...
Use preferable function name which implies using a pseudo-random
number generator.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memcg is not referenced, so it can be destroyed at anytime right
after we exit rcu read section, so it's not safe to access it.
To fix this, we call css_tryget() to get a reference while we're still
in rcu read section.
This also removes a bogus comment above __memcg_create_cache_enqueue().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are times when HIGHMEM is enabled, but we don't prefer
CONFIG_BOUNCE to be enabled. CONFIG_BOUNCE can reduce the block device
throughput, and this is not ideal for machines where we don't gain much
by enabling it. So provide an option to deselect CONFIG_BOUNCE. The
observation was made while measuring eMMC throughput using iozone on an
ARM device with 1GB RAM.
Signed-off-by: Vinayak Menon <vinayakm.list@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we do memset() before reserving the area. This may not cause
any problem, but it is somewhat weird. So change execution order.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove unused argument and make function static, because there is no user
outside of nobootmem.c
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PFN_PHYS() is a phys_addr_t, which can be u32 or u64.
Fix the build warning when phys_addr_t is u32.
mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 2 has type 'unsigned int' [-Wformat]: => 1685:3
mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]: => 1685:3
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As pointed out by Andrew Morton, the swap-over-NFS writeback is not
setting PageWriteback before it is queued for direct IO. While swap
pages do not participate in BDI or process dirty accounting and the IO
is synchronous, the writeback bit is still required and not setting it
in this case was an oversight. swapoff depends on the page writeback to
synchronoise all pending writes on a swap page before it is reused.
Swapcache freeing and reuse depend on checking the PageWriteback under
lock to ensure the page is safe to reuse.
Direct IO handlers and the direct IO handler for NFS do not deal with
PageWriteback as they are synchronous writes. In the case of NFS, it
schedules pages (or a page in the case of swap) for IO and then waits
synchronously for IO to complete in nfs_direct_write(). It is
recognised that this is a slowdown from normal swap handling which is
asynchronous and uses a completion handler. Shoving PageWriteback
handling down into direct IO handlers looks like a bad fit to handle the
swap case although it may have to be dealt with some day if swap is
converted to use direct IO in general and bmap is finally done away
with. At that point it will be necessary to refit asynchronous direct
IO with completion handlers onto the swap subsystem.
As swapcache currently depends on PageWriteback to protect against
races, this patch sets PageWriteback under the page lock before queueing
it for direct IO. It is cleared when the direct IO handler returns. IO
errors are treated similarly to the direct-to-bio case except PageError
is not set as in the case of swap-over-NFS, it is likely to be a
transient error.
It was asked what prevents such a page being reclaimed in parallel.
With this patch applied, such a page will now be skipped (most of the
time) or blocked until the writeback completes. Reclaim checks
PageWriteback under the page lock before calling try_to_free_swap and
the page lock should prevent the page being requeued for IO before it is
freed.
This and Jerome's related patch should considered for -stable as far
back as 3.6 when swap-over-NFS was introduced.
[akpm@linux-foundation.org: use pr_err_ratelimited()]
[akpm@linux-foundation.org: remove hopefully-unneeded cast in printk]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org> [3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 62c230bc17 ("mm: add support for a filesystem to activate
swap files and use direct_IO for writing swap pages"), swap_writepage()
calls direct_IO on swap files. However, in that case the page isn't
redirtied if I/O fails, and is therefore handled afterwards as if it has
been successfully written to the swap file, leading to memory corruption
when the page is eventually swapped back in.
This patch sets the page dirty when direct_IO() fails. It fixes a
memory corruption that happened while using swap-over-NFS.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org> [3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A memcg may livelock when oom if the process that grabs the hierarchy's
oom lock is never the first process with PF_EXITING set in the memcg's
task iteration.
The oom killer, both global and memcg, will defer if it finds an
eligible process that is in the process of exiting and it is not being
ptraced. The idea is to allow it to exit without using memory reserves
before needlessly killing another process.
This normally works fine except in the memcg case with a large number of
threads attached to the oom memcg. In this case, the memcg oom killer
only gets called for the process that grabs the hierarchy's oom lock;
all others end up blocked on the memcg's oom waitqueue. Thus, if the
process that grabs the hierarchy's oom lock is never the first
PF_EXITING process in the memcg's task iteration, the oom killer is
constantly deferred without anything making progress.
The fix is to give PF_EXITING processes access to memory reserves so
that we've marked them as oom killed without any iteration. This allows
__mem_cgroup_try_charge() to succeed so that the process may exit. This
makes the memcg oom killer exemption for TIF_MEMDIE tasks, now
immediately granted for processes with pending SIGKILLs and those in the
exit path, to be equivalent to what is done for the global oom killer.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current implementation of huge zero page uses pfn value 0 to indicate
that the page hasn't allocated yet. It assumes that buddy page
allocator can't return page with pfn == 0.
Let's rework the code to store 'struct page *' of huge zero page, not
its pfn. This way we can avoid the weak assumption.
[akpm@linux-foundation.org: fix sparse warning]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This might cause a use-after-free bug.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two convenient ways to report errors to userspace
1) retun error to original syscall for example write(2)
2) mark mapping with error flag and return it on later fsync(2)
Second one is broken if (mapping->nrpages == 0) This is real-life
situation because after error pages are likey to be truncated or
invalidated.
We have to return an error regardless to number of pages in the mapping.
#Original testcase: git@github.com:dmonakhov/xfstests.git
MOUNT_OPTIONS="-b1024"
./check shared/305
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no comment for parameter nid of memblock_insert_region().
This patch adds comment for it.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In page reclaim, huge page is split. split_huge_page() adds tail pages
to LRU list. Since we are reclaiming a huge page, it's better we
reclaim all subpages of the huge page instead of just the head page.
This patch adds split tail pages to shrink page list so the tail pages
can be reclaimed soon.
Before this patch, run a swap workload:
thp_fault_alloc 3492
thp_fault_fallback 608
thp_collapse_alloc 6
thp_collapse_alloc_failed 0
thp_split 916
With this patch:
thp_fault_alloc 4085
thp_fault_fallback 16
thp_collapse_alloc 90
thp_collapse_alloc_failed 0
thp_split 1272
fallback allocation is reduced a lot.
[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prevent flooding the swap device with writebacks, frontswap backends
need to count and limit the number of outstanding writebacks. The
incrementing of the counter can be done before the call to
__swap_writepage(). However, the caller must receive a notification
when the writeback completes in order to decrement the counter.
To achieve this functionality, this patch modifies __swap_writepage() to
take the bio completion callback function as an argument.
end_swap_bio_write(), the normal bio completion function, is also made
non-static so that code doing the accounting can call it after the
accounting is done.
There should be no behavioural change to existing code.
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swap_writepage() is currently where frontswap hooks into the swap write
path to capture pages with the frontswap_store() function. However, if
a frontswap backend wants to "resume" the writeback of a page to the
swap device, it can't call swap_writepage() as the page will simply
reenter the backend.
This patch separates swap_writepage() into a top and bottom half, the
bottom half named __swap_writepage() to allow a frontswap backend, like
zswap, to resume writeback beyond the frontswap_store() hook.
__add_to_swap_cache() is also made non-static so that the page for which
writeback is to be resumed can be added to the swap cache.
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a corner case for MAP_FIXED when requested mapping length is larger
than rlimit for virtual memory. In such case any overlapping mappings
are unmapped before we check for the limit and return ENOMEM.
The check is moved before the loop that unmaps overlapping parts of
existing mappings. When we are about to hit the limit (currently mapped
pages + len > limit) we scan for overlapping pages and check again
accounting for them.
This fixes situation when userspace program expects that the previous
mappings are preserved after the mmap() syscall has returned with error.
(POSIX clearly states that successfull mapping shall replace any
previous mappings.)
This corner case was found and can be tested with LTP testcase:
testcases/open_posix_testsuite/conformance/interfaces/mmap/24-2.c
In this case the mmap, which is clearly over current limit, unmaps
dynamic libraries and the testcase segfaults right after returning into
userspace.
I've also looked at the second instance of the unmapping loop in the
do_brk(). The do_brk() is called from brk() syscall and from vm_brk().
The brk() syscall checks for overlapping mappings and bails out when
there are any (so it can't be triggered from the brk syscall). The
vm_brk() is called only from binmft handlers so it shouldn't be
triggered unless binmft handler created overlapping mappings.
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications. The levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: for example you
have three cgroups: A->B->C. Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure. In
this situation, only group C will receive the notification, i.e. groups
A and B will not receive it. This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)
Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features. Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In madvise(), there doesn't seem to be any reason for taking the
¤t->mm->mmap_sem before start and len_in have been validated.
Incidentally, this removes the need for the out: label.
[akpm@linux-foundation.org: s/out_plug/out/, per David]
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__remove_pages() is only necessary for CONFIG_MEMORY_HOTREMOVE. PowerPC
pseries will return -EOPNOTSUPP if unsupported.
Adding an #ifdef causes several other functions it depends on to also
become unnecessary, which saves in .text when disabled (it's disabled in
most defconfigs besides powerpc, including x86). remove_memory_block()
becomes static since it is not referenced outside of
drivers/base/memory.c.
Build tested on x86 and powerpc with CONFIG_MEMORY_HOTREMOVE both enabled
and disabled.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change __remove_pages() to call release_mem_region_adjustable(). This
allows a requested memory range to be released from the iomem_resource
table even if it does not match exactly to an resource entry but still
fits into. The resource entries initialized at bootup usually cover the
whole contiguous memory ranges and may not necessarily match with the
size of memory hot-delete requests.
If release_mem_region_adjustable() failed, __remove_pages() emits a
warning message and continues to proceed as it was the case with
release_mem_region(). release_mem_region(), which is defined to
__release_region(), emits a warning message and returns no error since a
void function.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by : Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>