WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Mark Fasheh	6967614761	ocfs2: release page lock before calling ->page_mkwrite __do_fault() was calling ->page_mkwrite() with the page lock held, which violates the locking rules for that callback. Release and retake the page lock around the callback to avoid deadlocking file systems which manually take it. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:41 -07:00
Nick Piggin	54cb8821de	mm: merge populate and nopage into fault (fixes nonlinear) Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes the virtual address -> file offset differently from linear mappings. ->populate is a layering violation because the filesystem/pagecache code should need to know anything about the virtual memory mapping. The hitch here is that the ->nopage handler didn't pass down enough information (ie. pgoff). But it is more logical to pass pgoff rather than have the ->nopage function calculate it itself anyway (because that's a similar layering violation). Having the populate handler install the pte itself is likewise a nasty thing to be doing. This patch introduces a new fault handler that replaces ->nopage and ->populate and (later) ->nopfn. Most of the old mechanism is still in place so there is a lot of duplication and nice cleanups that can be removed if everyone switches over. The rationale for doing this in the first place is that nonlinear mappings are subject to the pagefault vs invalidate/truncate race too, and it seemed stupid to duplicate the synchronisation logic rather than just consolidate the two. After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in pagecache. Seems like a fringe functionality anyway. NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no users have hit mainline yet. [akpm@linux-foundation.org: cleanup] [randy.dunlap@oracle.com: doc. fixes for readahead] [akpm@linux-foundation.org: build fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:41 -07:00
Nick Piggin	d00806b183	mm: fix fault vs invalidate race for linear mappings Fix the race between invalidate_inode_pages and do_no_page. Andrea Arcangeli identified a subtle race between invalidation of pages from pagecache with userspace mappings, and do_no_page. The issue is that invalidation has to shoot down all mappings to the page, before it can be discarded from the pagecache. Between shooting down ptes to a particular page, and actually dropping the struct page from the pagecache, do_no_page from any process might fault on that page and establish a new mapping to the page just before it gets discarded from the pagecache. The most common case where such invalidation is used is in file truncation. This case was catered for by doing a sort of open-coded seqlock between the file's i_size, and its truncate_count. Truncation will decrease i_size, then increment truncate_count before unmapping userspace pages; do_no_page will read truncate_count, then find the page if it is within i_size, and then check truncate_count under the page table lock and back out and retry if it had subsequently been changed (ptl will serialise against unmapping, and ensure a potentially updated truncate_count is actually visible). Complexity and documentation issues aside, the locking protocol fails in the case where we would like to invalidate pagecache inside i_size. do_no_page can come in anytime and filemap_nopage is not aware of the invalidation in progress (as it is when it is outside i_size). The end result is that dangling (->mapping == NULL) pages that appear to be from a particular file may be mapped into userspace with nonsense data. Valid mappings to the same place will see a different page. Andrea implemented two working fixes, one using a real seqlock, another using a page->flags bit. He also proposed using the page lock in do_no_page, but that was initially considered too heavyweight. However, it is not a global or per-file lock, and the page cacheline is modified in do_no_page to increment _count and _mapcount anyway, so a further modification should not be a large performance hit. Scalability is not an issue. This patch implements this latter approach. ->nopage implementations return with the page locked if it is possible for their underlying file to be invalidated (in that case, they must set a special vm_flags bit to indicate so). do_no_page only unlocks the page after setting up the mapping completely. invalidation is excluded because it holds the page lock during invalidation of each page (and ensures that the page is not mapped while holding the lock). This also allows significant simplifications in do_no_page, because we have the page locked in the right place in the pagecache from the start. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:41 -07:00
Jeremy Fitzhardinge	5f4352fbff	Allocate and free vmalloc areas Allocate/release a chunk of vmalloc address space: alloc_vm_area reserves a chunk of address space, and makes sure all the pagetables are constructed for that address range - but no pages. free_vm_area releases the address space range. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Ian Pratt <ian.pratt@xensource.com> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: "Jan Beulich" <JBeulich@novell.com> Cc: "Andi Kleen" <ak@muc.de>	2007-07-18 08:47:41 -07:00
Jeremy Fitzhardinge	1e66df3ee3	add kstrndup Add a kstrndup function, modelled on strndup. Like strndup this returns a string copied into its own allocated memory, but it copies no more than the specified number of bytes from the source. Remove private strndup() from irda code. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@mandriva.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Panagiotis Issaris <takis@issaris.org> Cc: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>	2007-07-18 08:47:39 -07:00
Meelap Shah	c2f1a551de	knfsd: nfsd4: vary maximum delegation limit based on RAM size Our original NFSv4 delegation policy was to give out a read delegation on any open when it was possible to. Since the lifetime of a delegation isn't limited to that of an open, a client may quite reasonably hang on to a delegation as long as it has the inode cached. This becomes an obvious problem the first time a client's inode cache approaches the size of the server's total memory. Our first quick solution was to add a hard-coded limit. This patch makes a mild incremental improvement by varying that limit according to the server's total memory size, allowing at most 4 delegations per megabyte of RAM. My quick back-of-the-envelope calculation finds that in the worst case (where every delegation is for a different inode), a delegation could take about 1.5K, which would make the worst case usage about 6% of memory. The new limit works out to be about the same as the old on a 1-gig server. [akpm@linux-foundation.org: Don't needlessly bloat vmlinux] [akpm@linux-foundation.org: Make it right for highmem machines] Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:07 -07:00
Christoph Hellwig	a569425512	knfsd: exportfs: add exportfs.h header currently the export_operation structure and helpers related to it are in fs.h. fs.h is already far too large and there are very few places needing the export bits, so split them off into a separate header. [akpm@linux-foundation.org: fix cifs build] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:06 -07:00
Tejun Heo	9281acea6a	kallsyms: make KSYM_NAME_LEN include space for trailing '\0' KSYM_NAME_LEN is peculiar in that it does not include the space for the trailing '\0', forcing all users to use KSYM_NAME_LEN + 1 when allocating buffer. This is nonsense and error-prone. Moreover, when the caller forgets that it's very likely to subtly bite back by corrupting the stack because the last position of the buffer is always cleared to zero. This patch increments KSYM_NAME_LEN by one and updates code accordingly. * off-by-one bug in asm-powerpc/kprobes.h::kprobe_lookup_name() macro is fixed. * Where MODULE_NAME_LEN and KSYM_NAME_LEN were used together, MODULE_NAME_LEN was treated as if it didn't include space for the trailing '\0'. Fix it. Signed-off-by: Tejun Heo <htejun@gmail.com> Acked-by: Paulo Marques <pmarques@grupopie.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Christoph Lameter	2a7326b5bb	CONFIG_BOUNCE to avoid useless inclusion of bounce buffer logic The bounce buffer logic is included on systems that do not need it. If a system does not have zones like ZONE_DMA and ZONE_HIGHMEM that can lead to the use of bounce buffers then there is no need to reserve memory pools etc etc. This is true f.e. for SGI Altix. Also nicifies the Makefile and gets rid of the tricky "and" there. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Rafael J. Wysocki	8314418629	Freezer: make kernel threads nonfreezable by default Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Nick Piggin	787d2214c1	fs: introduce some page/buffer invariants It is a bug to set a page dirty if it is not uptodate unless it has buffers. If the page has buffers, then the page may be dirty (some buffers dirty) but not uptodate (some buffers not uptodate). The exception to this rule is if the set_page_dirty caller is racing with truncate or invalidate. A buffer can not be set dirty if it is not uptodate. If either of these situations occurs, it indicates there could be some data loss problem. Some of these warnings could be a harmless one where the page or buffer is set uptodate immediately after it is dirtied, however we should fix those up, and enforce this ordering. Bring the order of operations for truncate into line with those of invalidate. This will prevent a page from being able to go !uptodate while we're holding the tree_lock, which is probably a good thing anyway. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Robert P. J. Day	a1ed3dda0a	MM: Make needlessly global hugetlb_no_page() static. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Christoph Lameter	8ab1372fac	SLUB: Fix CONFIG_SLUB_DEBUG use for CONFIG_NUMA We currently cannot disable CONFIG_SLUB_DEBUG for CONFIG_NUMA. Now that embedded systems start to use NUMA we may need this. Put an #ifdef around places where NUMA only code uses fields only valid for CONFIG_SLUB_DEBUG. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Christoph Lameter	a0e1d1be20	SLUB: Move sysfs operations outside of slub_lock Sysfs can do a gazillion things when called. Make sure that we do not call any sysfs functions while holding the slub_lock. Just protect the essentials: 1. The list of all slab caches 2. The kmalloc_dma array 3. The ref counters of the slabs. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Christoph Lameter	434e245ddd	SLUB: Do not allocate object bit array on stack The objects per slab increase with the current patches in mm since we allow up to order 3 allocs by default. More patches in mm actually allow to use 2M or higher sized slabs. For slab validation we need per object bitmaps in order to check a slab. We end up with up to 64k objects per slab resulting in a potential requirement of 8K stack space. That does not look good. Allocate the bit arrays via kmalloc. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Christoph Lameter	94f6030ca7	Slab allocators: Replace explicit zeroing with __GFP_ZERO kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing variant in the past. But with __GFP_ZERO it is possible now to do zeroing while allocating. Use __GFP_ZERO to remove the explicit clearing of memory via memset whereever we can. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Christoph Lameter	81cda66261	Slab allocators: Cleanup zeroing allocations It becomes now easy to support the zeroing allocs with generic inline functions in slab.h. Provide inline definitions to allow the continued use of kzalloc, kmem_cache_zalloc etc but remove other definitions of zeroing functions from the slab allocators and util.c. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	ce15fea827	SLUB: Do not use length parameter in slab_alloc() We can get to the length of the object through the kmem_cache_structure. The additional parameter does no good and causes the compiler to generate bad code. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	12ad6843dd	SLUB: Style fix up the loop to disable small slabs Do proper spacing and we only need to do this in steps of 8. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Adrian Bunk	5af328a510	mm/slub.c: make code static Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	7b55f620e6	SLUB: Simplify dma index -> size calculation There is no need to caculate the dma slab size ourselves. We can simply lookup the size of the corresponding non dma slab. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	f1b2633936	SLUB: faster more efficient slab determination for __kmalloc kmalloc_index is a long series of comparisons. The attempt to replace kmalloc_index with something more efficient like ilog2 failed due to compiler issues with constant folding on gcc 3.3 / powerpc. kmalloc_index()'es long list of comparisons works fine for constant folding since all the comparisons are optimized away. However, SLUB also uses kmalloc_index to determine the slab to use for the __kmalloc_xxx functions. This leads to a large set of comparisons in get_slab(). The patch here allows to get rid of that list of comparisons in get_slab(): 1. If the requested size is larger than 192 then we can simply use fls to determine the slab index since all larger slabs are of the power of two type. 2. If the requested size is smaller then we cannot use fls since there are non power of two caches to be considered. However, the sizes are in a managable range. So we divide the size by 8. Then we have only 24 possibilities left and then we simply look up the kmalloc index in a table. Code size of slub.o decreases by more than 200 bytes through this patch. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	dfce8648d6	SLUB: do proper locking during dma slab creation We modify the kmalloc_cache_dma[] array without proper locking. Do the proper locking and undo the dma cache creation if another processor has already created it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	2e443fd003	SLUB: extract dma_kmalloc_cache from get_cache. The rarely used dma functionality in get_slab() makes the function too complex. The compiler begins to spill variables from the working set onto the stack. The created function is only used in extremely rare cases so make sure that the compiler does not decide on its own to merge it back into get_slab(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	0c71001320	SLUB: add some more inlines and #ifdef CONFIG_SLUB_DEBUG Add #ifdefs around data structures only needed if debugging is compiled into SLUB. Add inlines to small functions to reduce code size. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	d07dbea464	Slab allocators: support __GFP_ZERO in all allocators A kernel convention for many allocators is that if __GFP_ZERO is passed to an allocator then the allocated memory should be zeroed. This is currently not supported by the slab allocators. The inconsistency makes it difficult to implement in derived allocators such as in the uncached allocator and the pool allocators. In addition the support zeroed allocations in the slab allocators does not have a consistent API. There are no zeroing allocator functions for NUMA node placement (kmalloc_node, kmem_cache_alloc_node). The zeroing allocations are only provided for default allocs (kzalloc, kmem_cache_zalloc_node). __GFP_ZERO will make zeroing universally available and does not require any addititional functions. So add the necessary logic to all slab allocators to support __GFP_ZERO. The code is added to the hot path. The gfp flags are on the stack and so the cacheline is readily available for checking if we want a zeroed object. Zeroing while allocating is now a frequent operation and we seem to be gradually approaching a 1-1 parity between zeroing and not zeroing allocs. The current tree has 3476 uses of kmalloc vs 2731 uses of kzalloc. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	6cb8f91320	Slab allocators: consistent ZERO_SIZE_PTR support and NULL result semantics Define ZERO_OR_NULL_PTR macro to be able to remove the checks from the allocators. Move ZERO_SIZE_PTR related stuff into slab.h. Make ZERO_SIZE_PTR work for all slab allocators and get rid of the WARN_ON_ONCE(size == 0) that is still remaining in SLAB. Make slub return NULL like the other allocators if a too large memory segment is requested via __kmalloc. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	ef2ad80c7d	Slab allocators: consolidate code for krealloc in mm/util.c The size of a kmalloc object is readily available via ksize(). ksize is provided by all allocators and thus we can implement krealloc in a generic way. Implement krealloc in mm/util.c and drop slab specific implementations of krealloc. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	d45f39cb06	SLUB Debug: fix initial object debug state of NUMA bootstrap objects The function we are calling to initialize object debug state during early NUMA bootstrap sets up an inactive object giving it the wrong redzone signature. The bootstrap nodes are active objects and should have active redzone signatures. Currently slab validation complains and reverts the object to active state. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	6300ea7503	SLUB: ensure that the number of objects per slab stays low for high orders Currently SLUB has no provision to deal with too high page orders that may be specified on the kernel boot line. If an order higher than 6 (on a 4k platform) is generated then we will BUG() because slabs get more than 65535 objects. Add some logic that decreases order for slabs that have too many objects. This allow booting with slab sizes up to MAX_ORDER. For example slub_min_order=10 will boot with a default slab size of 4M and reduce slab sizes for small object sizes to lower orders if the number of objects becomes too big. Large slab sizes like that allow a concentration of objects of the same slab cache under as few as possible TLB entries and thus potentially reduces TLB pressure. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	68dff6a9af	SLUB slab validation: Move tracking information alloc outside of lock We currently have to do an GFP_ATOMIC allocation because the list_lock is already taken when we first allocate memory for tracking allocation information. It would be better if we could avoid atomic allocations. Allocate a size of the tracking table that is usually sufficient (one page) before we take the list lock. We will then only do the atomic allocation if we need to resize the table to become larger than a page (mostly only needed under large NUMA because of the tracking of cpus and nodes otherwise the table stays small). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	5b95a4acf1	SLUB: use list_for_each_entry for loops over all slabs Use list_for_each_entry() instead of list_for_each(). Get rid of for_all_slabs(). It had only one user. So fold it into the callback. This also gets rid of cpu_slab_flush. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Christoph Lameter	2492268472	SLUB: change error reporting format to follow lockdep loosely Changes the error reporting format to loosely follow lockdep. If data corruption is detected then we generate the following lines: ============================================ BUG <slab-cache>: <problem> -------------------------------------------- INFO: <more information> [possibly multiple times] <object dump> FIX <slab-cache>: <remedial action> This also adds some more intelligence to the data corruption detection. Its now capable of figuring out the start and end. Add a comment on how to configure SLUB so that a production system may continue to operate even though occasional slab corruption occur through a misbehaving kernel component. See "Emergency operations" in Documentation/vm/slub.txt. [akpm@linux-foundation.org: build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:01 -07:00
Rusty Russell	8e1f936b73	mm: clean up and kernelify shrinker registration I can never remember what the function to register to receive VM pressure is called. I have to trace down from __alloc_pages() to find it. It's called "set_shrinker()", and it needs Your Help. 1) Don't hide struct shrinker. It contains no magic. 2) Don't allocate "struct shrinker". It's not helpful. 3) Call them "register_shrinker" and "unregister_shrinker". 4) Call the function "shrink" not "shrinker". 5) Reduce the 17 lines of waffly comments to 13, but document it properly. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: David Chinner <dgc@sgi.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:00 -07:00
Andy Whitcroft	5ad333eb66	Lumpy Reclaim V4 When we are out of memory of a suitable size we enter reclaim. The current reclaim algorithm targets pages in LRU order, which is great for fairness at order-0 but highly unsuitable if you desire pages at higher orders. To get pages of higher order we must shoot down a very high proportion of memory; >95% in a lot of cases. This patch set adds a lumpy reclaim algorithm to the allocator. It targets groups of pages at the specified order anchored at the end of the active and inactive lists. This encourages groups of pages at the requested orders to move from active to inactive, and active to free lists. This behaviour is only triggered out of direct reclaim when higher order pages have been requested. This patch set is particularly effective when utilised with an anti-fragmentation scheme which groups pages of similar reclaimability together. This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms the foundation. Credit to Mel Gorman for sanitity checking. Mel said: The patches have an application with hugepage pool resizing. When lumpy-reclaim is used used with ZONE_MOVABLE, the hugepages pool can be resized with greater reliability. Testing on a desktop machine with 2GB of RAM showed that growing the hugepage pool with ZONE_MOVABLE on it's own was very slow as the success rate was quite low. Without lumpy-reclaim, each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages. With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical. [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup] [bunk@stusta.de: static declarations for internal functions] [a.p.zijlstra@chello.nl: initial lumpy V2 implementation] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Mel Gorman	7e63efef85	Add a movablecore= parameter for sizing ZONE_MOVABLE This patch adds a new parameter for sizing ZONE_MOVABLE called movablecore=. While kernelcore= is used to specify the minimum amount of memory that must be available for all allocation types, movablecore= is used to specify the minimum amount of memory that is used for migratable allocations. The amount of memory used for migratable allocations determines how large the huge page pool could be dynamically resized to at runtime for example. How movablecore is actually handled is that the total number of pages in the system is calculated and a value is set for kernelcore that is kernelcore == totalpages - movablecore Both kernelcore= and movablecore= can be safely specified at the same time. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Mel Gorman	ed7ed36517	handle kernelcore=: generic This patch adds the kernelcore= parameter for x86. Once all patches are applied, a new command-line parameter exist and a new sysctl. This patch adds the necessary documentation. From: Yasunori Goto <y-goto@jp.fujitsu.com> When "kernelcore" boot option is specified, kernel can't boot up on ia64 because of an infinite loop. In addition, the parsing code can be handled in an architecture-independent manner. This patch uses common code to handle the kernelcore= parameter. It is only available to architectures that support arch-independent zone-sizing (i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will ignore the boot parameter. [bunk@stusta.de: make cmdline_parse_kernelcore() static] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Mel Gorman	396faf0303	Allow huge page allocations to use GFP_HIGH_MOVABLE Huge pages are not movable so are not allocated from ZONE_MOVABLE. However, as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it can be used to satisfy hugepage allocations even when the system has been running a long time. This allows an administrator to resize the hugepage pool at runtime depending on the size of ZONE_MOVABLE. This patch adds a new sysctl called hugepages_treat_as_movable. When a non-zero value is written to it, future allocations for the huge page pool will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. [akpm@linux-foundation.org: various fixes] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Mel Gorman	2a1e274acf	Create the ZONE_MOVABLE zone The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE that is only usable by allocations that specify both __GFP_HIGHMEM and __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a single memory partition while allowing movable allocations to be satisfied from either partition. The patches may be applied with the list-based anti-fragmentation patches that groups pages together based on mobility. The size of the zone is determined by a kernelcore= parameter specified at boot-time. This specifies how much memory is usable by non-movable allocations and the remainder is used for ZONE_MOVABLE. Any range of pages within ZONE_MOVABLE can be released by migrating the pages or by reclaiming. When selecting a zone to take pages from for ZONE_MOVABLE, there are two things to consider. First, only memory from the highest populated zone is used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second, the amount of memory usable by the kernel will be spread evenly throughout NUMA nodes where possible. If the nodes are not of equal size, the amount of memory usable by the kernel on some nodes may be greater than others. By default, the zone is not as useful for hugetlb allocations because they are pinned and non-migratable (currently at least). A sysctl is provided that allows huge pages to be allocated from that zone. This means that the huge page pool can be resized to the size of ZONE_MOVABLE during the lifetime of the system assuming that pages are not mlocked. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. Credit goes to Andy Whitcroft for catching a large variety of problems during review of the patches. This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added memory continues to be placed in their existing destination as there is no mechanism to redirect them to a specific zone. [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code] [akpm@linux-foundation.org: various fixes] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Mel Gorman	769848c038	Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated It is often known at allocation time whether a page may be migrated or not. This patch adds a flag called __GFP_MOVABLE and a new mask called GFP_HIGH_MOVABLE. Allocations using the __GFP_MOVABLE can be either migrated using the page migration mechanism or reclaimed by syncing with backing storage and discarding. An API function very similar to alloc_zeroed_user_highpage() is added for __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The flags used by alloc_zeroed_user_highpage() are not changed because it would change the semantics of an existing API. After this patch is applied there are no in-kernel users of alloc_zeroed_user_highpage() so it probably should be marked deprecated if this patch is merged. Note that this patch includes a minor cleanup to the use of __GFP_ZERO in shmem.c to keep all flag modifications to inode->mapping in the shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of Hugh Dickens. Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the concept. Credit to Hugh Dickens for catching issues with shmem swap vector and ramfs allocations. [akpm@linux-foundation.org: build fix] [hugh@veritas.com: __GFP_ZERO cleanup] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
NeilBrown	a32ea1e1f9	Fix read/truncate race do_generic_mapping_read currently samples the i_size at the start and doesn't do so again unless it needs to call ->readpage to load a page. After ->readpage it has to re-sample i_size as a truncate may have caused that page to be filled with zeros, and the read() call should not see these. However there are other activities that might cause ->readpage to be called on a page between the time that do_generic_mapping_read samples i_size and when it finds that it has an uptodate page. These include at least read-ahead and possibly another thread performing a read. So do_generic_mapping_read must sample i_size after it has an uptodate page. Thus the current sampling at the start and after a read can be replaced with a sampling before the copy-out. The same change applied to __generic_file_splice_read. Note that this fixes any race with truncate_complete_page, but does not fix a possible race with truncate_partial_page. If a partial truncate happens after do_generic_mapping_read samples i_size and before the copy_out, the nuls that truncate_partial_page place in the page could be copied out incorrectly. I think the best fix for that is to not zero out parts of the page in truncate_partial_page, but rather to zero out the tail of a page when increasing i_size. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:22:59 -07:00
Linus Torvalds	489de30259	Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (209 commits) [POWERPC] Create add_rtc() function to enable the RTC CMOS driver [POWERPC] Add H_ILLAN_ATTRIBUTES hcall number [POWERPC] xilinxfb: Parameterize xilinxfb platform device registration [POWERPC] Oprofile support for Power 5++ [POWERPC] Enable arbitary speed tty ioctls and split input/output speed [POWERPC] Make drivers/char/hvc_console.c:khvcd() static [POWERPC] Remove dead code for preventing pread() and pwrite() calls [POWERPC] Remove unnecessary #undef printk from prom.c [POWERPC] Fix typo in Ebony default DTS [POWERPC] Check for NULL ppc_md.init_IRQ() before calling [POWERPC] Remove extra return statement [POWERPC] pasemi: Don't auto-select CONFIG_EMBEDDED [POWERPC] pasemi: Rename platform [POWERPC] arch/powerpc/kernel/sysfs.c: Move NUMA exports [POWERPC] Add __read_mostly support for powerpc [POWERPC] Modify sched_clock() to make CONFIG_PRINTK_TIME more sane [POWERPC] Create a dummy zImage if no valid platform has been selected [POWERPC] PS3: Bootwrapper support. [POWERPC] powermac i2c: Use mutex [POWERPC] Schedule removal of arch/ppc ... Fixed up conflicts manually in: Documentation/feature-removal-schedule.txt arch/powerpc/kernel/pci_32.c arch/powerpc/kernel/pci_64.c include/asm-powerpc/pci.h and asked the powerpc people to double-check the result..	2007-07-16 17:58:08 -07:00
Linus Torvalds	b91cba52e9	Merge master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6 * master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6: (68 commits) sh: sh-rtc support for SH7709. sh: Revert __xdiv64_32 size change. sh: Update r7785rp defconfig. sh: Export div symbols for GCC 4.2 and ST GCC. sh: fix race in parallel out-of-tree build sh: Kill off dead mach.c for hp6xx. sh: hd64461.h cleanup and added comments. sh: Update the alignment when 4K stacks are used. sh: Add a .bss.page_aligned section for 4K stacks. sh: Don't let SH-4A clobber SH-4 CFLAGS. sh: Add parport stub for SuperIO ports. sh: Drop -Wa,-dsp for DSP tuning. sh: Update dreamcast defconfig. fb: pvr2fb: A few more __devinit annotations for PCI. fb: pvr2fb: Fix up section mismatch warnings. sh: Select IPR-IRQ for SH7091. sh: Correct __xdiv64_32/div64_32 return value size. sh: Fix timer-tmu build for SH-3. sh: Add cpu and mach links to CLEAN_FILES. sh: Preliminary support for the SH-X3 CPU. ...	2007-07-16 10:32:02 -07:00
Rusty Russell	c80e7a826c	permit mempool_free(NULL) Christian Borntraeger points out that mempool_free() doesn't noop when handed NULL. This is inconsistent with the other free-like functions in the kernel. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:52 -07:00
Adrian Bunk	8f8a68ee48	remove mm/backing-dev.c:congestion_wait_interruptible() congestion_wait_interruptible() is no longer used. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:52 -07:00
Andrew Morton	3e733f071e	dirty_writeback_centisecs_handler() cleanup Repair indenting bustage. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:47 -07:00
Akinobu Mita	54114994f4	fault-injection: add min-order parameter to fail_page_alloc Limiting smaller allocation failures by fault injection helps to find real possible bugs. Because higher order allocations are likely to fail and zero-order allocations are not likely to fail. This patch adds min-order parameter to fail_page_alloc. It specifies the minimum page allocation order to be injected failures. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:45 -07:00
Micah Cowan	17973f5af7	Only send SIGXFSZ when exceeding rlimits. Some users have been having problems with utilities like cp or dd dumping core when they try to copy a file that's too large for the destination filesystem (typically, > 4gb). Apparently, some defunct standards required SIGXFSZ to be sent in such circumstances, but SUS only requires/allows it for when a written file exceeds the process's resource limits. I'd like to limit SIGXFSZs to the bare minimum required by SUS. Patch sent per http://lkml.org/lkml/2007/4/10/302 Signed-off-by: Micah Cowan <micahcowan@ubuntu.com> Acked-by: Alan Cox <alan@redhat.com> Cc: <reiserfs-dev@namesys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:43 -07:00
Stephen Rothwell	f057eac0d7	Introduce CONFIG_VIRT_TO_BUS Make some offending drivers depend on it and set CONFIG_ARCH_NO_VIRT_TO_BUS for ppc64 so that we don't build those drivers. This gets PowerPC allmodconfig and allyesconfig much closer to building. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@ftp.linux.org.uk> Acked-by: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:42 -07:00
Greg Ungerer	57c8f63e8e	nommu: stub expand_stack() for nommu case Be consistent with VM mmap, implement expand_stack(). We can't actually do anything other than return an error in the no MMU case though. Signed-off-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:37 -07:00
Miklos Szeredi	0165ab4435	split mmap This is a straightforward split of do_mmap_pgoff() into two functions: - do_mmap_pgoff() checks the parameters, and calculates the vma flags. Then it calls - mmap_region(), which does the actual mapping Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:37 -07:00
akpm@linux-foundation.org	c44939ecb6	NeilBrown <neilb@suse.de> The do_loop_readv_writev implementation of readv breaks out of the loop as soon as a single read request didn't fill it's buffer: if (nr != len) break; The generic_file_aio_read version doesn't. So if it hits EOF before the end of the list of buffers, it will try again on the next buffer. If the file was extended in the mean time, this will produce a bad result. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:37 -07:00
Herbert van den Bergh	5ed44a401d	do not limit locked memory when RLIMIT_MEMLOCK is RLIM_INFINITY Fix a bug in mm/mlock.c on 32-bit architectures that prevents a user from locking more than 4GB of shared memory, or allocating more than 4GB of shared memory in hugepages, when rlim[RLIMIT_MEMLOCK] is set to RLIM_INFINITY. Signed-off-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Acked-by: Chris Mason <chris.mason@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:37 -07:00
Paul Mundt	84a01c2f8e	slob: sparsemem support Currently slob is disabled if we're using sparsemem, due to an earlier patch from Goto-san. Slob and static sparsemem work without any trouble as it is, and the only hiccup is a missing slab_is_available() in the case of sparsemem extreme. With this, we're rid of the last set of restrictions for slob usage. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Matt Mackall <mpm@selenic.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Dan Aloni	b49ad484c5	mm/page_alloc.c: lower printk severity Signed-off-by: Dan Aloni <da-x@monatomic.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Paul Mundt	6193a2ff18	slob: initial NUMA support This adds preliminary NUMA support to SLOB, primarily aimed at systems with small nodes (tested all the way down to a 128kB SRAM block), whether asymmetric or otherwise. We follow the same conventions as SLAB/SLUB, preferring current node placement for new pages, or with explicit placement, if a node has been specified. Presently on UP NUMA this has the side-effect of preferring node#0 allocations (since numa_node_id() == 0, though this could be reworked if we could hand off a pfn to determine node placement), so single-CPU NUMA systems will want to place smaller nodes further out in terms of node id. Once a page has been bound to a node (via explicit node id typing), we only do block allocations from partial free pages that have a matching node id in the page flags. The current implementation does have some scalability problems, in that all partial free pages are tracked in the global freelist (with contention due to the single spinlock). However, these are things that are being reworked for SMP scalability first, while things like per-node freelists can easily be built on top of this sort of functionality once it's been added. More background can be found in: http://marc.info/?l=linux-mm&m=118117916022379&w=2 http://marc.info/?l=linux-mm&m=118170446306199&w=2 http://marc.info/?l=linux-mm&m=118187859420048&w=2 and subsequent threads. Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Jason Baron	f797779324	speed up madvise_need_mmap_write() usage In the new madvise_need_mmap_write() call we can avoid an extra case statement and function call as follows. Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Adrian Bunk	897e679b17	mm/slab.c: start_cpu_timer() should be __cpuinit start_cpu_timer() should be __cpuinit (which also matches what it's callers are). __devinit didn't cause problems, it simply wasted a few bytes of memory for the common CONFIG_HOTPLUG_CPU=n case. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Paul Mundt	6ea6e6887d	mm: more __meminit annotations Currently zone_spanned_pages_in_node() and zone_absent_pages_in_node() are non-static for ARCH_POPULATES_NODE_MAP and static otherwise. However, only the non-static versions are __meminit annotated, despite only being called from __meminit functions in either case. zone_init_free_lists() is currently non-static and not __meminit annotated either, despite only being called once in the entire tree by init_currently_empty_zone(), which too is __meminit. So make it static and properly annotated. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Jan Beulich	8f0accc862	kill vmalloc_earlyreserve This symbol got orphaned quite a while ago. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Jan Beulich	98011f569e	mm: fix improper .init-type section references .. which modpost started warning about. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Paul Mundt	140d5a4904	numa: mempolicy: trivial debug fixes. Enabling debugging fails to build due to the nodemask variable in do_mbind() having changed names, and then oopses on boot due to the assumption that the nodemask can be dereferenced -- which doesn't work out so well when the policy is changed to MPOL_DEFAULT with a NULL nodemask by numa_default_policy(). This fixes it up, and switches from PDprintk() to pr_debug() while we're at it. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Ethan Solomita	462e00cc71	oom: stop allocating user memory if TIF_MEMDIE is set get_user_pages() can try to allocate a nearly unlimited amount of memory on behalf of a user process, even if that process has been OOM killed. The OOM kill occurs upon return to user space via a SIGKILL, but get_user_pages() will try allocate all its memory before returning. Change get_user_pages() to check for TIF_MEMDIE, and if set then return immediately. Signed-off-by: Ethan Solomita <solo@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Paul Mundt	b71636e298	numa: mempolicy: dynamic interleave map for system init This converts the default system init memory policy to use a dynamically created node map instead of defaulting to all online nodes. Nodes of a certain size (>= 16MB) are judged to be suitable for interleave, and are added to the map. If all nodes are smaller in size, the largest one is automatically selected. Without this, tiny nodes find themselves out of memory before we even make it to userspace. Systems with large nodes will notice no change. Only the system init policy is effected by this change, the regular MPOL_DEFAULT policy is still switched to later on in the boot process as normal. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Christoph Lameter	f0630fff54	SLUB: support slub_debug on by default Add a new configuration variable CONFIG_SLUB_DEBUG_ON If set then the kernel will be booted by default with slab debugging switched on. Similar to CONFIG_SLAB_DEBUG. By default slab debugging is available but must be enabled by specifying "slub_debug" as a kernel parameter. Also add support to switch off slab debugging for a kernel that was built with CONFIG_SLUB_DEBUG_ON. This works by specifying slub_debug=- as a kernel parameter. Dave Jones wanted this feature. http://marc.info/?l=linux-kernel&m=118072189913045&w=2 [akpm@linux-foundation.org: clean up switch statement] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Andrew Morton	fc9a07e7bf	invalidate_mapping_pages(): add cond_resched invalidate_mapping_pages() can sometimes take a long time (millions of pages to free). Long enough for the softlockup detector to trigger. We used to have a cond_resched() in there but I took it out because the drop_caches code calls invalidate_mapping_pages() under inode_lock. The patch adds a nasty flag and puts the cond_resched() back. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:36 -07:00
Nick Piggin	45426812d6	mm: debug check for the fault vs invalidate race Add a bugcheck for Andrea's pagefault vs invalidate race. This is triggerable for both linear and nonlinear pages with a userspace test harness (using direct IO and truncate, respectively). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Joe Jin	f96efd585b	hugetlb: fix race in alloc_fresh_huge_page() That static `nid' index needs locking. Without it we can end up calling alloc_pages_node() with an illegal node ID and the kernel crashes. Acked-by: gurudas pai <gurudas.pai@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Anderson Briglia	2706a1b89b	vmscan: fix comments related to shrink_list() Fix the shrink_list name on some files under mm/ directory. Signed-off-by: Anderson Briglia <anderson.briglia@indt.org.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Nick Piggin	553948491c	slob: improved alignment handling Remove the core slob allocator's minimum alignment restrictions, and instead introduce the alignment restrictions at the slab API layer. This lets us heed the ARCH_KMALLOC/SLAB_MINALIGN directives, and also use __alignof__ (unsigned long) for the default alignment (which should allow relaxed alignment architectures to take better advantage of SLOB's small minimum alignment). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Nick Piggin	d87a133fc2	slob: remove bigblock tracking Remove the bigblock lists in favour of using compound pages and going directly to the page allocator. Allocation size is stored in page->private, which also makes ksize more accurate than it previously was. Saves ~.5K of code, and 12-24 bytes overhead per >= PAGE_SIZE allocation. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Nick Piggin	95b35127f1	slob: rework freelist handling Improve slob by turning the freelist into a list of pages using struct page fields, then each page has a singly linked freelist of slob blocks via a pointer in the struct page. - The first benefit is that the slob freelists can be indexed by a smaller type (2 bytes, if the PAGE_SIZE is reasonable). - Next is that freeing is much quicker because it does not have to traverse the entire freelist. Allocation can be slightly faster too, because we can skip almost-full freelist pages completely. - Slob pages are then freed immediately when they become empty, rather than having a periodic timer try to free them. This gives efficiency and memory consumption improvement. Then, we don't encode seperate size and next fields into each slob block, rather we use the sign bit to distinguish between "size" or "next". Then size 1 blocks contain a "next" offset, and others contain the "size" in the first unit and "next" in the second unit. - This allows minimum slob allocation alignment to go from 8 bytes to 2 bytes on 32-bit and 12 bytes to 2 bytes on 64-bit. In practice, it is best to align them to word size, however some architectures (eg. cris) could gain space savings from turning off this extra alignment. Then, make kmalloc use its own slob_block at the front of the allocation in order to encode allocation size, rather than rely on not overwriting slob's existing header block. - This reduces kmalloc allocation overhead similarly to alignment reductions. - Decouples kmalloc layer from the slob allocator. Then, add a page flag specific to slob pages. - This means kfree of a page aligned slob block doesn't have to traverse the bigblock list. I would get benchmarks, but my test box's network doesn't come up with slob before this patch. I think something is timing out. Anyway, things are faster after the patch. Code size goes up about 1K, however dynamic memory usage _should_ be lower even on relatively small memory systems. Future todo item is to restore the cyclic free list search, rather than to always begin at the start. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Eric Dumazet	1037b83bd0	MM: alloc_large_system_hash() can free some memory for non power-of-two bucketsize alloc_large_system_hash() is called at boot time to allocate space for several large hash tables. Lately, TCP hash table was changed and its bucketsize is not a power-of-two anymore. On most setups, alloc_large_system_hash() allocates one big page (order > 0) with __get_free_pages(GFP_ATOMIC, order). This single high_order page has a power-of-two size, bigger than the needed size. We can free all pages that wont be used by the hash table. On a 1GB i386 machine, this patch saves 128 KB of LOWMEM memory. TCP established hash table entries: 32768 (order: 6, 393216 bytes) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Pavel Emelianov	b92151bab9	Make /proc/slabinfo use seq_list_xxx helpers This entry prints a header in .start callback. This is OK, but the more elegant solution would be to move this into the .show callback and use seq_list_start_head() in .start one. I have left it as is in order to make the patch just switch to new API and noting more. [adobriyan@sw.ru: Wrong pointer was used as kmem_cache pointer] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Rolf Eike Beer	68e116a3b5	MM: use DIV_ROUND_UP() in mm/memory.c Replace a hand coded version of DIV_ROUND_UP(). Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Nishanth Aravamudan	31a5c6e4f2	hugetlb: remove unnecessary nid initialization nid is initialized to numa_node_id() but will either be overwritten in the loop or not used in the conditional. So remove the initialization. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
KAMEZAWA Hiroyuki	f0c0b2b808	change zonelist order: zonelist order selection logic Make zonelist creation policy selectable from sysctl/boot option v6. This patch makes NUMA's zonelist (of pgdat) order selectable. Available order are Default(automatic)/ Node-based / Zone-based. [Default Order] The kernel selects Node-based or Zone-based order automatically. [Node-based Order] This policy treats the locality of memory as the most important parameter. Zonelist order is created by each zone's locality. This means lower zones (ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion. IOW. ZONE_DMA will be in the middle of zonelist. current 2.6.21 kernel uses this. Pros. * A user can expect local memory as much as possible. Cons. * lower zone will be exhansted before higher zone. This may cause OOM_KILL. Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL because of ZONE_DMA exhaution and you need the best locality. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. node(0)'s memory allocation order: node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL. node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. [Zone-based order] This policy treats the zone type as the most important parameter. Zonelist order is created by zone-type order. This means lower zone never be used bofere higher zone exhaustion. IOW. ZONE_DMA will be always at the tail of zonelist. Pros. * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted. Cons. * memory locality may not be best. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. node(0)'s memory allocation order: node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA. node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. bootoption "numa_zonelist_order=" and proc/sysctl is supporetd. command: %echo N > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Node-based order. command: %echo Z > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Zone-based order. Thanks to Lee Schermerhorn, he gives me much help and codes. [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order] [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-16 09:05:35 -07:00
Eric Paris	ed03218951	security: Protection for exploiting null dereference using mmap Add a new security check on mmap operations to see if the user is attempting to mmap to low area of the address space. The amount of space protected is indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to 0, preserving existing behavior. This patch uses a new SELinux security class "memprotect." Policy already contains a number of allow rules like a_t self:process * (unconfined_t being one of them) which mean that putting this check in the process class (its best current fit) would make it useless as all user processes, which we also want to protect against, would be allowed. By taking the memprotect name of the new class it will also make it possible for us to move some of the other memory protect permissions out of 'process' and into the new class next time we bump the policy version number (which I also think is a good future idea) Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2007-07-11 22:52:29 -04:00
Paul Mackerras	bf22f6fe2d	Merge branch 'for-2.6.23' into merge	2007-07-11 13:28:26 +10:00
Carsten Otte	d054fe3d10	xip sendfile removal This patch removes xip_file_sendfile, the sendfile implementation for xip without replacement. Those customers that use xip on s390 are not using sendfile() as far as we know, and so far s390 is the only platform this could potentially be used on so far. Having sendfile is not a popular feature for execute in place file systems, however we have a working implementation of splice_read() based on fs/splice.c if anyone asks for it. At this point in time, it does not seem preferable to merge splice_read() for xip because it causes extra maintenence effort due to code duplication and it requires struct page behind the xip memory segment. We'd like to get rid of that in favor of supporting flash based embedded platforms (Monta Vista work) soon. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-10 08:04:15 +02:00
Hugh Dickins	ae97641646	shmem: convert to using splice instead of sendfile() Remove shmem_file_sendfile and resurrect shmem_readpage, as used by tmpfs to support loop and sendfile in 2.4 and 2.5. Now tmpfs can support splice, loop and sendfile in the simplest way, using generic_file_splice_read and generic_file_splice_write (with the aid of shmem_prepare_write). We could make some efficiency tweaks later, if there's a real need; but this is stable and works well as is. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-10 08:04:15 +02:00
Jens Axboe	0452a4e5d0	sendfile: kill generic_file_sendfile() It's no longer used. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-10 08:04:14 +02:00
Peter Zijlstra	4e99325b46	mm: double mark_page_accessed() in read_cache_page_async() Fix a post-2.6.21 regression. read_cache_page_async() has two invocations of mark_page_accessed() which will launch pages right onto the active list. Remove the first one, keeping the latter one. This avoids marking unwanted pages active (in the retry loop). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-08 10:13:21 -07:00
Christoph Lameter	d23cf676d0	slub: remove useless EXPORT_SYMBOL kmem_cache_open is static. EXPORT_SYMBOL was leftover from some earlier time period where kmem_cache_open was usable outside of slub. (Fixes powerpc build error) Signed-off-by: Chrsitoph Lameter <clameter@sgi.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-06 11:45:11 -07:00
Peter Zijlstra	23c1fb5296	mm: fixup /proc/vmstat output Line up the vmstat_text with zone_stat_item enum zone_stat_item { /* First 128 byte cacheline (assuming 64 bit words) / NR_FREE_PAGES, NR_INACTIVE, NR_ACTIVE, We current have nr_active and nr_inactive reversed. [ "OK with patch, though using initializers canbe handy to prevent such things in future: static const char const vmstat_text[] = { [NR_FREE_PAGES] = "nr_free_pages", ..." - Alexey ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-06 10:26:50 -07:00
David Woodhouse	87a927c715	Fix slab redzone alignment Commit `b46b8f19c9` fixed a couple of bugs by switching the redzone to 64 bits. Unfortunately, it neglected to ensure that the _second_ redzone, after the slab object, is aligned correctly. This caused illegal instruction faults on sparc32, which for some reason not entirely clear to me are not trapped and fixed up. Two things need to be done to fix this: - increase the object size, rounding up to alignof(long long) so that the second redzone can be aligned correctly. - If SLAB_STORE_USER is set but alignof(long long)==8, allow a full 64 bits of space for the user word at the end of the buffer, even though we may not _use_ the whole 64 bits. This patch should be a no-op on any 64-bit architecture or any 32-bit architecture where alignof(long long) == 4. Of the others, it's tested on ppc32 by myself and a very similar patch was tested on sparc32 by Mark Fortescue, who reported the new problem. Also, fix the conditions for FORCED_DEBUG, which hadn't been adjusted to the new sizes. Again noticed by Mark. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-05 15:54:13 -07:00
Christoph Lameter	dbc55faa64	SLUB: Make lockdep happy by not calling add_partial with interrupts enabled during bootstrap If we move the local_irq_enable() to the end of the function then add_partial() in early_kmem_cache_node_alloc() will be called with interrupts disabled like during regular operations. This makes lockdep happy. Signed-off-by: Christoph Lameter <clameter@sgi.com> Tested-by: Andre Noll <maan@systemlinux.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-03 13:56:13 -07:00
Christoph Lameter	17022220dd	SLAB: remove WARN_ON_ONCE for zero sized objects for 2.6.22 release We agreed to remove the WARN_ON_ONCE before 2.6.22 is released. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-01 12:29:43 -07:00
Hugh Dickins	30acbabae3	mm: kill validate_anon_vma to avoid mapcount BUG validate_anon_vma gave a useful check on the integrity of the anon_vma list when Andrea was developing obj rmap; but it was not enabled in SLES9 itself, nor in mainline, until Nick changed commented-out RMAP_DEBUG to configurable CONFIG_DEBUG_VM in 2.6.17. Now Petr Vandrovec reports that its BUG_ON(mapcount > 100000) can easily crash a CONFIG_DEBUG_VM=y system. That limit was just an arbitrary number to protect against an infinite loop. We could raise it to something enormous (depending on sizeof struct vma and size of memory?); but I rather think validate_anon_vma has outlived its usefulness, and is better just removed - which gives a magnificent performance boost to anything like Petr's test program ;) Of course, a very long anon_vma list is bad news for preemption latency, and I believe there has been one recent report of such: let's not forget that, but validate_anon_vma only makes it worse not better. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Petr Vandrovec <petr@vmware.com> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Andrea Arcangeli <andrea@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-28 11:34:53 -07:00
Christoph Lameter	8496634302	SLUB: fix behavior if the text output of list_locations overflows PAGE_SIZE If slabs are allocated or freed from a large set of call sites (typical for the kmalloc area) then we may create more output than fits into a single PAGE and sysfs only gives us one page. The output should be truncated. This patch fixes the checks to do the truncation properly. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-24 08:59:11 -07:00
Helge Deller	06b32f3ab6	[PARISC] Handle wrapping in expand_upwards() Function expand_upwards() did not guarded against wrapping around to address 0. This fixes the adjtimex02 testcase from the Linux Test Project on a 32bit PARISC kernel. [expand_upwards is only used on parisc and ia64; it looks like it does the right thing on both. --kyle] Signed-off-by: Helge Deller <deller@gmx.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>	2007-06-21 17:46:20 -04:00
Christoph Lameter	4b356be019	SLUB: minimum alignment fixes If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUBs smallest kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again) because the object size is padded to reach ARCH_KMALLOC_MINALIGN. Thus the size of the small slabs is all the same. No arch sets ARCH_KMALLOC_MINALIGN larger than 8 though except mips which for some reason wants a 128 byte alignment. This patch increases the size of the smallest cache if ARCH_KMALLOC_MINALIGN is greater than 8. In that case more and more of the smallest caches are disabled. If we do that then the count of the active general caches that is displayed on boot is not correct anymore since we may skip elements of the kmalloc array. So count them separately. This approach was tested by Havard yesterday. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-16 13:16:16 -07:00
Benjamin Herrenschmidt	8dab5241d0	Rework ptep_set_access_flags and fix sun4c Some changes done a while ago to avoid pounding on ptep_set_access_flags and update_mmu_cache in some race situations break sun4c which requires update_mmu_cache() to always be called on minor faults. This patch reworks ptep_set_access_flags() semantics, implementations and callers so that it's now responsible for returning whether an update is necessary or not (basically whether the PTE actually changed). This allow fixing the sparc implementation to always return 1 on sun4c. [akpm@linux-foundation.org: fixes, cleanups] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: David Miller <davem@davemloft.net> Cc: Mark Fortescue <mark@mtfhpc.demon.co.uk> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-16 13:16:16 -07:00
Christoph Lameter	dd08c40e3e	SLUB slab validation: Alloc while interrupts are disabled must use GFP_ATOMIC The data structure to manage the information gathered about functions allocating and freeing objects is allocated when the list_lock has already been taken. We need to allocate with GFP_ATOMIC instead of GFP_KERNEL. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-16 13:16:15 -07:00
Paul Mundt	d09c6b8094	mm: Fix memory/cpu hotplug section mismatch and oops. When building with memory hotplug enabled and cpu hotplug disabled, we end up with the following section mismatch: WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to .init.text: (between 'free_area_init_node' and '__build_all_zonelists') This happens as a result of: -> free_area_init_node() -> free_area_init_core() -> zone_pcp_init() <-- all __meminit up to this point -> zone_batchsize() <-- marked as __cpuinit fo This happens because CONFIG_HOTPLUG_CPU=n sets __cpuinit to __init, but CONFIG_MEMORY_HOTPLUG=y unsets __meminit. Changing zone_batchsize() to __devinit fixes this. __devinit is the only thing that is common between CONFIG_HOTPLUG_CPU=y and CONFIG_MEMORY_HOTPLUG=y. In the long run, perhaps this should be moved to another section identifier completely. Without this, memory hot-add of offline nodes (via hotadd_new_pgdat()) will oops if CPU hotplug is not also enabled. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> -- mm/page_alloc.c \| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)	2007-06-15 16:18:08 -07:00
Benjamin Herrenschmidt	c19c03fc74	[POWERPC] unmap_vm_area becomes unmap_kernel_range for the public This makes unmap_vm_area static and a wrapper around a new exported unmap_kernel_range that takes an explicit range instead of a vm_area struct. This makes it more versatile for code that wants to play with kernel page tables outside of the standard vmalloc area. (One example is some rework of the PowerPC PCI IO space mapping code that depends on that patch and removes some code duplication and horrible abuse of forged struct vm_struct). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2007-06-14 22:29:56 +10:00
Stephen Rothwell	193faea928	Move three functions that are only needed for CONFIG_MEMORY_HOTPLUG into the appropriate #ifdef. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-08 17:23:33 -07:00
Christoph Lameter	272c1d21d6	SLUB: return ZERO_SIZE_PTR for kmalloc(0) Instead of returning the smallest available object return ZERO_SIZE_PTR. A ZERO_SIZE_PTR can be legitimately used as an object pointer as long as it is not deferenced. The dereference of ZERO_SIZE_PTR causes a distinctive fault. kfree can handle a ZERO_SIZE_PTR in the same way as NULL. This enables functions to use zero sized object. e.g. n = number of objects. objects = kmalloc(n * sizeof(object)); for (i = 0; i < n; i++) objects[i].x = y; kfree(objects); Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-08 17:23:33 -07:00
Christoph Lameter	3cdc0ed0ce	slab: fix alien cache handling cache_free_alien must be called regardless if we use alien caches or not. cache_free_alien() will do the right thing if there are no alien caches available. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-08 17:23:32 -07:00
Hugh Dickins	a210906c1b	mount -t tmpfs -o mpol=: check nodes online Randy Dunlap reports that a tmpfs, mounted with NUMA mpol= specifying an offline node, crashes as soon as data is allocated upon it. Now restrict it to online nodes, where before it restricted to MAX_NUMNODES. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Robin Holt <holt@sgi.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Tested-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-08 17:23:32 -07:00

1 2 3 4 5 ...

1522 Коммитов