WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Nick Piggin	5344b7e648	vmstat: mlocked pages statistics Add NR_MLOCK zone page state, which provides a (conservative) count of mlocked pages (actually, the number of mlocked pages moved off the LRU). Reworked by lts to fit in with the modified mlock page support in the Reclaim Scalability series. [kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo] [lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:31 -07:00
Rik van Riel	ba470de431	mmap: handle mlocked pages during map, remap, unmap Originally by Nick Piggin <npiggin@suse.de> Remove mlocked pages from the LRU using "unevictable infrastructure" during mmap(), munmap(), mremap() and truncate(). Try to move back to normal LRU lists on munmap() when last mlocked mapping removed. Remove PageMlocked() status when page truncated from file. [akpm@linux-foundation.org: cleanup] [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()] [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework] [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block] [akpm@linux-foundation.org: remove bogus kerneldoc token] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:31 -07:00
Lee Schermerhorn	8edb08caf6	mlock: downgrade mmap sem while populating mlocked regions We need to hold the mmap_sem for write to initiatate mlock()/munlock() because we may need to merge/split vmas. However, this can lead to very long lock hold times attempting to fault in a large memory region to mlock it into memory. This can hold off other faults against the mm [multithreaded tasks] and other scans of the mm, such as via /proc. To alleviate this, downgrade the mmap_sem to read mode during the population of the region for locking. This is especially the case if we need to reclaim memory to lock down the region. We [probably?] don't need to do this for unlocking as all of the pages should be resident--they're already mlocked. Now, the caller's of the mlock functions [mlock_fixup() and mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode. Changing all callers appears to be way too much effort at this point. So, restore write mode before returning. Note that this opens a window where the mmap list could change in a multithreaded process. So, at least for mlock_fixup(), where we could be called in a loop over multiple vmas, we check that a vma still exists at the start address and that vma still covers the page range [start,end). If not, we return an error, -EAGAIN, and let the caller deal with it. Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if the vma at 'start' disappears or changes so that the page range [start,end) is no longer contained in the vma. Again, let the caller deal with it. Looks like only sys_remap_file_pages() [via mmap_region()] should actually care. With this patch, I no longer see processes like ps(1) blocked for seconds or minutes at a time waiting for a large [multiple gigabyte] region to be locked down. However, I occassionally see delays while unlocking or unmapping a large mlocked region. Should we also downgrade the mmap_sem for the unlock path? Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:31 -07:00
Nick Piggin	b291f00039	mlock: mlocked pages are unevictable Make sure that mlocked pages also live on the unevictable LRU, so kswapd will not scan them over and over again. This is achieved through various strategies: 1) add yet another page flag--PG_mlocked--to indicate that the page is locked for efficient testing in vmscan and, optionally, fault path. This allows early culling of unevictable pages, preventing them from getting to page_referenced()/try_to_unmap(). Also allows separate accounting of mlock'd pages, as Nick's original patch did. Note: Nick's original mlock patch used a PG_mlocked flag. I had removed this in favor of the PG_unevictable flag + an mlock_count [new page struct member]. I restored the PG_mlocked flag to eliminate the new count field. 2) add the mlock/unevictable infrastructure to mm/mlock.c, with internal APIs in mm/internal.h. This is a rework of Nick's original patch to these files, taking into account that mlocked pages are now kept on unevictable LRU list. 3) update vmscan.c:page_evictable() to check PageMlocked() and, if vma passed in, the vm_flags. Note that the vma will only be passed in for new pages in the fault path; and then only if the "cull unevictable pages in fault path" patch is included. 4) add try_to_unlock() to rmap.c to walk a page's rmap and ClearPageMlocked() if no other vmas have it mlocked. Reuses as much of try_to_unmap() as possible. This effectively replaces the use of one of the lru list links as an mlock count. If this mechanism let's pages in mlocked vmas leak through w/o PG_mlocked set [I don't know that it does], we should catch them later in try_to_unmap(). One hopes this will be rare, as it will be relatively expensive. Original mm/internal.h, mm/rmap.c and mm/mlock.c changes: Signed-off-by: Nick Piggin <npiggin@suse.de> splitlru: introduce __get_user_pages(): New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS. because current get_user_pages() can't grab PROT_NONE pages theresore it cause PROT_NONE pages can't munlock. [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch] [akpm@linux-foundation.org: untangle patch interdependencies] [akpm@linux-foundation.org: fix things after out-of-order merging] [hugh@veritas.com: fix page-flags mess] [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm'] [kosaki.motohiro@jp.fujitsu.com: build fix] [kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments] [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:30 -07:00
Lee Schermerhorn	89e004ea55	SHM_LOCKED pages are unevictable Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be kept on the normal LRU, since scanning them is a waste of time and might throw off kswapd's balancing algorithms. Place them on the unevictable LRU list instead. Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared memory regions as unevictable. Then these pages will be culled off the normal LRU lists during vmscan. Add new wrapper function to clear the mapping's unevictable state when/if shared memory segment is munlocked. Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in the shmem segment's mapping [struct address_space] for evictability now that they're no longer locked. If so, move them to the appropriate zone lru list. Changes depend on [CONFIG_]UNEVICTABLE_LRU. [kosaki.motohiro@jp.fujitsu.com: revert shm change] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:26 -07:00
Lee Schermerhorn	ba9ddf4939	Ramfs and Ram Disk pages are unevictable Christoph Lameter pointed out that ram disk pages also clutter the LRU lists. When vmscan finds them dirty and tries to clean them, the ram disk writeback function just redirties the page so that it goes back onto the active list. Round and round she goes... With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no longer the case, as ram disk pages are no longer maintained on the lru. [This makes them unmigratable for defrag or memory hot remove, but that can be addressed by a separate patch series.] However, the ramfs pages behave like ram disk pages used to, so: Define new address_space flag [shares address_space flags member with mapping's gfp mask] to indicate that the address space contains all unevictable pages. This will provide for efficient testing of ramfs pages in page_evictable(). Also provide wrapper functions to set/test the unevictable state to minimize #ifdefs in ramfs driver and any other users of this facility. Set the unevictable state on address_space structures for new ramfs inodes. Test the unevictable state in page_evictable() to cull unevictable pages. These changes depend on [CONFIG_]UNEVICTABLE_LRU. [riel@redhat.com: undo the brd.c part] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:26 -07:00
Lee Schermerhorn	7b854121eb	Unevictable LRU Page Statistics Report unevictable pages per zone and system wide. Kosaki Motohiro added support for memory controller unevictable statistics. [riel@redhat.com: fix printk in show_free_areas()] [akpm@linux-foundation.org: fix units in /proc/vmstats] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:26 -07:00
Lee Schermerhorn	bbfd28eee9	unevictable lru: add event counting with statistics Fix to unevictable-lru-page-statistics.patch Add unevictable lru infrastructure vm events to the statistics patch. Rename the "NORECL_" and "noreclaim_" symbols and text strings to "UNEVICTABLE_" and "unevictable_", respectively. Currently, both the infrastructure and the mlocked pages event are added by a single patch later in the series. This makes it difficult to add or rework the incremental patches. The events actually "belong" with the stats, so pull them up to here. Also, restore the event counting to putback_lru_page(). This was removed from previous patch in series where it was "misplaced". The actual events weren't defined that early. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:26 -07:00
Lee Schermerhorn	894bc31041	Unevictable LRU Infrastructure When the system contains lots of mlocked or otherwise unevictable pages, the pageout code (kswapd) can spend lots of time scanning over these pages. Worse still, the presence of lots of unevictable pages can confuse kswapd into thinking that more aggressive pageout modes are required, resulting in all kinds of bad behaviour. Infrastructure to manage pages excluded from reclaim--i.e., hidden from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to maintain "unevictable" pages on a separate per-zone LRU list, to "hide" them from vmscan. Kosaki Motohiro added the support for the memory controller unevictable lru list. Pages on the unevictable list have both PG_unevictable and PG_lru set. Thus, PG_unevictable is analogous to and mutually exclusive with PG_active--it specifies which LRU list the page is on. The unevictable infrastructure is enabled by a new mm Kconfig option [CONFIG_]UNEVICTABLE_LRU. A new function 'page_evictable(page, vma)' in vmscan.c tests whether or not a page may be evictable. Subsequent patches will add the various !evictable tests. We'll want to keep these tests light-weight for use in shrink_active_list() and, possibly, the fault path. To avoid races between tasks putting pages [back] onto an LRU list and tasks that might be moving the page from non-evictable to evictable state, the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()' -- tests the "evictability" of a page after placing it on the LRU, before dropping the reference. If the page has become unevictable, putback_lru_page() will redo the 'putback', thus moving the page to the unevictable list. This way, we avoid "stranding" evictable pages on the unevictable list. [akpm@linux-foundation.org: fix fallout from out-of-order merge] [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build] [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check] [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework] [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c] [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure] [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch] [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:26 -07:00
Rik van Riel	33c120ed28	more aggressively use lumpy reclaim During an AIM7 run on a 16GB system, fork started failing around 32000 threads, despite the system having plenty of free swap and 15GB of pageable memory. This was on x86-64, so 8k stacks. If a higher order allocation fails, we can either: - keep evicting pages off the end of the LRUs and hope that we eventually create a contiguous region; this is somewhat unlikely if the system is under enough stress by new allocations - after trying normal eviction for a bit, use lumpy reclaim This patch switches the system to lumpy reclaim if the VM is having trouble freeing enough pages, using the same threshold for detection as used by pageout congestion wait. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:26 -07:00
Rik van Riel	c5fdae469a	vmscan: add newly swapped in pages to the inactive list Swapin_readahead can read in a lot of data that the processes in memory never need. Adding swap cache pages to the inactive list prevents them from putting too much pressure on the working set. This has the potential to help the programs that are already in memory, but it could also be a disadvantage to processes that are trying to get swapped in. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
Rik van Riel	7e9cd48420	vmscan: fix pagecache reclaim referenced bit check Moving referenced pages back to the head of the active list creates a huge scalability problem, because by the time a large memory system finally runs out of free memory, every single page in the system will have been referenced. Not only do we not have the time to scan every single page on the active list, but since they have will all have the referenced bit set, that bit conveys no useful information. A more scalable solution is to just move every page that hits the end of the active list to the inactive list. We clear the referenced bit off of mapped pages, which need just one reference to be moved back onto the active list. Unmapped pages will be moved back to the active list after two references (see mark_page_accessed). We preserve the PG_referenced flag on unmapped pages to preserve accesses that were made while the page was on the active list. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
Rik van Riel	556adecba1	vmscan: second chance replacement for anonymous pages We avoid evicting and scanning anonymous pages for the most part, but under some workloads we can end up with most of memory filled with anonymous pages. At that point, we suddenly need to clear the referenced bits on all of memory, which can take ages on very large memory systems. We can reduce the maximum number of pages that need to be scanned by not taking the referenced state into account when deactivating an anonymous page. After all, every anonymous page starts out referenced, so why check? If an anonymous page gets referenced again before it reaches the end of the inactive list, we move it back to the active list. To keep the maximum amount of necessary work reasonable, we scale the active to inactive ratio with the size of memory, using the formula active:inactive ratio = sqrt(memory in GB * 10). Kswapd CPU use now seems to scale by the amount of pageout bandwidth, instead of by the amount of memory present in the system. [kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg] [kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
Rik van Riel	4f98a2fee8	vmscan: split LRU lists into anon & file sets Split the LRU lists in two, one set for pages that are backed by real file systems ("file") and one for pages that are backed by memory and swap ("anon"). The latter includes tmpfs. The advantage of doing this is that the VM will not have to scan over lots of anonymous pages (which we generally do not want to swap out), just to find the page cache pages that it should evict. This patch has the infrastructure and a basic policy to balance how much we scan the anon lists and how much we scan the file lists. The big policy changes are in separate patches. [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset] [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru] [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page] [hugh@veritas.com: memcg swapbacked pages active] [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED] [akpm@linux-foundation.org: fix /proc/vmstat units] [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration] [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo] [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
Rik van Riel	b2e185384f	define page_file_cache() function Define page_file_cache() function to answer the question: is page backed by a file? Originally part of Rik van Riel's split-lru patch. Extracted to make available for other, independent reclaim patches. Moved inline function to linux/mm_inline.h where it will be needed by subsequent "split LRU" and "noreclaim" patches. Unfortunately this needs to use a page flag, since the PG_swapbacked state needs to be preserved all the way to the point where the page is last removed from the LRU. Trying to derive the status from other info in the page resulted in wrong VM statistics in earlier split VM patchsets. The total number of page flags in use on a 32 bit machine after this patch is 19. [akpm@linux-foundation.org: fix up out-of-order merge fallout] [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner[ Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: MinChan Kim <minchan.kim@gmail.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
Rik van Riel	68a22394c2	vmscan: free swap space on swap-in/activation If vm_swap_full() (swap space more than 50% full), the system will free swap space at swapin time. With this patch, the system will also free the swap space in the pageout code, when we decide that the page is not a candidate for swapout (and just wasting swap space). Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: MinChan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
KOSAKI Motohiro	f04e9ebbe4	swap: use an array for the LRU pagevecs Turn the pagevecs into an array just like the LRUs. This significantly cleans up the source code and reduces the size of the kernel by about 13kB after all the LRU lists have been created further down in the split VM patch series. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
Christoph Lameter	b69408e88b	vmscan: Use an indexed array for LRU variables Currently we are defining explicit variables for the inactive and active list. An indexed array can be more generic and avoid repeating similar code in several places in the reclaim code. We are saving a few bytes in terms of code size: Before: text data bss dec hex filename 4097753 573120 4092484 8763357 85b7dd vmlinux After: text data bss dec hex filename 4097729 573120 4092484 8763333 85b7c5 vmlinux Having an easy way to add new lru lists may ease future work on the reclaim code. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
Nick Piggin	62695a84eb	vmscan: move isolate_lru_page() to vmscan.c On large memory systems, the VM can spend way too much time scanning through pages that it cannot (or should not) evict from memory. Not only does it use up CPU time, but it also provokes lock contention and can leave large systems under memory presure in a catatonic state. This patch series improves VM scalability by: 1) putting filesystem backed, swap backed and unevictable pages onto their own LRUs, so the system only scans the pages that it can/should evict from memory 2) switching to two handed clock replacement for the anonymous LRUs, so the number of pages that need to be scanned when the system starts swapping is bound to a reasonable number 3) keeping unevictable pages off the LRU completely, so the VM does not waste CPU time scanning them. ramfs, ramdisk, SHM_LOCKED shared memory segments and mlock()ed VMA pages are keept on the unevictable list. This patch: isolate_lru_page logically belongs to be in vmscan.c than migrate.c. It is tough, because we don't need that function without memory migration so there is a valid argument to have it in migrate.c. However a subsequent patch needs to make use of it in the core mm, so we can happily move it to vmscan.c. Also, make the function a little more generic by not requiring that it adds an isolated page to a given list. Callers can do that. Note that we now have '__isolate_lru_page()', that does something quite different, visible outside of vmscan.c for use with memory controller. Methinks we need to rationalize these names/purposes. --lts [akpm@linux-foundation.org: fix mm/memory_hotplug.c build] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
Badari Pulavarty	71088785c6	mm: cleanup to make remove_memory() arch-neutral There is nothing architecture specific about remove_memory(). remove_memory() function is common for all architectures which support hotplug memory remove. Instead of duplicating it in every architecture, collapse them into arch neutral function. [akpm@linux-foundation.org: fix the export] Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Gary Hade <garyhade@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:50:25 -07:00
Linus Torvalds	d9d332e087	anon_vma_prepare: properly lock even newly allocated entries The anon_vma code is very subtle, and we end up doing optimistic lookups of anon_vmas under RCU in page_lock_anon_vma() with no locking. Other CPU's can also see the newly allocated entry immediately after we've exposed it by setting "vma->anon_vma" to the new value. We protect against the anon_vma being destroyed by having the SLAB marked as SLAB_DESTROY_BY_RCU, so the RCU lookup can depend on the allocation not being destroyed - but it might still be free'd and re-allocated here to a new vma. As a result, we should not do the anon_vma list ops on a newly allocated vma without proper locking. Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Hugh Dickins <hugh@veritas.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-19 11:50:35 -07:00
Linus Torvalds	f7ea4a4ba8	Merge branch 'drm-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6 * 'drm-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (44 commits) drm/i915: fix ioremap of a user address for non-root (CVE-2008-3831) drm: make CONFIG_DRM depend on CONFIG_SHMEM. radeon: fix PCI bus mastering support enables. radeon: add RS400 family support. drm/radeon: add support for RS740 IGP chipsets. i915: GM45 has GM965-style MCH setup. i915: Don't run retire work handler while suspended i915: Map status page cached for chips with GTT-based HWS location. i915: Fix up ring initialization to cover G45 oddities i915: Use non-reserved status page index for breadcrumb drm: Increment dev_priv->irq_received so i915_gem_interrupts count works. drm: kill drm_device->irq drm: wbinvd is cache coherent. i915: add missing return in error path. i915: fixup permissions on gem ioctls. drm: Clean up many sparse warnings in i915. drm: Use ioremap_wc in i915_driver instead of ioremap, since we always want WC. drm: G33-class hardware has a newer 965-style MCH (no DCC register). drm: Avoid oops in GEM execbuffers with bad arguments. DRM: Return -EBADF on bad object in flink, and return curent name if it exists. ...	2008-10-17 15:09:20 -07:00
Keith Packard	395e0ddc44	Export shmem_file_setup for DRM-GEM GEM needs to create shmem files to back buffer objects. Though currently creation of files for objects could have been driven from userland, the modesetting work will require allocation of buffer objects before userland is running, for boot-time message display. Signed-off-by: Eric Anholt <eric@anholt.net> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Dave Airlie <airlied@redhat.com>	2008-10-18 07:10:11 +10:00
Linus Torvalds	e533b22705	Merge branch 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails softirq, warning fix: correct a format to avoid a warning softirqs, debug: preemption check x86, pci-hotplug, calgary / rio: fix EBDA ioremap() IO resources, x86: ioremap sanity check to catch mapping requests exceeding, fix IO resources, x86: ioremap sanity check to catch mapping requests exceeding the BAR sizes softlockup: Documentation/sysctl/kernel.txt: fix softlockup_thresh description dmi scan: warn about too early calls to dmi_check_system() generic: redefine resource_size_t as phys_addr_t generic: make PFN_PHYS explicitly return phys_addr_t generic: add phys_addr_t for holding physical addresses softirq: allocate less vectors IO resources: fix/remove printk printk: robustify printk, update comment printk: robustify printk, fix #2 printk: robustify printk, fix printk: robustify printk Fixed up conflicts in: arch/powerpc/include/asm/types.h arch/powerpc/platforms/Kconfig.cputype manually.	2008-10-16 15:17:40 -07:00
Francois Cami	e1f8e87449	Remove Andrew Morton's old email accounts People can use the real name an an index into MAINTAINERS to find the current email address. Signed-off-by: Francois Cami <francois.cami@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
Jan Beulich	9ba16087d9	Kconfig: eliminate "def_bool n" constructs Using "def_bool n" is pointless, simply using bool here appears more appropriate. Further, retaining such options that don't have a prompt and aren't selected by anything seems also at least questionable. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Harvey Harrison	80a914dc05	misc: replace __FUNCTION__ with __func__ __FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:30 -07:00
Krishna Kumar	0c6aa2639e	mm: do_generic_file_read() never gets a NULL 'filp' argument The 'filp' argument to do_generic_file_read() is never NULL. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:29 -07:00
David Gibson	b4d1d99fdd	hugetlb: handle updating of ACCESSED and DIRTY in hugetlb_fault() The page fault path for normal pages, if the fault is neither a no-page fault nor a write-protect fault, will update the DIRTY and ACCESSED bits in the page table appropriately. The hugepage fault path, however, does not do this, handling only no-page or write-protect type faults. It assumes that either the ACCESSED and DIRTY bits are irrelevant for hugepages (usually true, since they are never swapped) or that they are handled by the arch code. This is inconvenient for some software-loaded TLB architectures, where the _PAGE_ACCESSED (_PAGE_DIRTY) bits need to be set to enable read (write) access to the page at the TLB miss. This could be worked around in the arch TLB miss code, but the TLB miss fast path can be made simple more easily if the hugetlb_fault() path handles this, as the normal page fault path does. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:29 -07:00
Andrew Morton	db99100d2e	mm/page_alloc.c:free_area_init_nodes() fix inappropriate use of enum Local variable `i' is a) misleadingly-named for an `enum zone_type' and b) used for indexing zones as well as nodes as well as node_maps. Make it an `int'. Reported-by: Frans Pop <elendil@planet.nl> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:29 -07:00
Aneesh Kumar K.V	17bc6c30cf	vfs: Add no_nrwrite_index_update writeback control flag If no_nrwrite_index_update is set we don't update nr_to_write and address space writeback_index in write_cache_pages. This change enables a file system to skip these updates in write_cache_pages and do them in the writepages() callback. This patch will be followed by an ext4 patch that make use of these new flags. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> CC: linux-fsdevel@vger.kernel.org	2008-10-16 10:09:17 -04:00
Linus Torvalds	73bdf0a60e	Introduce is_vmalloc_or_module_addr() and use with DEBUG_VIRTUAL Impact: crash on module insertion with CONFIG_DEBUG_VIRTUAL We would incorrectly BUG due to: VIRTUAL_BUG_ON(!is_vmalloc_addr(vmalloc_addr) && !is_module_address(addr)); ... because, at least on x86-64, is_module_address() doesn't do what it should. This patch introduces is_vmalloc_or_module_addr(), which is what we really want anyway, and uses it instead. Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2008-10-16 03:25:58 -07:00
Ingo Molnar	6b2ada8210	Merge branches 'core/softlockup', 'core/softirq', 'core/resources', 'core/printk' and 'core/misc' into core-v28-for-linus	2008-10-15 12:48:44 +02:00
Oleg Nesterov	8546232355	do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails If lock_page_killable() fails because the task was killed by SIGKILL or any other fatal signal, do_generic_file_read() returns -EIO. This seems to be OK, because in fact the userspace won't see this error, the task will dequeue SIGKILL and exit. However, /sbin/init is different, it will dequeue SIGKILL, ignore it, and return to the user-space with the bogus -EIO. Change the code to return the error code from lock_page_killable(), -EINTR. This doesn't fix the bug, but perhaps makes sense anyway. Imho, with this change the code looks a bit more logical, and the "good" init should handle the spurious EINTR or short read. Afaics we can also change lock_page_killable() to return -ERESTARTNOINTR, but this can't prevent the short reads. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 17:15:33 +02:00
Aneesh Kumar K.V	74baaaaec8	vfs: Remove the range_cont writeback mode. Ext4 was the only user of range_cont writeback mode and ext4 switched to a different method. So remove the range_cont mode which is not used in the kernel. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> CC: linux-fsdevel@vger.kernel.org	2008-10-14 09:21:02 -04:00
Mimi Zohar	9256292782	integrity: special fs magic Discussion on the mailing list questioned the use of these magic values in userspace, concluding these values are already exported to userspace via statfs and their correct/incorrect usage is left up to the userspace application. - Move special fs magic number definitions to magic.h - Add magic.h include Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>	2008-10-13 09:47:43 +11:00
Ingo Molnar	8daf14cf56	Merge branches 'x86/xen', 'x86/build', 'x86/microcode', 'x86/mm-debug-v2', 'x86/memory-corruption-check', 'x86/early-printk', 'x86/xsave', 'x86/ptrace-v2', 'x86/quirks', 'x86/setup', 'x86/spinlocks' and 'x86/signal' into x86/core-v2	2008-10-12 15:50:02 +02:00
Linus Torvalds	ec8deffa33	Merge phase #2 (PAT updates) of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-v28-for-linus-phase2-B' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits) x86, cpa: make the kernel physical mapping initialization a two pass sequence, fix x86, pat: cleanups x86: fix pagetable init 64-bit breakage x86: track memtype for RAM in page struct x86, cpa: srlz cpa(), global flush tlb after splitting big page and before doing cpa x86, cpa: remove cpa pool code x86, cpa: no need to check alias for __set_pages_p/__set_pages_np x86, cpa: dont use large pages for kernel identity mapping with DEBUG_PAGEALLOC x86, cpa: make the kernel physical mapping initialization a two pass sequence x86, cpa: remove USER permission from the very early identity mapping attribute x86, cpa: rename PTE attribute macros for kernel direct mapping in early boot x86: make sure the CPA test code's use of _PAGE_UNUSED1 is obvious linux-next: fix x86 tree build failure x86: have set_memory_array_{uc,wb} coalesce memtypes, fix agp: enable optimized agp_alloc_pages methods x86: have set_memory_array_{uc,wb} coalesce memtypes. x86: {reverve,free}_memtype() take a physical address x86: fix pageattr-test agp: add agp_generic_destroy_pages() agp: generic_alloc_pages() ...	2008-10-11 11:02:56 -07:00
Linus Torvalds	e26feff647	Merge branch 'for-2.6.28' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.28' of git://git.kernel.dk/linux-2.6-block: (132 commits) doc/cdrom: Trvial documentation error, file not present block_dev: fix kernel-doc in new functions block: add some comments around the bio read-write flags block: mark bio_split_pool static block: Find bio sector offset given idx and offset block: gendisk integrity wrapper block: Switch blk_integrity_compare from bdev to gendisk block: Fix double put in blk_integrity_unregister block: Introduce integrity data ownership flag block: revert part of d7533ad0e132f92e75c1b2eb7c26387b25a583c1 bio.h: Remove unused conditional code block: remove end_{queued\|dequeued}_request() block: change elevator to use __blk_end_request() gdrom: change to use __blk_end_request() memstick: change to use __blk_end_request() virtio_blk: change to use __blk_end_request() blktrace: use BLKTRACE_BDEV_SIZE as the name size for setup structure block: add lld busy state exporting interface block: Fix blk_start_queueing() to not kick a stopped queue include blktrace_api.h in headers_install ...	2008-10-10 10:52:45 -07:00
Ingo Molnar	3dd392a407	Merge branch 'linus' into x86/pat2 Conflicts: arch/x86/mm/init_64.c	2008-10-10 19:30:08 +02:00
Matt Mackall	70096a561d	SLOB: fix bogus ksize calculation fix This fixes the previous fix, which was completely wrong on closer inspection. This version has been manually tested with a user-space test harness and generates sane values. A nearly identical patch has been boot-tested. The problem arose from changing how kmalloc/kfree handled alignment padding without updating ksize to match. This brings it in sync. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-09 12:18:27 -07:00
Jens Axboe	36144077bc	highmem: use bio_has_data() in the bounce path Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:01 +02:00
Matt Mackall	85ba94ba05	SLOB: fix bogus ksize calculation SLOB's ksize calculation was braindamaged and generally harmlessly underreported the allocation size. But for very small buffers, it could in fact overreport them, leading code depending on krealloc to overrun the allocation and trample other data. Signed-off-by: Matt Mackall <mpm@selenic.com> Tested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-07 11:19:23 -07:00
Andy Whitcroft	6babc32c41	mm: handle initialising compound pages at orders greater than MAX_ORDER When we initialise a compound page we initialise the page flags and head page pointer for all base pages spanned by that page. When we initialise a gigantic page (a page of order greater than or equal to MAX_ORDER) we have to initialise more than MAX_ORDER_NR_PAGES pages. Currently we assume that all elements of the mem_map in this page are contigious in memory. However this is only guarenteed out to MAX_ORDER_NR_PAGES pages, and with SPARSEMEM enabled they will not be contigious. This leads us to walk off the end of the first section and scribble on everything which follows, BAD. When we reach a MAX_ORDER_NR_PAGES boundary we much locate the next section of the mem_map. As gigantic pages can only be maximally aligned we know this will occur at exact multiple of MAX_ORDER_NR_PAGES pages from the start of the page. This is a bug fix for the gigantic page support in hugetlbfs. Credit to Mel Gorman for spotting the issue. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-02 15:53:13 -07:00
Nick Piggin	4b19de6d1c	mm: tiny-shmem nommu fix The previous patch `db203d53d4` ("mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU architectures because it was using the expanding truncate to signal ramfs to allocate a physically contiguous RAM backing the inode (otherwise it is unusable for "memory mapping" it to userspace). However do_truncate is what caused the lock ordering error, due to it taking i_mutex. In this case, we can actually just call ramfs directly to allocate memory for the mapping, rather than go via truncate. Acked-by: David Howells <dhowells@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-02 15:53:13 -07:00
Gerald Schaefer	6c1b7f680d	memory hotplug: missing zone->lock in test_pages_isolated() __test_page_isolated_in_pageblock() in mm/page_isolation.c has a comment saying that the caller must hold zone->lock. But the only caller of that function, test_pages_isolated(), does not hold zone->lock and the lock is also not acquired anywhere before. This patch adds the missing zone->lock to test_pages_isolated(). We reproducibly run into BUG_ON(!PageBuddy(page)) in __offline_isolated_pages() during memory hotplug stress test, see trace below. This patch fixes that problem, it would be good if we could have it in 2.6.27. kernel BUG at /home/autobuild/BUILD/linux-2.6.26-20080909/mm/page_alloc.c:4561! illegal operation: 0001 [#1] PREEMPT SMP Modules linked in: dm_multipath sunrpc bonding qeth_l3 dm_mod qeth ccwgroup vmur CPU: 1 Not tainted 2.6.26-29.x.20080909-s390default #1 Process memory_loop_all (pid: 10025, task: 2f444028, ksp: 2b10dd28) Krnl PSW : 040c0000 801727ea (__offline_isolated_pages+0x18e/0x1c4) R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 Krnl GPRS: 00000000 7e27fc00 00000000 7e27fc00 00000000 00000400 00014000 7e27fc01 00606f00 7e27fc00 00013fe0 2b10dd28 00000005 80172662 801727b2 2b10dd28 Krnl Code: 801727de: 5810900c l %r1,12(%r9) 801727e2: a7f4ffb3 brc 15,80172748 801727e6: a7f40001 brc 15,801727e8 >801727ea: a7f4ffbc brc 15,80172762 801727ee: a7f40001 brc 15,801727f0 801727f2: a7f4ffaf brc 15,80172750 801727f6: 0707 bcr 0,%r7 801727f8: 0017 unknown Call Trace: ([<0000000000172772>] __offline_isolated_pages+0x116/0x1c4) [<00000000001953a2>] offline_isolated_pages_cb+0x22/0x34 [<000000000013164c>] walk_memory_resource+0xcc/0x11c [<000000000019520e>] offline_pages+0x36a/0x498 [<00000000001004d6>] remove_memory+0x36/0x44 [<000000000028fb06>] memory_block_change_state+0x112/0x150 [<000000000028ffb8>] store_mem_state+0x90/0xe4 [<0000000000289c00>] sysdev_store+0x34/0x40 [<00000000001ee048>] sysfs_write_file+0xd0/0x178 [<000000000019b1a8>] vfs_write+0x74/0x118 [<000000000019b9ae>] sys_write+0x46/0x7c [<000000000011160e>] sysc_do_restart+0x12/0x16 [<0000000077f3e8ca>] 0x77f3e8ca Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-02 15:53:13 -07:00
Balbir Singh	31a78f23ba	mm owner: fix race between swapoff and exit There's a race between mm->owner assignment and swapoff, more easily seen when task slab poisoning is turned on. The condition occurs when try_to_unuse() runs in parallel with an exiting task. A similar race can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats> or ptrace or page migration. CPU0 CPU1 try_to_unuse looks at mm = task0->mm increments mm->mm_users task 0 exits mm->owner needs to be updated, but no new owner is found (mm_users > 1, but no other task has task->mm = task0->mm) mm_update_next_owner() leaves mmput(mm) decrements mm->mm_users task0 freed dereferencing mm->owner fails The fix is to notify the subsystem via mm_owner_changed callback(), if no new owner is found, by specifying the new task as NULL. Jiri Slaby: mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but must be set after that, so as not to pass NULL as old owner causing oops. Daisuke Nishimura: mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task() and its callers need to take account of this situation to avoid oops. Hugh Dickins: Lockdep warning and hang below exec_mmap() when testing these patches. exit_mm() up_reads mmap_sem before calling mm_update_next_owner(), so exec_mmap() now needs to do the same. And with that repositioning, there's now no point in mm_need_new_owner() allowing for NULL mm. Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-29 08:41:47 -07:00
Daisuke Nishimura	a10cebf56c	memcg: check under limit at shrink_usage Current memory cgroup(both in mainline and -mm) doesn't account swap caches as memory(swap cache support is dropped temporarily now). So try_to_free_mem_cgroup_pages doesn't reflect the count of pages that have been moved to swap cache. But this makes mem_cgroup_shrink_usage fail easily if most of the pages are anon/shmem, and then shmem_getpage returns -ENOMEM and the process will be killed. This patch adds res_counter_check_under_limit to avoid these cases. BTW, even if swap cache support is enabled again, if a process is moved to another cgroup, which has been just made, between precharge and shrink_usage in shmem_getpage, shrink_usage may fail just because there is no pages to reclaim. So this change would make sense anyway. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-23 08:09:14 -07:00
Nick Piggin	db203d53d4	mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex tiny-shmem calls do_truncate in shmem_file_setup. do_truncate takes i_mutex, and shmem_file_setup is called with mmap_sem held. However i_mutex nests outside mmap_sem. Copy the code in shmem.c to avoid this problem. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-and-tested-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Matt Mackall <mpm@selenic.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-23 08:09:14 -07:00
Salman Qazi	02b71b7012	slub: fixed uninitialized counter in struct kmem_cache_node Initialized total objects atomic for the node in init_kmem_cache_node. The uninitialized value was ruining the stats in /proc/slabinfo. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2008-09-15 09:49:05 +03:00
Ingo Molnar	f81b691a3d	Merge commit 'v2.6.27-rc6' into x86/pat	2008-09-14 17:26:53 +02:00
Jeremy Fitzhardinge	600715dcdf	generic: add phys_addr_t for holding physical addresses Add a kernel-wide "phys_addr_t" which is guaranteed to be able to hold any physical address. By default it equals the word size of the architecture, but a 32-bit architecture can set ARCH_PHYS_ADDR_T_64BIT if it needs a 64-bit phys_addr_t. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-14 17:24:25 +02:00
Mel Gorman	5bead2a068	mm: mark the correct zone as full when scanning zonelists The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when scanning zonelists to keep track of where in the zonelist it is. The zoneref that is returned corresponds to the the next zone that is to be scanned, not the current one. It was intended to be treated as an opaque list. When the page allocator is scanning a zonelist, it marks elements in the zonelist corresponding to zones that are temporarily full. As the zonelist is being updated, it uses the cursor here; if (NUMA_BUILD) zlc_mark_zone_full(zonelist, z); This is intended to prevent rescanning in the near future but the zoneref cursor does not correspond to the zone that has been found to be full. This is an easy misunderstanding to make so this patch corrects the problem by changing zoneref cursor to be the current zone being scanned instead of the next one. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-13 14:41:52 -07:00
Tejun Heo	ce36394269	mmap: fix petty bug in anonymous shared mmap offset handling Anonymous mappings should ignore offset but shared anonymous mapping forgot to clear it and makes the following legit test program trigger SIGBUS. #include <sys/mman.h> #include <stdio.h> #include <errno.h> #define PAGE_SIZE 4096 int main(void) { char p; int i; p = mmap(NULL, 2 PAGE_SIZE, PROT_READ\|PROT_WRITE, MAP_SHARED\|MAP_ANONYMOUS, -1, PAGE_SIZE); if (p == MAP_FAILED) { perror("mmap"); return 1; } for (i = 0; i < 2; i++) { printf("page %d\n", i); p[i * 4096] = i; } return 0; } Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Hugh Dickins <hugh@veritas.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-03 19:58:53 -07:00
KOSAKI Motohiro	b954185214	mm: size of quicklists shouldn't be proportional to the number of CPUs Quicklists store pages for each CPU as caches. (Each CPU can cache node_free_pages/16 pages) It is used for page table cache. exit() will increase the cache size, while fork() consumes it. So for example if an apache-style application runs (one parent and many child model), one CPU process will fork() while another CPU will process the middleware work and exit(). At that time, the CPU on which the parent runs doesn't have page table cache at all. Others (on which children runs) have maximum caches. QList_max = (#ofCPUs - 1) x Free / 16 => QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1) So, How much quicklist memory is used in the maximum case? This is proposional to # of CPUs because the limit of per cpu quicklist cache doesn't see the number of cpus. Above calculation mean Number of CPUs per node 2 4 8 16 ============================== ==================== QList_max / (Free + QList_max) 5.8% 16% 30% 48% Wow! Quicklist can spend about 50% memory at worst case. My demonstration program is here -------------------------------------------------------------------------------- #define _GNU_SOURCE #include <stdio.h> #include <errno.h> #include <stdlib.h> #include <string.h> #include <sched.h> #include <unistd.h> #include <sys/mman.h> #include <sys/wait.h> #define BUFFSIZE 512 int max_cpu(void) /* get max number of logical cpus from /proc/cpuinfo / { FILE fd; char ret, buffer[BUFFSIZE]; int cpu = 1; fd = fopen("/proc/cpuinfo", "r"); if (fd == NULL) { perror("fopen(/proc/cpuinfo)"); exit(EXIT_FAILURE); } while (1) { ret = fgets(buffer, BUFFSIZE, fd); if (ret == NULL) break; if (!strncmp(buffer, "processor", 9)) cpu = atoi(strchr(buffer, ':') + 2); } fclose(fd); return cpu; } void cpu_bind(int cpu) / bind current process to one cpu / { cpu_set_t mask; int ret; CPU_ZERO(&mask); CPU_SET(cpu, &mask); ret = sched_setaffinity(0, sizeof(mask), &mask); if (ret == -1) { perror("sched_setaffinity()"); exit(EXIT_FAILURE); } sched_yield(); / not necessary / } #define MMAP_SIZE (10 1024 * 1024) /* 10 MB / #define FORK_INTERVAL 1 / 1 second / main(int argc, char argv[]) { int cpu_max, nextcpu; long pagesize; pid_t pid; /* set max number of logical cpu / if (argc > 1) cpu_max = atoi(argv[1]) - 1; else cpu_max = max_cpu(); / get the page size / pagesize = sysconf(_SC_PAGESIZE); if (pagesize == -1) { perror("sysconf(_SC_PAGESIZE)"); exit(EXIT_FAILURE); } / prepare parent process / cpu_bind(0); nextcpu = cpu_max; loop: / select destination cpu for child process by round-robin rule / if (++nextcpu > cpu_max) nextcpu = 1; pid = fork(); if (pid == 0) { / child action / char p; int i; /* consume page tables / p = mmap(0, MMAP_SIZE, PROT_WRITE, MAP_PRIVATE \| MAP_ANONYMOUS, 0, 0); i = MMAP_SIZE / pagesize; while (i-- > 0) { p = 1; p += pagesize; } /* move to other cpu / cpu_bind(nextcpu); / printf("a child moved to cpu%d after mmap().\n", nextcpu); fflush(stdout); / / back page tables to pgtable_quicklist / exit(0); } else if (pid > 0) { / parent action */ sleep(FORK_INTERVAL); waitpid(pid, NULL, WNOHANG); } goto loop; } ---------------------------------------- When above program which does task migration runs, my 8GB box spends 800MB of memory for quicklist. This is not memory leak but doesn't seem good. % cat /proc/meminfo MemTotal: 7701568 kB MemFree: 4724672 kB (snip) Quicklists: 844800 kB because - My machine spec is number of numa node: 2 number of cpus: 8 (4CPU x2 node) total mem: 8GB (4GB x2 node) free mem: about 5GB - Then, 4.7GB x 16% ~= 880MB. So, Quicklist can use 800MB. So, if following spec machine run that program CPUs: 64 (8cpu x 8node) Mem: 1TB (128GB x8node) Then, quicklist can waste 300GB (= 1TB x 30%). It is too large. So, I don't like cache policies which is proportional to # of cpus. My patch changes the number of caches from: per-cpu-cache-amount = memory_on_node / 16 to per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Tested-by: David Miller <davem@davemloft.net> Acked-by: Mike Travis <travis@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-02 19:21:38 -07:00
Marcin Slusarz	527655835e	mm/bootmem: silence section mismatch warning - contig_page_data/bootmem_node_data WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data The variable contig_page_data references the variable __initdata bootmem_node_data If the reference is valid then annotate the variable with __init* (see linux/init.h) or name the variable: driver, _template, _timer, _sht, _ops, _probe, _probe_one, _console, Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Johannes Weiner <hannes@saeurebad.de> Cc: Sean MacLennan <smaclennan@pikatech.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-02 19:21:37 -07:00
Hisashi Hifumi	6ccfa806a9	VFS: fix dio write returning EIO when try_to_release_page fails Dio write returns EIO when try_to_release_page fails because bh is still referenced. The patch commit `3f31fddfa2` Author: Mingming Cao <cmm@us.ibm.com> Date: Fri Jul 25 01:46:22 2008 -0700 jbd: fix race between free buffer and commit transaction was merged into 2.6.27-rc1, but I noticed that this patch is not enough to fix the race. I did fsstress test heavily to 2.6.27-rc1, and found that dio write still sometimes got EIO through this test. The patch above fixed race between freeing buffer(dio) and committing transaction(jbd) but I discovered that there is another race, freeing buffer(dio) and ext3/4_ordered_writepage. : background_writeout() ->write_cache_pages() ->ext3_ordered_writepage() walk_page_buffers() -> take a bh ref block_write_full_page() -> unlock_page : <- end_page_writeback : <- race! (dio write->try_to_release_page fails) walk_page_buffers() ->release a bh ref ext3_ordered_writepage holds bh ref and does unlock_page remaining taking a bh ref, so this causes the race and failure of try_to_release_page. To fix this race, I used the approach of falling back to buffered writes if try_to_release_page() fails on a page. [akpm@linux-foundation.org: cleanups] Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-02 19:21:37 -07:00
Adam Litke	344c790e38	mm: make setup_zone_migrate_reserve() aware of overlapping nodes I have gotten to the root cause of the hugetlb badness I reported back on August 15th. My system has the following memory topology (note the overlapping node): Node 0 Memory: 0x8000000-0x44000000 Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000 setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking for a pageblock to move onto the MIGRATE_RESERVE list. Finding no candidates, it happily continues the scan into 0x8000000-0x44000000. When a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on the wrong zone. Oops. setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-02 19:21:37 -07:00
David Woodhouse	0ed97ee470	Remove '#include <stddef.h>' from mm/page_isolation.c Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2008-09-02 09:29:01 +01:00
Linus Torvalds	41c3e45f08	Merge master.kernel.org:/home/rmk/linux-2.6-arm * master.kernel.org:/home/rmk/linux-2.6-arm: [ARM] 5226/1: remove unmatched comment end. [ARM] Skip memory holes in FLATMEM when reading /proc/pagetypeinfo [ARM] use bcd2bin/bin2bcd [ARM] use the new byteorder headers [ARM] OMAP: Fix 2430 SMC91x ethernet IRQ [ARM] OMAP: Add and update OMAP default configuration files [ARM] OMAP: Change mailing list for OMAP in MAINTAINERS [ARM] S3C2443: Fix the S3C2443 clock register definitions [ARM] JIVE: Fix the spi bus numbering [ARM] S3C24XX: pwm.c: stop debugging output [ARM] S3C24XX: Fix sparse warnings in pwm.c [ARM] S3C24XX: Fix spare errors in pwm-clock driver [ARM] S3C24XX: Fix sparse warnings in arch/arm/plat-s3c24xx/gpiolib.c [ARM] S3C24XX: Fix nor-simtec driver sparse errors [ARM] 5225/1: zaurus: Register I2C controller for audio codecs [ARM] orion5x: update defconfig to v2.6.27-rc4 [ARM] Orion: register UART1 on QNAP TS-209 and TS-409 [ARM] Orion: activate lm75 driver on DNS-323 [ARM] Orion: fix MAC detection on QNAP TS-209 and TS-409 [ARM] Orion: Fix boot crash on Kurobox Pro	2008-08-28 12:34:27 -07:00
Linus Torvalds	e4268bd3b2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: Disable NUMA remote node defragmentation by default	2008-08-27 14:33:06 -07:00
Mel Gorman	e80d6a2482	[ARM] Skip memory holes in FLATMEM when reading /proc/pagetypeinfo Ordinarily, memory holes in flatmem still have a valid memmap and is safe to use. However, an architecture (ARM) frees up the memmap backing memory holes on the assumption it is never used. /proc/pagetypeinfo reads the whole range of pages in a zone believing that the memmap is valid and that pfn_valid will return false if it is not. On ARM, freeing the memmap breaks the page->zone linkages even though pfn_valid() returns true and the kernel can oops shortly afterwards due to accessing a bogus struct zone *. This patch lets architectures say when FLATMEM can have holes in the memmap. Rather than an expensive check for valid memory, /proc/pagetypeinfo will confirm that the page linkages are still valid by checking page->zone is still the expected zone. The lookup of page_zone is safe as there is a limited range of memory that is accessed when calling page_zone. Even if page_zone happens to return the correct zone, the impact is that the counters in /proc/pagetypeinfo are slightly off but fragmentation monitoring is unlikely to be relevant on an embedded system. Reported-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Tested-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>	2008-08-27 20:09:28 +01:00
Ingo Molnar	8b53b57576	Merge branch 'x86/urgent' into x86/pat Conflicts: arch/x86/mm/pageattr.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-22 06:06:51 +02:00
Nick Piggin	14bac5acfd	mm: xip/ext2 fix block allocation race XIP can call into get_xip_mem concurrently with the same file,offset with create=1. This usually maps down to get_block, which expects the page lock to prevent such a situation. This causes ext2 to explode for one reason or another. Serialise those calls for the moment. For common usages today, I suspect get_xip_mem rarely is called to create new blocks. In future as XIP technologies evolve we might need to look at which operations require scalability, and rework the locking to suit. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Acked-by: Carsten Otte <cotte@freenet.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-20 15:40:32 -07:00
Nick Piggin	538f8ea6c8	mm: xip fix fault vs sparse page invalidate race XIP has a race between sparse pages being inserted into page tables, and sparse pages being zapped when its time to put a non-sparse page in. What can happen is that a process can be left with a dangling sparse page in a MAP_SHARED mapping, while the rest of the world sees the non-sparse version. Ie. data corruption. Guard these operations with a seqlock, making fault-in-sparse-pages the slowpath, and try-to-unmap-sparse-pages the fastpath. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Acked-by: Carsten Otte <cotte@freenet.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-20 15:40:32 -07:00
Nick Piggin	479db0bf40	mm: dirty page tracking race fix There is a race with dirty page accounting where a page may not properly be accounted for. clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty. page_mkclean walks the rmaps for that page, and for each one it cleans and write protects the pte if it was dirty. It uses page_check_address to find the pte. That function has a shortcut to avoid the ptl if the pte is not present. Unfortunately, the pte can be switched to not-present then back to present by other code while holding the page table lock -- this should not be a signal for page_mkclean to ignore that pte, because it may be dirty. For example, powerpc64's set_pte_at will clear a previously present pte before setting it to the desired value. There may also be other code in core mm or in arch which do similar things. The consequence of the bug is loss of data integrity due to msync, and loss of dirty page accounting accuracy. XIP's __xip_unmap could easily also be unreliable (depending on the exact XIP locking scheme), which can lead to data corruption. Fix this by having an option to always take ptl to check the pte in page_check_address. It's possible to retain this optimization for page_referenced and try_to_unmap. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Carsten Otte <cotte@freenet.de> Cc: Hugh Dickins <hugh@veritas.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-20 15:40:32 -07:00
Johannes Weiner	481ebd0d76	bootmem: fix aligning of node-relative indexes and offsets Absolute alignment requirements may never be applied to node-relative offsets. Andreas Herrmann spotted this flaw when a bootmem allocation on an unaligned node was itself not aligned because the combination of an unaligned node with an aligned offset into that node is not garuanteed to be aligned itself. This patch introduces two helper functions that align a node-relative index or offset with respect to the node's starting address so that the absolute PFN or virtual address that results from combining the two satisfies the requested alignment. Then all the broken ALIGN()s in alloc_bootmem_core() are replaced by these helpers. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Reported-by: Andreas Herrmann <andreas.herrmann3@amd.com> Debugged-by: Andreas Herrmann <andreas.herrmann3@amd.com> Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com> Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-20 15:40:31 -07:00
Marcin Slusarz	759f9a2df7	mm: mminit_loglevel cannot be __meminitdata anymore mminit_loglevel is now used from mminit_verify_zonelist <- build_all_zonelists <- 1. online_pages <- memory_block_action <- memory_block_change_state <- store_mem_state (sys handler) 2. numa_zonelist_order_handler (proc handler) so it cannot be annotated __meminit - drop it fixes following section mismatch warning: WARNING: vmlinux.o(.text+0x71628): Section mismatch in reference from the function mminit_verify_zonelist() to the variable .meminit.data:mminit_loglevel The function mminit_verify_zonelist() references the variable __meminitdata mminit_loglevel. This is often because mminit_verify_zonelist lacks a __meminitdata annotation or the annotation of mminit_loglevel is wrong. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-20 15:40:30 -07:00
Hugh Dickins	07279cdfd9	mm: show free swap as signed Adjust <Alt><SysRq>m show_swap_cache_info() to show "Free swap" as a signed long: the signed format is preferable, because during swapoff nr_swap_pages can legitimately go negative, so makes more sense thus (it used to be shown redundantly, once as signed and once as unsigned). Signed-off-by: Hugh Dickins <hugh@veritas.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-20 15:40:30 -07:00
Hugh Dickins	16f8c5b2e6	mm: page_remove_rmap comments on PageAnon Add a comment to s390's page_test_dirty/page_clear_dirty/page_set_dirty dance in page_remove_rmap(): I was wrong to think the PageSwapCache test could be avoided, and would like a comment in there to remind me. And mention s390, to help us remember that this block is not really common. Also move down the "It would be tidy to reset PageAnon" comment: it does not belong to s390's block, and it would be unwise to reset PageAnon before we're done with testing it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-20 15:40:30 -07:00
Christoph Lameter	e2cb96b7ec	slub: Disable NUMA remote node defragmentation by default Switch remote node defragmentation off by default. The current settings can cause excessive node local allocations with hackbench: SLAB: % cat /proc/meminfo MemTotal: 7701760 kB MemFree: 5940096 kB Slab: 123840 kB SLUB: % cat /proc/meminfo MemTotal: 7701376 kB MemFree: 4740928 kB Slab: 1591680 kB [Note: this feature is not related to slab defragmentation.] You can find the original discussion here: http://lkml.org/lkml/2008/8/4/308 Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2008-08-20 21:50:21 +03:00
Linus Torvalds	71ef2a46fc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: security: Fix setting of PF_SUPERPRIV by __capable()	2008-08-15 15:32:13 -07:00
Mikulas Patocka	627240aaa9	bootmem allocator: alloc_bootmem_core(): page-align the end offset This is the minimal sequence that jams the allocator: void p, q, *r; p = alloc_bootmem(PAGE_SIZE); q = alloc_bootmem(64); free_bootmem(p, PAGE_SIZE); p = alloc_bootmem(PAGE_SIZE); r = alloc_bootmem(64); after this sequence (assuming that the allocator was empty or page-aligned before), pointer "q" will be equal to pointer "r". What's hapenning inside the allocator: p = alloc_bootmem(PAGE_SIZE); in allocator: last_end_off == PAGE_SIZE, bitmap contains bits 10000... q = alloc_bootmem(64); in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 11000... free_bootmem(p, PAGE_SIZE); in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 01000... p = alloc_bootmem(PAGE_SIZE); in allocator: last_end_off == PAGE_SIZE, bitmap contains 11000... r = alloc_bootmem(64); and now: it finds bit "2", as a place where to allocate (sidx) it hits the condition if (bdata->last_end_off && PFN_DOWN(bdata->last_end_off) + 1 == sidx)) start_off = ALIGN(bdata->last_end_off, align); -you can see that the condition is true, so it assigns start_off = ALIGN(bdata->last_end_off, align); (that is PAGE_SIZE) and allocates over already allocated block. With the patch it tries to continue at the end of previous allocation only if the previous allocation ended in the middle of the page. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Johannes Weiner <hannes@saeurebad.de> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-15 08:35:41 -07:00
Nick Piggin	5843d9a4d0	x86, pat: avoid highmem cache attribute aliasing Highmem code can leave ptes and tlb entries around for a given page even after kunmap, and after it has been freed. >From what I can gather, the PAT code may change the cache attributes of arbitrary physical addresses (ie. including highmem pages), which would result in aliases in the case that it operates on one of these lazy tlb highmem pages. Flushing kmaps should solve the problem. I've also just added code for conditional flushing if we haven't got any dangling highmem aliases -- this should help performance if we change page attributes frequently or systems that aren't using much highmem pages (eg. if < 4G RAM). Should be turned into 2 patches, but just for RFC... Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-15 17:22:57 +02:00
David Howells	5cd9c58fbe	security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>	2008-08-14 22:59:43 +10:00
Huang Weiyi	fc1efbdb7a	mm/sparse.c: removed duplicated include Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-12 16:07:30 -07:00
MinChan Kim	d6bf73e434	do_migrate_pages(): remove unused variable Signed-off-by: MinChan Kim <minchan.kim@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-12 16:07:29 -07:00
Andy Whitcroft	2b26736c88	allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2 [Andrew this should replace the previous version which did not check the returns from the region prepare for errors. This has been tested by us and Gerald and it looks good. Bah, while reviewing the locking based on your previous email I spotted that we need to check the return from the vma_needs_reservation call for allocation errors. Here is an updated patch to correct this. This passes testing here.] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-12 16:07:28 -07:00
Andy Whitcroft	57303d8017	hugetlbfs: allocate structures for reservation tracking outside of spinlocks In the normal case, hugetlbfs reserves hugepages at map time so that the pages exist for future faults. A struct file_region is used to track when reservations have been consumed and where. These file_regions are allocated as necessary with kmalloc() which can sleep with the mm->page_table_lock held. This is wrong and triggers may-sleep warning when PREEMPT is enabled. Updates to the underlying file_region are done in two phases. The first phase prepares the region for the change, allocating any necessary memory, without actually making the change. The second phase actually commits the change. This patch makes use of this by checking the reservations before the page_table_lock is taken; triggering any necessary allocations. This may then be safely repeated within the locks without any allocations being required. Credit to Mel Gorman for diagnosing this failure and initial versions of the patch. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-12 16:07:28 -07:00
Hugh Dickins	9623e078c1	memcg: fix oops in mem_cgroup_shrink_usage Got an oops in mem_cgroup_shrink_usage() when testing loop over tmpfs: yes, of course, loop0 has no mm: other entry points check but this didn't. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-12 16:07:28 -07:00
Jan Beulich	74768ed833	page allocator: use no-panic variant of alloc_bootmem() in alloc_large_system_hash() .. since a failed allocation is being (initially) handled gracefully, and panic()-ed upon failure explicitly in the function if retries with smaller sizes failed. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-12 16:07:27 -07:00
Gerald Schaefer	caff3a2c33	hugetlb: call arch_prepare_hugepage() for surplus pages The s390 software large page emulation implements shared page tables by using page->index of the first tail page from a compound large page to store page table information. This is set up in arch_prepare_hugepage(), which is called from alloc_fresh_huge_page_node(). A similar call to arch_prepare_hugepage() is missing for surplus large pages that are allocated in alloc_buddy_huge_page(), which breaks the software emulation mode for (surplus) large pages on s390. This patch adds the missing call to arch_prepare_hugepage(). It will have no effect on other architectures where arch_prepare_hugepage() is a nop. Also, use the correct order in the error path in alloc_fresh_huge_page_node(). Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-12 16:07:27 -07:00
Linus Torvalds	1c89ac5501	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: fix spinlock recursion in hvc_console stop_machine: remove unused variable modules: extend initcall_debug functionality to the module loader export virtio_rng.h lguest: use get_user_pages_fast() instead of get_user_pages() mm: Make generic weak get_user_pages_fast and EXPORT_GPL it lguest: don't set MAC address for guest unless specified	2008-08-12 08:40:19 -07:00
Rusty Russell	912985dce4	mm: Make generic weak get_user_pages_fast and EXPORT_GPL it Out of line get_user_pages_fast fallback implementation, make it a weak symbol, get rid of CONFIG_HAVE_GET_USER_PAGES_FAST. Export the symbol to modules so lguest can use it. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-08-12 17:52:53 +10:00
Linus Torvalds	9b4d0bab32	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: fix debug_lock_alloc lockdep: increase MAX_LOCKDEP_KEYS generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask() lockdep: fix overflow in the hlock shrinkage code lockdep: rename map_[acquire\|release]() => lock_map_[acquire\|release]() lockdep: handle chains involving classes defined in modules mm: fix mm_take_all_locks() locking order lockdep: annotate mm_take_all_locks() lockdep: spin_lock_nest_lock() lockdep: lock protection locks lockdep: map_acquire lockdep: shrink held_lock structure lockdep: re-annotate scheduler runqueues lockdep: lock_set_subclass - reset a held lock's subclass lockdep: change scheduler annotation debug_locks: set oops_in_progress if we will log messages. lockdep: fix combinatorial explosion in lock subgraph traversal	2008-08-11 16:45:46 -07:00
Ingo Molnar	23a0ee908c	Merge branch 'core/locking' into core/urgent	2008-08-12 00:11:49 +02:00
Peter Zijlstra	7cd5a02f54	mm: fix mm_take_all_locks() locking order Lockdep spotted: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.27-rc1 #270 ------------------------------------------------------- qemu-kvm/2033 is trying to acquire lock: (&inode->i_data.i_mmap_lock){----}, at: [<ffffffff802996cc>] mm_take_all_locks+0xc2/0xea but task is already holding lock: (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&anon_vma->lock){----}: [<ffffffff8025cd37>] __lock_acquire+0x11be/0x14d2 [<ffffffff8025d0a9>] lock_acquire+0x5e/0x7a [<ffffffff804c655b>] _spin_lock+0x3b/0x47 [<ffffffff8029a2ef>] vma_adjust+0x200/0x444 [<ffffffff8029a662>] split_vma+0x12f/0x146 [<ffffffff8029bc60>] mprotect_fixup+0x13c/0x536 [<ffffffff8029c203>] sys_mprotect+0x1a9/0x21e [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff -> #0 (&inode->i_data.i_mmap_lock){----}: [<ffffffff8025ca54>] __lock_acquire+0xedb/0x14d2 [<ffffffff8025d397>] lock_release_non_nested+0x1c2/0x219 [<ffffffff8025d515>] lock_release+0x127/0x14a [<ffffffff804c6403>] _spin_unlock+0x1e/0x50 [<ffffffff802995d9>] mm_drop_all_locks+0x7f/0xb0 [<ffffffff802a965d>] do_mmu_notifier_register+0xe2/0x112 [<ffffffff802a96a8>] mmu_notifier_register+0xe/0x10 [<ffffffffa0043b6b>] kvm_dev_ioctl+0x11e/0x287 [kvm] [<ffffffff802bd0ca>] vfs_ioctl+0x2a/0x78 [<ffffffff802bd36f>] do_vfs_ioctl+0x257/0x274 [<ffffffff802bd3e1>] sys_ioctl+0x55/0x78 [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff other info that might help us debug this: 5 locks held by qemu-kvm/2033: #0: (&mm->mmap_sem){----}, at: [<ffffffff802a95d0>] do_mmu_notifier_register+0x55/0x112 #1: (mm_all_locks_mutex){--..}, at: [<ffffffff8029963e>] mm_take_all_locks+0x34/0xea #2: (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea #3: (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea #4: (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea stack backtrace: Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270 Call Trace: [<ffffffff8025b7c7>] print_circular_bug_tail+0xb8/0xc3 [<ffffffff8025ca54>] __lock_acquire+0xedb/0x14d2 [<ffffffff80259bb1>] ? add_lock_to_list+0x7e/0xad [<ffffffff8029967a>] ? mm_take_all_locks+0x70/0xea [<ffffffff8029967a>] ? mm_take_all_locks+0x70/0xea [<ffffffff8025d397>] lock_release_non_nested+0x1c2/0x219 [<ffffffff802996cc>] ? mm_take_all_locks+0xc2/0xea [<ffffffff802996cc>] ? mm_take_all_locks+0xc2/0xea [<ffffffff8025b202>] ? trace_hardirqs_on_caller+0x4d/0x115 [<ffffffff802995d9>] ? mm_drop_all_locks+0x7f/0xb0 [<ffffffff8025d515>] lock_release+0x127/0x14a [<ffffffff804c6403>] _spin_unlock+0x1e/0x50 [<ffffffff802995d9>] mm_drop_all_locks+0x7f/0xb0 [<ffffffff802a965d>] do_mmu_notifier_register+0xe2/0x112 [<ffffffff802a96a8>] mmu_notifier_register+0xe/0x10 [<ffffffffa0043b6b>] kvm_dev_ioctl+0x11e/0x287 [kvm] [<ffffffff8033f9f2>] ? file_has_perm+0x83/0x8e [<ffffffff802bd0ca>] vfs_ioctl+0x2a/0x78 [<ffffffff802bd36f>] do_vfs_ioctl+0x257/0x274 [<ffffffff802bd3e1>] sys_ioctl+0x55/0x78 [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b Which the locking hierarchy in mm/rmap.c confirms as valid. Fix this by first taking all the mapping->i_mmap_lock instances and then take all anon_vma->lock instances. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 09:30:25 +02:00
Peter Zijlstra	454ed842d5	lockdep: annotate mm_take_all_locks() The nesting is correct due to holding mmap_sem, use the new annotation to annotate this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 09:30:25 +02:00
Linus Torvalds	4fbb71597a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: SLUB: dynamic per-cache MIN_PARTIAL mm: unexport ksize	2008-08-09 16:21:33 -07:00
Linus Torvalds	d6606683a5	Revert duplicate "mm/hugetlb.c must #include <asm/io.h>" This reverts commit `7cb9318162`, since we did that patch twice, and the problem was already fixed earlier by `78a34ae29b`. Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-06 12:04:54 -07:00
Benny Halevy	dfe195fb79	mm: fix uninitialized variables for find_vma_prepare callers gcc 4.3.0 correctly emits the following warnings. When a vma covering addr is found, find_vma_prepare indeed returns without setting pprev, rb_link, and rb_parent. mm/mmap.c: In function `insert_vm_struct': mm/mmap.c:2085: warning: `rb_parent' may be used uninitialized in this function mm/mmap.c:2085: warning: `rb_link' may be used uninitialized in this function mm/mmap.c:2084: warning: `prev' may be used uninitialized in this function mm/mmap.c: In function `copy_vma': mm/mmap.c:2124: warning: `rb_parent' may be used uninitialized in this function mm/mmap.c:2124: warning: `rb_link' may be used uninitialized in this function mm/mmap.c:2123: warning: `prev' may be used uninitialized in this function mm/mmap.c: In function `do_brk': mm/mmap.c:1951: warning: `rb_parent' may be used uninitialized in this function mm/mmap.c:1951: warning: `rb_link' may be used uninitialized in this function mm/mmap.c:1949: warning: `prev' may be used uninitialized in this function mm/mmap.c: In function `mmap_region': mm/mmap.c:1092: warning: `rb_parent' may be used uninitialized in this function mm/mmap.c:1092: warning: `rb_link' may be used uninitialized in this function mm/mmap.c:1089: warning: `prev' may be used uninitialized in this function Hugh adds: in fact, none of find_vma_prepare's callers use those values when a vma is found to be already covering addr, it's either an error or an occasion to munmap and repeat. Okay, let's quieten the compiler (but I would prefer it if pprev, rb_link and rb_parent were meaningful in that case, rather than whatever's in them from descending the tree). Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: "Ryan Hope" <rmh3093@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-05 14:33:50 -07:00
Andrew Morton	5c9ffc9c3d	mm_init.c: avoid ifdef-inside-macro-expansion gcc-3.2: mm/mm_init.c:77:1: directives may not be used inside a macro argument mm/mm_init.c:76:47: unterminated argument list invoking macro "mminit_dprintk" mm/mm_init.c: In function `mminit_verify_pageflags_layout': mm/mm_init.c:80: `mminit_dprintk' undeclared (first use in this function) mm/mm_init.c:80: (Each undeclared identifier is reported only once mm/mm_init.c:80: for each function it appears in.) mm/mm_init.c:80: syntax error before numeric constant Also fix a typo in a comment. Reported-by: Adrian Bunk <bunk@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-05 14:33:46 -07:00
Pekka Enberg	5595cffc82	SLUB: dynamic per-cache MIN_PARTIAL This patch changes the static MIN_PARTIAL to a dynamic per-cache ->min_partial value that is calculated from object size. The bigger the object size, the more pages we keep on the partial list. I tested SLAB, SLUB, and SLUB with this patch on Jens Axboe's 'netio' example script of the fio benchmarking tool. The script stresses the networking subsystem which should also give a fairly good beating of kmalloc() et al. To run the test yourself, first clone the fio repository: git clone git://git.kernel.dk/fio.git and then run the following command n times on your machine: time ./fio examples/netio The results on my 2-way 64-bit x86 machine are as follows: [ the minimum, maximum, and average are captured from 50 individual runs ] real time (seconds) min max avg sd SLAB 22.76 23.38 22.98 0.17 SLUB 22.80 25.78 23.46 0.72 SLUB (dynamic) 22.74 23.54 23.00 0.20 sys time (seconds) min max avg sd SLAB 6.90 8.28 7.70 0.28 SLUB 7.42 16.95 8.89 2.28 SLUB (dynamic) 7.17 8.64 7.73 0.29 user time (seconds) min max avg sd SLAB 36.89 38.11 37.50 0.29 SLUB 30.85 37.99 37.06 1.67 SLUB (dynamic) 36.75 38.07 37.59 0.32 As you can see from the above numbers, this patch brings SLUB to the same level as SLAB for this particular workload fixing a ~2% regression. I'd expect this change to help similar workloads that allocate a lot of objects that are close to the size of a page. Cc: Matthew Wilcox <matthew@wil.cx> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>	2008-08-05 09:28:47 +03:00
Nick Piggin	529ae9aaa0	mm: rename page trylock Converting page lock to new locking bitops requires a change of page flag operation naming, so we might as well convert it to something nicer (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked). This also facilitates lockdeping of page lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-04 21:31:34 -07:00
Linus Torvalds	2e1e9212ed	Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (29 commits) sh: enable maple_keyb in dreamcast_defconfig. SH2(A) cache update nommu: Provide vmalloc_exec(). add addrespace definition for sh2a. sh: Kill off ARCH_SUPPORTS_AOUT and remnants of a.out support. sh: define GENERIC_HARDIRQS_NO__DO_IRQ. sh: define GENERIC_LOCKBREAK. sh: Save NUMA node data in vmcore for crash dumps. sh: module_alloc() should be using vmalloc_exec(). sh: Fix up __bug_table handling in module loader. sh: Add documentation and integrate into docbook build. sh: Fix up broken kerneldoc comments. maple: Kill useless private_data pointer. maple: Clean up maple_driver_register/unregister routines. input: Clean up maple keyboard driver maple: allow removal and reinsertion of keyboard driver module sh: /proc/asids depends on MMU. arch/sh/boards/mach-se/7343/irq.c: removed duplicated #include arch/sh/boards/board-ap325rxa.c: removed duplicated #include sh/boards/Makefile typo fix ...	2008-08-04 17:26:15 -07:00
KOSAKI Motohiro	a477097d9c	mlock() fix return values Halesh says: Please find the below testcase provide to test mlock. Test Case : =========================== #include <sys/resource.h> #include <stdio.h> #include <sys/stat.h> #include <sys/types.h> #include <unistd.h> #include <sys/mman.h> #include <fcntl.h> #include <errno.h> #include <stdlib.h> int main(void) { int fd,ret, i = 0; char addr, addr1 = NULL; unsigned int page_size; struct rlimit rlim; if (0 != geteuid()) { printf("Execute this pgm as root\n"); exit(1); } /* create a file / if ((fd = open("mmap_test.c",O_RDWR\|O_CREAT,0755)) == -1) { printf("cant create test file\n"); exit(1); } page_size = sysconf(_SC_PAGE_SIZE); / set the MEMLOCK limit / rlim.rlim_cur = 2000; rlim.rlim_max = 2000; if ((ret = setrlimit(RLIMIT_MEMLOCK,&rlim)) != 0) { printf("Cant change limit values\n"); exit(1); } addr = 0; while (1) { / map a page into memory each time/ if ((addr = (char ) mmap(addr,page_size, PROT_READ \| PROT_WRITE,MAP_SHARED,fd,0)) == MAP_FAILED) { printf("cant do mmap on file\n"); exit(1); } if (0 == i) addr1 = addr; i++; errno = 0; /* lock the mapped memory pagewise/ if ((ret = mlock((char )addr, 1500)) == -1) { printf("errno value is %d\n", errno); printf("cant lock maped region\n"); exit(1); } addr = addr + page_size; } } ====================================================== This testcase results in an mlock() failure with errno 14 that is EFAULT, but it has nowhere been specified that mlock() will return EFAULT. When I tested the same on older kernels like 2.6.18, I got the correct result i.e errno 12 (ENOMEM). I think in source code mlock(2), setting errno ENOMEM has been missed in do_mlock() , on mlock_fixup() failure. SUSv3 requires the following behavior frmo mlock(2). [ENOMEM] Some or all of the address range specified by the addr and len arguments does not correspond to valid mapped pages in the address space of the process. [EAGAIN] Some or all of the memory identified by the operation could not be locked when the call was made. This rule isn't so nice and slighly strange. but many people think POSIX/SUS compliance is important. Reported-by: Halesh Sadashiv <halesh.sadashiv@ap.sony.com> Tested-by: Halesh Sadashiv <halesh.sadashiv@ap.sony.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-04 16:58:45 -07:00
Paul Mundt	1af446edfe	nommu: Provide vmalloc_exec(). Now that SH has switched to vmalloc_exec() for PAGE_KERNEL_EXEC usage, it's apparent that nommu has no vmalloc_exec() definition of its own. Stub in the one from mm/vmalloc.c. Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2008-08-04 16:01:47 +09:00
Miklos Szeredi	84209e02de	mm: dont clear PG_uptodate on truncate/invalidate Brian Wang reported that a FUSE filesystem exported through NFS could return I/O errors on read. This was traced to splice_direct_to_actor() returning a short or zero count when racing with page invalidation. However this is not FUSE or NFSD specific, other filesystems (notably NFS) also call invalidate_inode_pages2() to purge stale data from the cache. If this happens while such pages are sitting in a pipe buffer, then splice(2) from the pipe can return zero, and read(2) from the pipe can return ENODATA. The zero return is especially bad, since it implies end-of-file or disconnected pipe/socket, and is documented as such for splice. But returning an error for read() is also nasty, when in fact there was no error (data becoming stale is not an error). The same problems can be triggered by "hole punching" with madvise(MADV_REMOVE). Fix this by not clearing the PG_uptodate flag on truncation and invalidation. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-02 09:12:34 -07:00
Jack Steiner	3669bc143f	Remove EXPORTS of follow_page & zap_page_range Delete 2 EXPORTs that were accidentally sent upstream. Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-01 13:19:16 -07:00
Benjamin Herrenschmidt	0ef89d25d3	mm/hugetlb: don't crash when HPAGE_SHIFT is 0 Some platform decide whether they support huge pages at boot time. On these, such as powerpc, HPAGE_SHIFT is a variable, not a constant, and is set to 0 when there is no such support. The patches to introduce multiple huge pages support broke that causing the kernel to crash at boot time on machines such as POWER3 which lack support for multiple page sizes. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-01 12:46:41 -07:00

1 2 3 4 5 ...

2529 Коммитов