linux/mm/page_alloc.c, branch v3.0

Revert "mm: fail GFP_DMA allocations when ZONE_DMA is not configured"

2011-06-01T21:11:24Z

This reverts commit a197b59ae6e8bee56fcef37ea2482dc08414e2ac. As rmk says: "Commit a197b59ae6e8 (mm: fail GFP_DMA allocations when ZONE_DMA is not configured) is causing regressions on ARM with various drivers which use GFP_DMA. The behaviour up until now has been to silently ignore that flag when CONFIG_ZONE_DMA is not enabled, and to allocate from the normal zone. However, as a result of the above commit, such allocations now fail which causes drivers to fail. These are regressions compared to the previous kernel version." so just revert it. Requested-by: Russell King Acked-by: Andrew Morton Cc: David Rientjes Signed-off-by: Linus Torvalds

memcg: fix get_scan_count() for small targets

2011-05-27T00:12:35Z

During memory reclaim we determine the number of pages to be scanned per zone as (anon + file) >> priority. Assume scan = (anon + file) >> priority. If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and priority gets higher. This has some problems. 1. This increases priority as 1 without any scan. To do scan in this priority, amount of pages should be larger than 512M. If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and scan will be batched, later. (But we lose 1 priority.) If memory size is below 16M, pages >> priority is 0 and no scan in DEF_PRIORITY forever. 2. If zone->all_unreclaimabe==true, it's scanned only when priority==0. So, x86's ZONE_DMA will never be recoverred until the user of pages frees memory by itself. 3. With memcg, the limit of memory can be small. When using small memcg, it gets priority < DEF_PRIORITY-2 very easily and need to call wait_iff_congested(). For doing scan before priorty=9, 64MB of memory should be used. Then, this patch tries to scan SWAP_CLUSTER_MAX of pages in force...when 1. the target is enough small. 2. it's kswapd or memcg reclaim. Then we can avoid rapid priority drop and may be able to recover all_unreclaimable in a small zones. And this patch removes nr_saved_scan. This will allow scanning in this priority even when pages >> priority is very small. Signed-off-by: KAMEZAWA Hiroyuki Acked-by: Ying Han Cc: Balbir Singh Cc: KOSAKI Motohiro Cc: Daisuke Nishimura Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm/page_alloc.c: prevent unending loop in __alloc_pages_slowpath()

2011-05-25T15:39:36Z

I believe I found a problem in __alloc_pages_slowpath, which allows a process to get stuck endlessly looping, even when lots of memory is available. Running an I/O and memory intensive stress-test I see a 0-order page allocation with __GFP_IO and __GFP_WAIT, running on a system with very little free memory. Right about the same time that the stress-test gets killed by the OOM-killer, the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even though most of the systems memory was freed by the oom-kill of the stress-test. The utility ends up looping from the rebalance label down through the wait_iff_congested continiously. Because order=0, __alloc_pages_direct_compact skips the call to get_page_from_freelist. Because all of the reclaimable memory on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested, then jumps back to rebalance without ever trying to get_page_from_freelist. This loop repeats infinitely. The test case is pretty pathological. Running a mix of I/O stress-tests that do a lot of fork() and consume all of the system memory, I can pretty reliably hit this on 600 nodes, in about 12 hours. 32GB/node. Signed-off-by: Andrew Barry Signed-off-by: Minchan Kim Reviewed-by: Rik van Riel Acked-by: Mel Gorman Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: fail GFP_DMA allocations when ZONE_DMA is not configured

2011-05-25T15:39:29Z

The page allocator will improperly return a page from ZONE_NORMAL even when __GFP_DMA is passed if CONFIG_ZONE_DMA is disabled. The caller expects DMA memory, perhaps for ISA devices with 16-bit address registers, and may get higher memory resulting in undefined behavior. This patch causes the page allocator to return NULL in such circumstances with a warning emitted to the kernel log on the first occurrence. Signed-off-by: David Rientjes Cc: Mel Gorman Cc: KOSAKI Motohiro Cc: KAMEZAWA Hiroyuki Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: check if any page in a pageblock is reserved before marking it MIGRATE_RESERVE

2011-05-25T15:39:24Z

This fixes a problem where the first pageblock got marked MIGRATE_RESERVE even though it only had a few free pages. eg, On current ARM port, The kernel starts at offset 0x8000 to leave room for boot parameters, and the memory is freed later. This in turn caused no contiguous memory to be reserved and frequent kswapd wakeups that emptied the caches to get more contiguous memory. Unfortunatelly, ARM needs order-2 allocation for pgd (see arm/mm/pgd.c#pgd_alloc()). Therefore the issue is not minor nor easy avoidable. [kosaki.motohiro@jp.fujitsu.com: added some explanation] [kosaki.motohiro@jp.fujitsu.com: add !pfn_valid_within() to check] [minchan.kim@gmail.com: check end_pfn in pageblock_is_reserved] Signed-off-by: John Stultz Signed-off-by: Arve Hjønnevåg Signed-off-by: KOSAKI Motohiro Acked-by: Mel Gorman Acked-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: break out page allocation warning code

2011-05-25T15:39:21Z

This originally started as a simple patch to give vmalloc() some more verbose output on failure on top of the plain page allocator messages. Johannes suggested that it might be nicer to lead with the vmalloc() info _before_ the page allocator messages. But, I do think there's a lot of value in what __alloc_pages_slowpath() does with its filtering and so forth. This patch creates a new function which other allocators can call instead of relying on the internal page allocator warnings. It also gives this function private rate-limiting which separates it from other printk_ratelimit() users. Signed-off-by: Dave Hansen Cc: Johannes Weiner Cc: David Rientjes Cc: Michal Nazarewicz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm/compaction: reverse the change that forbade sync migraton with __GFP_NO_KSWAPD

2011-05-25T15:39:10Z

It's uncertain this has been beneficial, so it's safer to undo it. All other compaction users would still go in synchronous mode if a first attempt at async compaction failed. Hopefully we don't need to force special behavior for THP (which is the only __GFP_NO_KSWAPD user so far and it's the easier to exercise and to be noticeable). This also make __GFP_NO_KSWAPD return to its original strict semantics specific to bypass kswapd, as THP allocations have khugepaged for the async THP allocations/compactions. Signed-off-by: Andrea Arcangeli Cc: Alex Villacis Lasso Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur

2011-05-25T15:39:09Z

Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug doesn't. There is no reason for this. [akpm@linux-foundation.org: fix CONFIG_SMP=n build] Signed-off-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Acked-by: Mel Gorman Acked-by: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm, mem-hotplug: recalculate lowmem_reserve when memory hotplug occurs

2011-05-25T15:39:09Z

Currently, memory hotplug calls setup_per_zone_wmarks() and calculate_zone_inactive_ratio(), but doesn't call setup_per_zone_lowmem_reserve(). It means the number of reserved pages aren't updated even if memory hot plug occur. This patch fixes it. Signed-off-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Acked-by: Mel Gorman Reviewed-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm, mem-hotplug: fix section mismatch. setup_per_zone_inactive_ratio() should be __meminit.

2011-05-25T15:39:08Z

Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of zone when hotplug happens") introduced invalid section references. Now, setup_per_zone_inactive_ratio() is marked __init and then it can't be referenced from memory hotplug code. This patch marks it as __meminit and also marks caller as __ref. Signed-off-by: KOSAKI Motohiro Reviewed-by: Minchan Kim Cc: Yasunori Goto Cc: Rik van Riel Cc: Johannes Weiner Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds