linux/mm/sparse.c, branch v3.8

memory-hotplug, mm/sparse.c: clear the memory to store struct page

2012-12-12T01:22:23Z

If sparse memory vmemmap is enabled, we can't free the memory to store struct page when a memory device is hotremoved, because we may store struct page in the memory to manage the memory which doesn't belong to this memory device. When we hotadded this memory device again, we will reuse this memory to store struct page, and struct page may contain some obsolete information, and we will get bad-page state: init_memory_mapping: [mem 0x80000000-0x9fffffff] Built 2 zonelists in Node order, mobility grouping on. Total pages: 547617 Policy zone: Normal BUG: Bad page state in process bash pfn:9b6dc page:ffffea0002200020 count:0 mapcount:0 mapping: (null) index:0xfdfdfdfdfdfdfdfd page flags: 0x2fdfdfdfd5df9fd(locked|referenced|uptodate|dirty|lru|active|slab|owner_priv_1|private|private_2|writeback|head|tail|swapcache|reclaim|swapbacked|unevictable|uncached|compound_lock) Modules linked in: netconsole acpiphp pci_hotplug acpi_memhotplug loop kvm_amd kvm microcode tpm_tis tpm tpm_bios evdev psmouse serio_raw i2c_piix4 i2c_core parport_pc parport processor button thermal_sys ext3 jbd mbcache sg sr_mod cdrom ata_generic virtio_net ata_piix virtio_blk libata virtio_pci virtio_ring virtio scsi_mod Pid: 988, comm: bash Not tainted 3.6.0-rc7-guest #12 Call Trace: [] ? bad_page+0xb0/0x100 [] ? free_pages_prepare+0xb3/0x100 [] ? free_hot_cold_page+0x48/0x1a0 [] ? online_pages_range+0x68/0xa0 [] ? __online_page_increment_counters+0x10/0x10 [] ? walk_system_ram_range+0x101/0x110 [] ? online_pages+0x1a5/0x2b0 [] ? __memory_block_change_state+0x20d/0x270 [] ? store_mem_state+0xb6/0xf0 [] ? sysfs_write_file+0xd2/0x160 [] ? vfs_write+0xaa/0x160 [] ? sys_write+0x47/0x90 [] ? async_page_fault+0x25/0x30 [] ? system_call_fastpath+0x16/0x1b Disabling lock debugging due to kernel taint This patch clears the memory to store struct page to avoid unexpected error. Signed-off-by: Wen Congyang Cc: David Rientjes Cc: Jiang Liu Cc: Minchan Kim Acked-by: KOSAKI Motohiro Cc: Yasuaki Ishimatsu Reported-by: Vasilis Liaskovitis Cc: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memory-hotplug: update mce_bad_pages when removing the memory

2012-12-12T01:22:22Z

When we hotremove a memory device, we will free the memory to store struct page. If the page is hwpoisoned page, we should decrease mce_bad_pages. [akpm@linux-foundation.org: cleanup ifdefs] Signed-off-by: Wen Congyang Cc: David Rientjes Cc: Jiang Liu Cc: Len Brown Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Christoph Lameter Cc: Minchan Kim Cc: KOSAKI Motohiro Cc: Yasuaki Ishimatsu Cc: Dave Hansen Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm/vmemmap: fix wrong use of virt_to_page

2012-11-30T16:51:17Z

I enable CONFIG_DEBUG_VIRTUAL and CONFIG_SPARSEMEM_VMEMMAP, when doing memory hotremove, there is a kernel BUG at arch/x86/mm/physaddr.c:20. It is caused by free_section_usemap()->virt_to_page(), virt_to_page() is only used for kernel direct mapping address, but sparse-vmemmap uses vmemmap address, so it is going wrong here. ------------[ cut here ]------------ kernel BUG at arch/x86/mm/physaddr.c:20! invalid opcode: 0000 [#1] SMP Modules linked in: acpihp_drv acpihp_slot edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_intel ipv6 ixgbe igb iTCO_wdt i7core_edac edac_core pcspkr iTCO_vendor_support ioatdma microcode joydev sr_mod i2c_i801 dca lpc_ich mfd_core mdio tpm_tis i2c_core hid_generic tpm cdrom sg tpm_bios rtc_cmos button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod CPU 39 Pid: 6454, comm: sh Not tainted 3.7.0-rc1-acpihp-final+ #45 QCI QSSC-S4R/QSSC-S4R RIP: 0010:[] [] __phys_addr+0x88/0x90 RSP: 0018:ffff8804440d7c08 EFLAGS: 00010006 RAX: 0000000000000006 RBX: ffffea0012000000 RCX: 000000000000002c ... Signed-off-by: Jianguo Wu Signed-off-by: Jiang Liu Reviewd-by: Wen Congyang Acked-by: Johannes Weiner Reviewed-by: Yasuaki Ishimatsu Reviewed-by: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm/sparse: remove index_init_lock

2012-08-01T01:42:49Z

sparse_index_init() uses the index_init_lock spinlock to protect root mem_section assignment. The lock is not necessary anymore because the function is called only during boot (during paging init which is executed only from a single CPU) and from the hotplug code (by add_memory() via arch_add_memory()) which uses mem_hotplug_mutex. The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug preparation") and sparse_index_init() was used only during boot at that time. Later when the hotplug code (and add_memory()) was introduced there was no synchronization so it was possible to online more sections from the same root probably (though I am not 100% sure about that). The first synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and hibernation in the same kernel") which was later replaced by the mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce {un}lock_memory_hotplug()"). Let's remove the lock as it is not needed and it makes the code more confusing. [mhocko@suse.cz: changelog] Signed-off-by: Gavin Shan Reviewed-by: Michal Hocko Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm/sparse: more checks on mem_section number

2012-08-01T01:42:49Z

__section_nr() was implemented to retrieve the corresponding memory section number according to its descriptor. It's possible that the specified memory section descriptor doesn't exist in the global array. So add more checking on that and report an error for a wrong case. Signed-off-by: Gavin Shan Acked-by: David Rientjes Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm/sparse: optimize sparse_index_alloc

2012-08-01T01:42:49Z

With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section descriptors are allocated from slab or bootmem. When allocating from slab, let slab/bootmem allocator clear the memory chunk. We needn't clear it explicitly. Signed-off-by: Gavin Shan Reviewed-by: Michal Hocko Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: setup pageblock_order before it's used by sparsemem

2012-08-01T01:42:43Z

On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as Itanium, pageblock_order is a variable with default value of 0. It's set to the right value by set_pageblock_order() in function free_area_init_core(). But pageblock_order may be used by sparse_init() before free_area_init_core() is called along path: sparse_init() ->sparse_early_usemaps_alloc_node() ->usemap_size() ->SECTION_BLOCKFLAGS_BITS ->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS) The uninitialized pageblock_size will cause memory wasting because usemap_size() returns a much bigger value then it's really needed. For example, on an Itanium platform, sparse_init() pageblock_order=0 usemap_size=24576 free_area_init_core() before pageblock_order=0, usemap_size=24576 free_area_init_core() after pageblock_order=12, usemap_size=8 That means 24K memory has been wasted for each section, so fix it by calling set_pageblock_order() from sparse_init(). Signed-off-by: Xishi Qiu Signed-off-by: Jiang Liu Cc: Tony Luck Cc: Yinghai Lu Cc: KAMEZAWA Hiroyuki Cc: Benjamin Herrenschmidt Cc: KOSAKI Motohiro Cc: David Rientjes Cc: Keping Chen Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: sparse: fix usemap allocation above node descriptor section

2012-07-11T23:04:49Z

After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint in alloc_bootmem_section"), usemap allocations may easily be placed outside the optimal section that holds the node descriptor, even if there is space available in that section. This results in unnecessary hotplug dependencies that need to have the node unplugged before the section holding the usemap. The reason is that the bootmem allocator doesn't guarantee a linear search starting from the passed allocation goal but may start out at a much higher address absent an upper limit. Fix this by trying the allocation with the limit at the section end, then retry without if that fails. This keeps the fix from f5bf18fa22f8 of not panicking if the allocation does not fit in the section, but still makes sure to try to stay within the section at first. Signed-off-by: Yinghai Lu Signed-off-by: Johannes Weiner Cc: [3.3.x, 3.4.x] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: sparse: fix section usemap placement calculation

2012-07-11T23:04:49Z

Commit 238305bb4d41 ("mm: remove sparsemem allocation details from the bootmem allocator") introduced a bug in the allocation goal calculation that put section usemaps not in the same section as the node descriptors, creating unnecessary hotplug dependencies between them: node 0 must be removed before remove section 16399 node 1 must be removed before remove section 16399 node 2 must be removed before remove section 16399 node 3 must be removed before remove section 16399 node 4 must be removed before remove section 16399 node 5 must be removed before remove section 16399 node 6 must be removed before remove section 16399 The reason is that it applies PAGE_SECTION_MASK to the physical address of the node descriptor when finding a suitable place to put the usemap, when this mask is actually intended to be used with PFNs. Because the PFN mask is wider, the target address will point beyond the wanted section holding the node descriptor and the node must be offlined before the section holding the usemap can go. Fix this by extending the mask to address width before use. Signed-off-by: Yinghai Lu Signed-off-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: remove sparsemem allocation details from the bootmem allocator

2012-05-29T23:22:22Z

alloc_bootmem_section() derives allocation area constraints from the specified sparsemem section. This is a bit specific for a generic memory allocator like bootmem, though, so move it over to sparsemem. As __alloc_bootmem_node_nopanic() already retries failed allocations with relaxed area constraints, the fallback code in sparsemem.c can be removed and the code becomes a bit more compact overall. [akpm@linux-foundation.org: fix build] Signed-off-by: Johannes Weiner Acked-by: Tejun Heo Acked-by: David S. Miller Cc: Yinghai Lu Cc: Gavin Shan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds