linux/mm/backing-dev.c, branch v4.9

block: fix bdi vs gendisk lifetime mismatch

2016-08-04T20:19:16Z

The name for a bdi of a gendisk is derived from the gendisk's devt. However, since the gendisk is destroyed before the bdi it leaves a window where a new gendisk could dynamically reuse the same devt while a bdi with the same name is still live. Arrange for the bdi to hold a reference against its "owner" disk device while it is registered. Otherwise we can hit sysfs duplicate name collisions like the following: WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80 sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1' Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015 0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351 0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8 Call Trace: [] dump_stack+0x63/0x87 [] __warn+0xd1/0xf0 [] warn_slowpath_fmt+0x5f/0x80 [] sysfs_warn_dup+0x64/0x80 [] sysfs_create_dir_ns+0x7e/0x90 [] kobject_add_internal+0xaa/0x320 [] ? vsnprintf+0x34e/0x4d0 [] kobject_add+0x75/0xd0 [] ? mutex_lock+0x12/0x2f [] device_add+0x125/0x610 [] device_create_groups_vargs+0xd8/0x100 [] device_create_vargs+0x1c/0x20 [] bdi_register+0x8c/0x180 [] bdi_register_dev+0x27/0x30 [] add_disk+0x175/0x4a0 Cc: Reported-by: Yi Zhang Tested-by: Yi Zhang Signed-off-by: Dan Williams Fixed up missing 0 return in bdi_register_owner(). Signed-off-by: Jens Axboe

mm, vmscan: move LRU lists to node

2016-07-28T23:07:41Z

This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Hillf Danton Cc: Joonsoo Kim Cc: Michal Hocko Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: throttle on IO only when there are too many dirty and writeback pages

2016-05-21T00:58:30Z

wait_iff_congested has been used to throttle allocator before it retried another round of direct reclaim to allow the writeback to make some progress and prevent reclaim from looping over dirty/writeback pages without making any progress. We used to do congestion_wait before commit 0e093d99763e ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone") but that led to undesirable stalls and sleeping for the full timeout even when the BDI wasn't congested. Hence wait_iff_congested was used instead. But it seems that even wait_iff_congested doesn't work as expected. We might have a small file LRU list with all pages dirty/writeback and yet the bdi is not congested so this is just a cond_resched in the end and can end up triggering pre mature OOM. This patch replaces the unconditional wait_iff_congested by congestion_wait which is executed only if we _know_ that the last round of direct reclaim didn't make any progress and dirty+writeback pages are more than a half of the reclaimable pages on the zone which might be usable for our target allocation. This shouldn't reintroduce stalls fixed by 0e093d99763e because congestion_wait is called only when we are getting hopeless when sleeping is a better choice than OOM with many pages under IO. We have to preserve logic introduced by commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") into the __alloc_pages_slowpath now that wait_iff_congested is not used anymore. As the only remaining user of wait_iff_congested is shrink_inactive_list we can remove the WQ specific short sleep from wait_iff_congested because the sleep is needed to be done only once in the allocation retry cycle. [mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly] Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Hillf Danton Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tetsuo Handa Cc: Vladimir Davydov Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

writeback: fix the wrong congested state variable definition

2016-03-31T18:26:25Z

The right variable definition should be wb_congested_state that include WB_async_congested and WB_sync_congested. So fix it. Signed-off-by: Kaixu Xia Acked-by: Tejun Heo Signed-off-by: Jens Axboe

mm: convert printk(KERN_ to pr_

2016-03-17T22:09:34Z

Most of the mm subsystem uses pr_ so make it consistent. Miscellanea: - Realign arguments - Add missing newline to format - kmemleak-test.c has a "kmemleak: " prefix added to the "Kmemleak testing" logging message via pr_fmt Signed-off-by: Joe Perches Acked-by: Tejun Heo [percpu] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm/backing-dev.c: fix error path in wb_init()

2016-02-12T02:35:48Z

We need to use post-decrement to get percpu_counter_destroy() called on &wb->stat[0]. Moreover, the pre-decremebt would cause infinite out-of-bounds accesses if the setup code failed at i==0. Signed-off-by: Rasmus Villemoes Cc: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress

2016-02-06T02:10:40Z

Jan Stancek has reported that system occasionally hanging after "oom01" testcase from LTP triggers OOM. Guessing from a result that there is a kworker thread doing memory allocation and the values between "Node 0 Normal free:" and "Node 0 Normal:" differs when hanging, vmstat is not up-to-date for some reason. According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress"), it meant to force the kworker thread to take a short sleep, but it by error used schedule_timeout(1). We missed that schedule_timeout() in state TASK_RUNNING doesn't do anything. Fix it by using schedule_timeout_uninterruptible(1) which forces the kworker thread to take a short sleep in order to make sure that vmstat is up-to-date. Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") Signed-off-by: Tetsuo Handa Reported-by: Jan Stancek Acked-by: Michal Hocko Cc: Tejun Heo Cc: Cristopher Lameter Cc: Joonsoo Kim Cc: Arkadiusz Miskiewicz Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: memcontrol: export root_mem_cgroup

2016-01-15T00:00:49Z

A later patch will need this symbol in files other than memcontrol.c, so export it now and replace mem_cgroup_root_css at the same time. Signed-off-by: Johannes Weiner Acked-by: Michal Hocko Acked-by: David S. Miller Reviewed-by: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress

2015-12-12T18:15:34Z

Tetsuo Handa has reported that the system might basically livelock in OOM condition without triggering the OOM killer. The issue is caused by internal dependency of the direct reclaim on vmstat counter updates (via zone_reclaimable) which are performed from the workqueue context. If all the current workers get assigned to an allocation request, though, they will be looping inside the allocator trying to reclaim memory but zone_reclaimable can see stalled numbers so it will consider a zone reclaimable even though it has been scanned way too much. WQ concurrency logic will not consider this situation as a congested workqueue because it relies that worker would have to sleep in such a situation. This also means that it doesn't try to spawn new workers or invoke the rescuer thread if the one is assigned to the queue. In order to fix this issue we need to do two things. First we have to let wq concurrency code know that we are in trouble so we have to do a short sleep. In order to prevent from issues handled by 0e093d99763e ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone") we limit the sleep only to worker threads which are the ones of the interest anyway. The second thing to do is to create a dedicated workqueue for vmstat and mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to have a spare worker thread for it. Signed-off-by: Michal Hocko Reported-by: Tetsuo Handa Cc: Tejun Heo Cc: Cristopher Lameter Cc: Joonsoo Kim Cc: Arkadiusz Miskiewicz Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

2015-11-07T01:50:42Z

__GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Acked-by: Michal Hocko Acked-by: Johannes Weiner Cc: Christoph Lameter Cc: David Rientjes Cc: Vitaly Wool Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds