  Aug 12, 2020
    • mm/compaction: correct the comments of compact_defer_shift · 860b3272
      Alex Shi authored
      
      There is no compact_defer_limit; compact_defer_shift is what is
      actually used.  Correct the comment accordingly, and add an explanation
      of compact_order_failed.
      
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use unsigned types for fragmentation score · d34c0a75
      Nitin Gupta authored
      
      Proactive compaction uses a per-node/zone "fragmentation score" which
      is always in the range [0, 100], so use an unsigned type for these
      scores as well as for the related constants.
      
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix compile error due to COMPACTION_HPAGE_ORDER · 25788738
      Nitin Gupta authored
      
      Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
      HUGETLB_PAGE_ORDER.  The correct way to check if this constant is defined
      is to check for CONFIG_HUGETLBFS.
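
      For reference, a sketch of the guard this fix implies (the exact
      preprocessor block in mm/compaction.c may differ in ordering and
      fallbacks):

        #if defined CONFIG_TRANSPARENT_HUGEPAGE
        #define COMPACTION_HPAGE_ORDER  HPAGE_PMD_ORDER
        #elif defined CONFIG_HUGETLBFS
        /* HUGETLB_PAGE_ORDER is only available when hugetlbfs is built in */
        #define COMPACTION_HPAGE_ORDER  HUGETLB_PAGE_ORDER
        #else
        #define COMPACTION_HPAGE_ORDER  (PMD_SHIFT - PAGE_SHIFT)
        #endif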
      
      Reported-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200623064544.25766-1-nigupta@nvidia.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: proactive compaction · facdaa91
      Nitin Gupta authored
      For some applications, we need to allocate almost all memory as
      hugepages.  However, on a running system, higher-order allocations can
      fail if the memory is fragmented.  The Linux kernel currently does
      on-demand compaction as we request more hugepages, but this style of
      compaction incurs very high latency.  Experiments with one-time full
      memory compaction (followed by hugepage allocations) show that the
      kernel is able to restore a highly fragmented memory state to a fairly
      compacted one within <1 sec for a 32G system.  Such data suggests that
      a more proactive compaction can help us allocate a large fraction of
      memory as hugepages while keeping allocation latencies low.
      
      For a more proactive compaction, the approach taken here is to define a
      new sysctl called 'vm.compaction_proactiveness', which dictates the
      bounds of external fragmentation that kcompactd tries to maintain.
      
      The tunable takes a value in range [0, 100], with a default of 20.
      
      Note that a previous version of this patch [1] was found to introduce too
      many tunables (per-order extfrag{low, high}), but this one reduces them to
      just one sysctl.  Also, the new tunable is an opaque value instead of
      asking for specific bounds of "external fragmentation", which would have
      been difficult to estimate.  The internal interpretation of this opaque
      value allows for future fine-tuning.
      
      Currently, we use a simple translation from this tunable to [low, high]
      "fragmentation score" thresholds (low = 100 - proactiveness, high =
      low + 10 percentage points).  The score for a node is defined as the
      weighted mean of the per-zone external fragmentation, where a zone's
      present_pages determines its weight.
      
      To periodically check per-node scores, we reuse the per-node kcompactd
      threads, which are woken up every 500 milliseconds to perform this
      check.  If a node's score exceeds its high threshold (as derived from
      the user-provided proactiveness value), proactive compaction is started
      and continues until the score drops back to the low threshold value.
      By default, proactiveness is set to 20, which implies threshold values
      of low=80 and high=90.
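
      To make the arithmetic concrete, here is a minimal userspace C sketch
      of the threshold derivation and the per-node decision described above
      (struct zone_stat and the sample numbers are illustrative stand-ins,
      not kernel API):

        #include <stdio.h>

        struct zone_stat {
                unsigned long present_pages;
                unsigned int extfrag;   /* external fragmentation, 0..100 */
        };

        /* low = 100 - proactiveness, high = low + 10, capped at 100 */
        static unsigned int wmark_low(unsigned int proactiveness)
        {
                return 100U - proactiveness;
        }

        static unsigned int wmark_high(unsigned int proactiveness)
        {
                unsigned int high = wmark_low(proactiveness) + 10;
                return high > 100U ? 100U : high;
        }

        /* Node score: per-zone extfrag weighted by present_pages. */
        static unsigned int node_score(const struct zone_stat *z, int nr)
        {
                unsigned long long score = 0, pages = 0;
                for (int i = 0; i < nr; i++) {
                        score += (unsigned long long)z[i].present_pages * z[i].extfrag;
                        pages += z[i].present_pages;
                }
                return pages ? (unsigned int)(score / pages) : 0;
        }

        int main(void)
        {
                struct zone_stat zones[] = {
                        { 1UL << 18, 95 }, { 1UL << 22, 88 }, { 1UL << 20, 40 },
                };
                unsigned int proactiveness = 20;        /* default */
                unsigned int score = node_score(zones, 3);

                printf("score=%u low=%u high=%u -> %s\n", score,
                       wmark_low(proactiveness), wmark_high(proactiveness),
                       score > wmark_high(proactiveness) ?
                               "start proactive compaction" : "leave node alone");
                return 0;
        }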
      
      This patch is largely based on ideas from Michal Hocko [2].  See also the
      LWN article [3].
      
      Performance data
      ================
      
      System: x86_64, 1T RAM, 80 CPU threads.
      Kernel: 5.6.0-rc3 + this patch
      
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
      
      Before starting the driver, the system was fragmented from a userspace
      program that allocates all memory and then for each 2M aligned section,
      frees 3/4 of base pages using munmap.  The workload is mainly anonymous
      userspace pages, which are easy to move around.  I intentionally avoided
      unmovable pages in this test to see how much latency we incur when
      hugepage allocations hit direct compaction.
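
      A minimal sketch of such a fragmenter (sizes and the choice of freeing
      the first 3/4 of each section are illustrative; the original program is
      not part of the patch):

        #include <sys/mman.h>
        #include <unistd.h>

        #define HPAGE_SIZE (2UL << 20)          /* 2M section */
        #define ALLOC_SIZE (64UL << 30)         /* tune to ~all free memory */

        int main(void)
        {
                /* MAP_POPULATE faults the whole range in up front */
                char *p = mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
                if (p == MAP_FAILED)
                        return 1;

                /* Round up to the next 2M boundary */
                char *q = (char *)(((unsigned long)p + HPAGE_SIZE - 1) &
                                   ~(HPAGE_SIZE - 1));

                /*
                 * For each 2M-aligned section, unmap the first 3/4 of its
                 * base pages; the mapped tail keeps the section from being
                 * reassembled into a hugepage without compaction.
                 */
                while (q + HPAGE_SIZE <= p + ALLOC_SIZE) {
                        munmap(q, HPAGE_SIZE / 4 * 3);
                        q += HPAGE_SIZE;
                }

                pause();        /* keep the remaining pages resident */
                return 0;
        }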
      
      1. Kernel hugepage allocation latencies
      
      With the system in such a fragmented state, a kernel driver then allocates
      as many hugepages as possible and measures allocation latency:
      
      (all latency values are in microseconds)
      
      - With vanilla 5.6.0-rc3
      
        percentile latency
        –––––––––– –––––––
      	   5    7894
      	  10    9496
      	  25   12561
      	  30   15295
      	  40   18244
      	  50   21229
      	  60   27556
      	  75   30147
      	  80   31047
      	  90   32859
      	  95   33799
      
      Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      sysctl -w vm.compaction_proactiveness=20
      
        percentile latency
        –––––––––– –––––––
      	   5       2
      	  10       2
      	  25       3
      	  30       3
      	  40       3
      	  50       4
      	  60       4
      	  75       4
      	  80       4
      	  90       5
      	  95     429
      
      Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      2. Java heap allocation
      
      In this test, we first fragment memory using the same method as for (1).
      
      Then, we start a Java process with a heap size set to 700G and request the
      heap to be allocated with THP hugepages.  We also set THP to madvise to
      allow hugepage backing of this heap.
      
      /usr/bin/time
       java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
      
      The above command allocates 700G of Java heap using hugepages.
      
      - With vanilla 5.6.0-rc3
      
      17.39user 1666.48system 27:37.89elapsed
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      8.35user 194.58system 3:19.62elapsed
      
      Elapsed time remains around 3:15 as proactiveness is increased further.

      Note that proactive compaction happens throughout the runtime of these
      workloads.  A one-time compaction sufficient to supply hugepages for the
      subsequent allocation stream can probably be achieved with more extreme
      proactiveness values, like 80 or 90.
      
      In the above Java workload, proactiveness is set to 20.  The test starts
      with a node's score of 80 or higher, depending on the delay between the
      fragmentation step and starting the benchmark, which gives more or less
      time for the initial round of compaction.  As the benchmark consumes
      hugepages, the node's score quickly rises above the high threshold (90)
      and proactive compaction starts again, which brings the score down to
      the low threshold level (80).  Repeat.
      
      bpftrace also confirms proactive compaction running 20+ times during
      the runtime of this Java benchmark.  kcompactd threads consume 100% of
      one of the CPUs while they try to bring a node's score within
      thresholds.
      
      Backoff behavior
      ================
      
      The above workloads produce a memory state which is easy to compact.
      However, if memory is filled with unmovable pages, proactive compaction
      should essentially back off.  To test this aspect:

      - Created a kernel driver that allocates almost all memory as hugepages
        followed by freeing the first 3/4 of each hugepage.
      - Set proactiveness=40
      - Noted that proactive_compact_node() is deferred the maximum number of
        times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
        (=> ~30 seconds between retries); see the sketch below.
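
      A hedged sketch of that backoff shape, as seen from the kcompactd loop
      (names follow the patch approximately; this is not a verbatim excerpt):

        /* woken every HPAGE_FRAG_CHECK_INTERVAL_MSEC inside kcompactd */
        if (should_proactive_compact_node(pgdat)) {
                unsigned int prev_score, score;

                if (proactive_defer) {
                        proactive_defer--;
                        continue;
                }
                prev_score = fragmentation_score_node(pgdat);
                proactive_compact_node(pgdat);
                score = fragmentation_score_node(pgdat);
                /*
                 * Back off if the run made no progress, e.g. because
                 * memory is filled with unmovable pages.
                 */
                if (score >= prev_score)
                        proactive_defer = COMPACT_MAX_DEFER_SHIFT;
        }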
      
      [1] https://patchwork.kernel.org/patch/11098289/
      [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
      [3] https://lwn.net/Articles/817905/
      
      
      
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nitin Gupta <ngupta@nitingupta.dev>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Jun 26, 2020
    • mm, compaction: make capture control handling safe wrt interrupts · b9e20f0d
      Vlastimil Babka authored
      Hugh reports:
      
       "While stressing compaction, one run oopsed on NULL capc->cc in
        __free_one_page()'s task_capc(zone): compact_zone_order() had been
        interrupted, and a page was being freed in the return from interrupt.
      
        Though you would not expect it from the source, both gccs I was using
        (4.8.1 and 7.5.0) had chosen to compile compact_zone_order() with the
        ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before
        callq compact_zone - long after the "current->capture_control =
        &capc". An interrupt in between those finds capc->cc NULL (zeroed by
        an earlier rep stos).
      
        This could presumably be fixed by a barrier() before setting
        current->capture_control in compact_zone_order(); but would also need
        more care on return from compact_zone(), in order not to risk leaking
        a page captured by interrupt just before capture_control is reset.
      
        Maybe that is the preferable fix, but I felt safer for task_capc() to
        exclude the rather surprising possibility of capture at interrupt
        time"
      
      I have checked that gcc10 also behaves the same.
      
      The advantage of the fix in compact_zone_order() is that we don't add
      another test to the page-freeing hot path, and that it might prevent
      future problems if we stop exposing pointers to uninitialized structures
      in the current task.

      So this patch implements the suggestion for compact_zone_order() with
      barrier() (and WRITE_ONCE() to prevent store tearing) for setting
      current->capture_control, and prevents the page leak with
      WRITE_ONCE/READ_ONCE in the proper order.
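
      The resulting ordering in compact_zone_order() looks roughly like this
      (a sketch trimmed to the relevant lines):

        /*
         * Make sure capc is fully initialised before it is made visible to
         * an interrupt handler via current->capture_control.
         */
        barrier();
        WRITE_ONCE(current->capture_control, &capc);

        ret = compact_zone(&cc, &capc);

        /*
         * Hide the capture control before reading the captured page, so an
         * interrupt cannot capture (and leak) a page in between.
         */
        WRITE_ONCE(current->capture_control, NULL);
        *capture = READ_ONCE(capc.page);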
      
      Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
      
      
      Fixes: 5e1f0f09 ("mm, compaction: capture a page under direct compaction")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Hugh Dickins <hughd@google.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Li Wang <liwang@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[5.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  May 28, 2020
    • mm/swap: Use local_lock for protection · b01b2141
      Ingo Molnar authored
      
      The various struct pagevec per CPU variables are protected by disabling
      either preemption or interrupts across the critical sections. Inside
      these sections spinlocks have to be acquired.
      
      These spinlocks are regular spinlock_t types which are converted to
      "sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
      locks cannot be acquired in preemption or interrupt disabled sections.
      
      Local locks provide a trivial way to substitute preempt- and
      interrupt-disable instances.  On a non-PREEMPT_RT kernel, local_lock()
      maps to preempt_disable() and local_lock_irq() to local_irq_disable().
      
      Create lru_rotate_pvecs containing the pagevec and the locallock.
      Create lru_pvecs containing the remaining pagevecs and the locallock.
      Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
      exporting the pvec structure.
      
      Change the relevant call sites to acquire these locks instead of using
      preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
      local_irq_save().
      
      There is neither a functional change nor a change in the generated
      binary code for non PREEMPT_RT enabled non-debug kernels.
      
      When lockdep is enabled, local locks have lockdep maps embedded.  These
      allow lockdep to validate the protections, i.e. inappropriate usage of
      preemption-only protected sections would result in a lockdep warning,
      while the same problem would not be noticed with a plain
      preempt_disable() based protection.

      Local locks also improve readability, as they provide a named scope for
      the protections, while preempt/interrupt disable are opaque and
      scopeless.
      
      Finally local locks allow PREEMPT_RT to substitute them with real
      locking primitives to ensure the correctness of operation in a fully
      preemptible kernel.
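
      A hedged sketch of the pattern (field names and function bodies
      abbreviated; shown for the lru_add pagevec only):

        struct lru_pvecs {
                local_lock_t lock;
                struct pagevec lru_add;
                /* ... remaining pagevecs ... */
        };
        static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
                .lock = INIT_LOCAL_LOCK(lock),
        };

        void lru_cache_add(struct page *page)
        {
                struct pagevec *pvec;

                /* was: preempt_disable() / get_cpu_var() */
                local_lock(&lru_pvecs.lock);
                pvec = this_cpu_ptr(&lru_pvecs.lru_add);
                /* ... add page to pvec, drain when full ... */
                local_unlock(&lru_pvecs.lock);
        }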
      
      [ bigeasy: Adopted to use local_lock ]
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20200527201119.1692513-4-bigeasy@linutronix.de
  Aug 03, 2019
    • mm: compaction: avoid 100% CPU usage during compaction when a task is killed · 670105a2
      Mel Gorman authored
      "howaboutsynergy" reported via kernel buzilla number 204165 that
      compact_zone_order was consuming 100% CPU during a stress test for
      prolonged periods of time.  Specifically the following command, which
      should exit in 10 seconds, was taking an excessive time to finish while
      the CPU was pegged at 100%.
      
        stress -m 220 --vm-bytes 1000000000 --timeout 10
      
      Tracing indicated a pattern as follows
      
                stress-3923  [007]   519.106208: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
                stress-3923  [007]   519.106212: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
                stress-3923  [007]   519.106216: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
                stress-3923  [007]   519.106219: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
                stress-3923  [007]   519.106223: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
                stress-3923  [007]   519.106227: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
                stress-3923  [007]   519.106231: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
                stress-3923  [007]   519.106235: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
                stress-3923  [007]   519.106238: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
                stress-3923  [007]   519.106242: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
      
      Note that compaction is entered in rapid succession while scanning and
      isolating nothing.  The problem is that when a task that is compacting
      receives a fatal signal, it retries indefinitely instead of exiting,
      making no progress while the fatal signal is pending.
      
      It's not easy to trigger this condition although enabling zswap helps on
      the basis that the timing is altered.  A very small window has to be hit
      for the problem to occur (signal delivered while compacting and
      isolating a PFN for migration that is not aligned to SWAP_CLUSTER_MAX).
      
      This was reproduced locally (16G single socket system, 8G swap, 30%
      zswap configured, vm-bytes 22000000000, using Colin King's stress-ng
      implementation from github, running in a loop until the problem hit).
      Tracing recorded the problem occurring almost 200K times in a short
      window.  With this patch, the problem hit 4 times but the task exited
      normally instead of consuming CPU.
      
      This problem has existed for some time but it was made worse by commit
      cf66f070 ("mm, compaction: do not consider a need to reschedule as
      contention").  Before that commit, if the same condition was hit then
      locks would be quickly contended and compaction would exit that way.
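
      The shape of the fix (a sketch; the actual change in the isolation
      path may differ in detail):

        /*
         * A task being killed should abort the scan promptly instead of
         * re-entering it with no possibility of progress.
         */
        if (fatal_signal_pending(current)) {
                cc->contended = true;
                return 0;       /* caller sees no progress and gives up */
        }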
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204165
      Link: http://lkml.kernel.org/r/20190718085708.GE24383@techsingularity.net
      
      
      Fixes: cf66f070 ("mm, compaction: do not consider a need to reschedule as contention")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[5.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Jun 01, 2019
    • mm, compaction: make sure we isolate a valid PFN · e577c8b6
      Suzuki K Poulose authored
      When we have holes in a normal memory zone, we could end up with
      cached_migrate_pfns which may not necessarily be valid, under heavy
      memory pressure with swapping enabled (via __reset_isolation_suitable(),
      triggered by kswapd).

      Later, if we fail to find a page via fast_isolate_freepages(), we may
      end up using the migrate_pfn we started the search with as a valid
      page.  This could lead to a NULL pointer dereference like the one
      below, due to an invalid mem_section pointer.
      
      Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [47/1825]
       Mem abort info:
         ESR = 0x96000004
         Exception class = DABT (current EL), IL = 32 bits
         SET = 0, FnV = 0
         EA = 0, S1PTW = 0
       Data abort info:
         ISV = 0, ISS = 0x00000004
         CM = 0, WnR = 0
       user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9
       [0000000000000008] pgd=0000000000000000
       Internal error: Oops: 96000004 [#1] SMP
       ...
       CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
       Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018
       pstate: 60000005 (nZCv daif -PAN -UAO)
       pc : set_pfnblock_flags_mask+0x58/0xe8
       lr : compaction_alloc+0x300/0x950
       [...]
       Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5)
       Call trace:
        set_pfnblock_flags_mask+0x58/0xe8
        compaction_alloc+0x300/0x950
        migrate_pages+0x1a4/0xbb0
        compact_zone+0x750/0xde8
        compact_zone_order+0xd8/0x118
        try_to_compact_pages+0xb4/0x290
        __alloc_pages_direct_compact+0x84/0x1e0
        __alloc_pages_nodemask+0x5e0/0xe18
        alloc_pages_vma+0x1cc/0x210
        do_huge_pmd_anonymous_page+0x108/0x7c8
        __handle_mm_fault+0xdd4/0x1190
        handle_mm_fault+0x114/0x1c0
        __get_user_pages+0x198/0x3c0
        get_user_pages_unlocked+0xb4/0x1d8
        __gfn_to_pfn_memslot+0x12c/0x3b8
        gfn_to_pfn_prot+0x4c/0x60
        kvm_handle_guest_abort+0x4b0/0xcd8
        handle_exit+0x140/0x1b8
        kvm_arch_vcpu_ioctl_run+0x260/0x768
        kvm_vcpu_ioctl+0x490/0x898
        do_vfs_ioctl+0xc4/0x898
        ksys_ioctl+0x8c/0xa0
        __arm64_sys_ioctl+0x28/0x38
        el0_svc_common+0x74/0x118
        el0_svc_handler+0x38/0x78
        el0_svc+0x8/0xc
       Code: f8607840 f100001f 8b011401 9a801020 (f9400400)
       ---[ end trace af6a35219325a9b6 ]---
      
      The issue was reported on an arm64 server with 128GB with holes in the
      zone (e.g., [32GB@4GB, 96GB@544GB]), with a swap device enabled, while
      running 100 KVM guest instances.

      This patch fixes the issue by ensuring that the page belongs to a valid
      PFN when we fall back to using the lower limit of the scan range upon
      failure in fast_isolate_freepages().
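
      A sketch of the fix's shape in fast_isolate_freepages(): instead of
      trusting the cached PFN, validate it through the pageblock helper,
      which returns NULL for holes (simplified from the actual change):

        /* Fall back to the lower limit of the scan range, but only if
         * the PFN maps to a valid page within this zone. */
        if (!page && min_pfn != ULONG_MAX)
                page = pageblock_pfn_to_page(min_pfn,
                                             pageblock_end_pfn(min_pfn),
                                             cc->zone);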
      
      Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com
      
      
      Fixes: 5a811889 ("mm, compaction: use free lists to quickly locate a migration target")
      Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Reported-by: Marc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  May 18, 2019
    • mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock · 60fce36a
      Mel Gorman authored
      syzbot reported the following error from a tree with a head commit of
      baf76f0c ("slip: make slhc_free() silently accept an error pointer")
      
        BUG: unable to handle kernel paging request at ffffea0003348000
        #PF error: [normal kernel read fault]
        PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP KASAN
        CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
        RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
        RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
        Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
        ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 <4d> 8b 2c 24 31 ff 49
        c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
        RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
        RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
        RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
        RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
        R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
        FS:  00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
        Call Trace:
         fast_isolate_around mm/compaction.c:1243 [inline]
         fast_isolate_freepages mm/compaction.c:1418 [inline]
         isolate_freepages mm/compaction.c:1438 [inline]
         compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550
      
      There is no reproducer and it is difficult to hit -- 1 crash every few
      days.  The issue is very similar to the fix in commit 6b0868c8
      ("mm/compaction.c: correct zone boundary handling when resetting pageblock
      skip hints").  When isolating free pages around a target pageblock, the
      boundary handling is off by one and can stray into the next pageblock.
      Triggering the syzbot error requires that the end of pageblock is section
      or zone aligned, and that the next section is unpopulated.
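
      The fix clamps the window scanned around the target so it cannot run
      past either the pageblock or the zone; a sketch of that clamping
      (variable names illustrative):

        /* Stay inside the intersection of the pageblock and the zone. */
        unsigned long start_pfn = max(pageblock_start_pfn(pfn),
                                      cc->zone->zone_start_pfn);
        unsigned long end_pfn = min(pageblock_end_pfn(pfn),
                                    zone_end_pfn(cc->zone));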
      
      A more subtle consequence of the bug is that pageblocks were being
      improperly used as migration targets, which potentially hurts
      fragmentation avoidance in the long term, one page at a time.
      
      A debugging patch revealed that it is definitely possible to stray
      outside of a pageblock, which is not intended.  While syzbot cannot be
      used to verify this patch, it was confirmed that the debugging warning
      no longer triggers with this patch applied.  It has also been confirmed
      that the THP allocation stress tests are not degraded by this patch.
      
      Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
      
      
      Fixes: e332f741 ("mm, compaction: be selective about what pageblocks to clear skip hints")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: <syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org> # v5.1+
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Apr 04, 2019
    • mm/compaction.c: abort search if isolation fails · 5b56d996
      Qian Cai authored
      Running LTP oom01 in a tight loop, or memory stress testing that puts
      the system in a low-memory situation, can trigger random memory
      corruption like the page flag corruption below.  In
      fast_isolate_freepages(), if isolation fails, next_search_order() does
      not abort the search immediately, which can lead to improper accesses.
      
      UBSAN: Undefined behaviour in ./include/linux/mm.h:1195:50
      index 7 is out of range for type 'zone [5]'
      Call Trace:
       dump_stack+0x62/0x9a
       ubsan_epilogue+0xd/0x7f
       __ubsan_handle_out_of_bounds+0x14d/0x192
       __isolate_free_page+0x52c/0x600
       compaction_alloc+0x886/0x25f0
       unmap_and_move+0x37/0x1e70
       migrate_pages+0x2ca/0xb20
       compact_zone+0x19cb/0x3620
       kcompactd_do_work+0x2df/0x680
       kcompactd+0x1d8/0x6c0
       kthread+0x32c/0x3f0
       ret_from_fork+0x35/0x40
      ------------[ cut here ]------------
      kernel BUG at mm/page_alloc.c:3124!
      invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      RIP: 0010:__isolate_free_page+0x464/0x600
      RSP: 0000:ffff888b9e1af848 EFLAGS: 00010007
      RAX: 0000000030000000 RBX: ffff888c39fcf0f8 RCX: 0000000000000000
      RDX: 1ffff111873f9e25 RSI: 0000000000000004 RDI: ffffed1173c35ef6
      RBP: ffff888b9e1af898 R08: fffffbfff4fc2461 R09: fffffbfff4fc2460
      R10: fffffbfff4fc2460 R11: ffffffffa7e12303 R12: 0000000000000008
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000007
      FS:  0000000000000000(0000) GS:ffff888ba8e80000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fc7abc00000 CR3: 0000000752416004 CR4: 00000000001606a0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       compaction_alloc+0x886/0x25f0
       unmap_and_move+0x37/0x1e70
       migrate_pages+0x2ca/0xb20
       compact_zone+0x19cb/0x3620
       kcompactd_do_work+0x2df/0x680
       kcompactd+0x1d8/0x6c0
       kthread+0x32c/0x3f0
       ret_from_fork+0x35/0x40
      
      Link: http://lkml.kernel.org/r/20190320192648.52499-1-cai@lca.pw
      
      
      Fixes: dbe2d4e4 ("mm, compaction: round-robin the order while searching the free lists for a target")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    • mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints · 6b0868c8
      Mel Gorman authored
      Mikhail Gavrilov reported the following bug being triggered in a Fedora
      kernel based on 5.1-rc1, but it is relevant to a vanilla kernel.
      
       kernel: page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       kernel: ------------[ cut here ]------------
       kernel: kernel BUG at include/linux/mm.h:1021!
       kernel: invalid opcode: 0000 [#1] SMP NOPTI
       kernel: CPU: 6 PID: 116 Comm: kswapd0 Tainted: G         C        5.1.0-0.rc1.git1.3.fc31.x86_64 #1
       kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1201 12/07/2018
       kernel: RIP: 0010:__reset_isolation_pfn+0x244/0x2b0
       kernel: Code: fe 06 e8 0f 8e fc ff 44 0f b6 4c 24 04 48 85 c0 0f 85 dc fe ff ff e9 68 fe ff ff 48 c7 c6 58 b7 2e 8c 4c 89 ff e8 0c 75 00 00 <0f> 0b 48 c7 c6 58 b7 2e 8c e8 fe 74 00 00 0f 0b 48 89 fa 41 b8 01
       kernel: RSP: 0018:ffff9e2d03f0fde8 EFLAGS: 00010246
       kernel: RAX: 0000000000000034 RBX: 000000000081f380 RCX: ffff8cffbddd6c20
       kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8cffbddd6c20
       kernel: RBP: 0000000000000001 R08: 0000009898b94613 R09: 0000000000000000
       kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000100000
       kernel: R13: 0000000000100000 R14: 0000000000000001 R15: ffffca7de07ce000
       kernel: FS:  0000000000000000(0000) GS:ffff8cffbdc00000(0000) knlGS:0000000000000000
       kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       kernel: CR2: 00007fc1670e9000 CR3: 00000007f5276000 CR4: 00000000003406e0
       kernel: Call Trace:
       kernel:  __reset_isolation_suitable+0x62/0x120
       kernel:  reset_isolation_suitable+0x3b/0x40
       kernel:  kswapd+0x147/0x540
       kernel:  ? finish_wait+0x90/0x90
       kernel:  kthread+0x108/0x140
       kernel:  ? balance_pgdat+0x560/0x560
       kernel:  ? kthread_park+0x90/0x90
       kernel:  ret_from_fork+0x27/0x50
      
      He bisected it down to e332f741 ("mm, compaction: be selective about
      what pageblocks to clear skip hints").  The problem is that the patch in
      question was sloppy with respect to the handling of zone boundaries.  In
      some instances, it was possible for PFNs outside of a zone to be examined
      and if those were not properly initialised or poisoned then it would
      trigger the VM_BUG_ON.  This patch corrects the zone boundary issues when
      resetting pageblock skip hints and Mikhail reported that the bug did not
      trigger after 30 hours of testing.
      
      Link: http://lkml.kernel.org/r/20190327085424.GL3189@techsingularity.net
      
      
      Fixes: e332f741 ("mm, compaction: be selective about what pageblocks to clear skip hints")
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>