  1. Feb 17, 2022
    • mm/munlock: maintain page->mlock_count while unevictable · 07ca7606
      Hugh Dickins authored
      
      Previous patches have been preparatory: now implement page->mlock_count.
      The ordering of the "Unevictable LRU" is of no significance, and there is
      no point holding unevictable pages on a list: place page->mlock_count to
      overlay page->lru.prev (since page->lru.next is overlaid by compound_head,
      which needs to be even so as not to satisfy PageTail - though 2 could be
      added instead of 1 for each mlock, if that's ever an improvement).
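      
      A minimal sketch of the layout this describes, as placed in struct
      page (simplified from the change to include/linux/mm_types.h):
      
        union {
                struct list_head lru;
                /* Or, for the Unevictable "LRU list" slot */
                struct {
                        /* Always even, to negate PageTail */
                        void *__filler;
                        /* Count page's or folio's mlocks */
                        unsigned int mlock_count;
                };
        };
      
      With compound_head overlaying lru.next (kept even) and mlock_count
      overlaying lru.prev, an mlocked page stays off any real list while
      still carrying its count.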
      
      But it's only safe to rely on or modify page->mlock_count while lruvec
      lock is held and page is on unevictable "LRU" - we can save lots of edits
      by continuing to pretend that there's an imaginary LRU here (there is an
      unevictable count which still needs to be maintained, but not a list).
      
      The mlock_count technique suffers from an unreliability much like with
      page_mlock(): while someone else has the page off LRU, not much can
      be done.  As before, err on the safe side (behave as if mlock_count 0),
      and let try_to_unlock_one() move the page to unevictable if reclaim finds
out later on - a few misplaced pages don't matter; what we want to
avoid is imbalancing reclaim by flooding evictable lists with
unevictable pages.
      
      I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);":
      if we have taken lruvec lock to get the page off its present list, then
      we save everyone trouble (and however many extra atomic ops) by putting
      it on its destination list immediately.
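      
      A hypothetical helper sketching that preferred pattern, using the
      5.17-era folio/lruvec APIs (simplified: real code must also handle
      the PG_lru flag and page references; folio_move_to_unevictable() is
      not an actual kernel function):
      
        static void folio_move_to_unevictable(struct folio *folio)
        {
                struct lruvec *lruvec;
      
                /* one lock acquisition covers the removal... */
                lruvec = folio_lruvec_lock_irq(folio);
                if (folio_test_lru(folio)) {
                        lruvec_del_folio(lruvec, folio);
                        folio_clear_active(folio);
                        folio_set_unevictable(folio);
                        /* ...and the insertion onto the destination list */
                        lruvec_add_folio(lruvec, folio);
                }
                unlock_page_lruvec_irq(lruvec);
        }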
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
  2. Feb 12, 2022
    • mm: memcg: synchronize objcg lists with a dedicated spinlock · 0764db9b
      Roman Gushchin authored
      
      Alexander reported a circular lock dependency revealed by the mmap1 ltp
      test:
      
        LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
                WARNING: possible circular locking dependency detected
                5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted
                ------------------------------------------------------
                mmap1/202299 is trying to acquire lock:
                00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
                but task is already holding lock:
                00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                which lock already depends on the new lock.
                the existing dependency chain (in reverse order) is:
                -> #1 (&sighand->siglock){-.-.}-{2:2}:
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       __lock_task_sighand+0x90/0x190
                       cgroup_freeze_task+0x2e/0x90
                       cgroup_migrate_execute+0x11c/0x608
                       cgroup_update_dfl_csses+0x246/0x270
                       cgroup_subtree_control_write+0x238/0x518
                       kernfs_fop_write_iter+0x13e/0x1e0
                       new_sync_write+0x100/0x190
                       vfs_write+0x22c/0x2d8
                       ksys_write+0x6c/0xf8
                       __do_syscall+0x1da/0x208
                       system_call+0x82/0xb0
                -> #0 (css_set_lock){..-.}-{2:2}:
                       check_prev_add+0xe0/0xed8
                       validate_chain+0x736/0xb20
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       obj_cgroup_release+0x4a/0xe0
                       percpu_ref_put_many.constprop.0+0x150/0x168
                       drain_obj_stock+0x94/0xe8
                       refill_obj_stock+0x94/0x278
                       obj_cgroup_charge+0x164/0x1d8
                       kmem_cache_alloc+0xac/0x528
                       __sigqueue_alloc+0x150/0x308
                       __send_signal+0x260/0x550
                       send_signal+0x7e/0x348
                       force_sig_info_to_task+0x104/0x180
                       force_sig_fault+0x48/0x58
                       __do_pgm_check+0x120/0x1f0
                       pgm_check_handler+0x11e/0x180
                other info that might help us debug this:
                 Possible unsafe locking scenario:
                       CPU0                    CPU1
                       ----                    ----
                  lock(&sighand->siglock);
                                               lock(css_set_lock);
                                               lock(&sighand->siglock);
                  lock(css_set_lock);
                 *** DEADLOCK ***
                2 locks held by mmap1/202299:
                 #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                 #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
                stack backtrace:
                CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1
                Hardware name: IBM 3906 M04 704 (LPAR)
                Call Trace:
                  dump_stack_lvl+0x76/0x98
                  check_noncircular+0x136/0x158
                  check_prev_add+0xe0/0xed8
                  validate_chain+0x736/0xb20
                  __lock_acquire+0x604/0xbd8
                  lock_acquire.part.0+0xe2/0x238
                  lock_acquire+0xb0/0x200
                  _raw_spin_lock_irqsave+0x6a/0xd8
                  obj_cgroup_release+0x4a/0xe0
                  percpu_ref_put_many.constprop.0+0x150/0x168
                  drain_obj_stock+0x94/0xe8
                  refill_obj_stock+0x94/0x278
                  obj_cgroup_charge+0x164/0x1d8
                  kmem_cache_alloc+0xac/0x528
                  __sigqueue_alloc+0x150/0x308
                  __send_signal+0x260/0x550
                  send_signal+0x7e/0x348
                  force_sig_info_to_task+0x104/0x180
                  force_sig_fault+0x48/0x58
                  __do_pgm_check+0x120/0x1f0
                  pgm_check_handler+0x11e/0x180
                INFO: lockdep is turned off.
      
      In this example a slab allocation from __send_signal() caused a
      refill and a drain of a percpu objcg stock, which resulted in the
      release of another, unrelated objcg.  The objcg release path requires
      taking css_set_lock, which is used to synchronize objcg lists.
      
      This creates a circular dependency with the sighand lock, which is
      taken with css_set_lock already held by the freezer code (to freeze
      a task).
      
      In general, using css_set_lock to synchronize objcg lists makes any
      slab allocation or deallocation performed while holding css_set_lock
      (or any lock nested inside it) risky.
      
      To fix the problem and make the code more robust let's stop using
      css_set_lock to synchronize objcg lists and use a new dedicated spinlock
      instead.
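      
      The core of the change, sketched from mm/memcontrol.c (uncharge
      handling elided):
      
        static DEFINE_SPINLOCK(objcg_lock);
      
        static void obj_cgroup_release(struct percpu_ref *ref)
        {
                struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
                unsigned long flags;
      
                /* was: spin_lock_irqsave(&css_set_lock, flags) */
                spin_lock_irqsave(&objcg_lock, flags);
                list_del(&objcg->list);
                spin_unlock_irqrestore(&objcg_lock, flags);
      
                percpu_ref_exit(ref);
                kfree_rcu(objcg, rcu);
        }
      
      memcg_reparent_objcgs() takes the same lock when splicing the objcg
      list, so the list no longer depends on css_set_lock at all.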
      
      Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
      Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
      Tested-by: Jeremy Linton <jeremy.linton@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Jan 06, 2022
    • mm/memcg: Convert slab objcgs from struct page to struct slab · 4b5f8d9a
      Vlastimil Babka authored
      
      page->memcg_data is used with MEMCG_DATA_OBJCGS flag only for slab pages
      so convert all the related infrastructure to struct slab. Also use
      struct folio instead of struct page when resolving object pointers.
      
      This is not just a mechanistic change of types and names.  Now in
      mem_cgroup_from_obj() we use folio_test_slab() to decide whether to
      interpret the folio as a real slab rather than a large kmalloc,
      instead of relying on the MEMCG_DATA_OBJCGS bit that used to be
      checked in page_objcgs_check().  The same applies in
      memcg_slab_free_hook(), where we can encounter kmalloc_large() pages
      (there the folio slab flag check is implied by virt_to_slab()).  As a
      result, page_objcgs_check() can be dropped instead of converted.
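      
      The reshaped mem_cgroup_from_obj() illustrates the new test (a
      simplified excerpt; kernel comments trimmed):
      
        struct mem_cgroup *mem_cgroup_from_obj(void *p)
        {
                struct folio *folio;
      
                if (mem_cgroup_disabled())
                        return NULL;
      
                folio = virt_to_folio(p);
      
                if (folio_test_slab(folio)) {
                        /* a real slab: look up the per-object objcg */
                        struct slab *slab = folio_slab(folio);
                        struct obj_cgroup **objcgs = slab_objcgs(slab);
                        unsigned int off;
      
                        if (!objcgs)
                                return NULL;
      
                        off = obj_to_index(slab->slab_cache, slab, p);
                        return objcgs[off] ? obj_cgroup_memcg(objcgs[off]) : NULL;
                }
      
                /* a large kmalloc or other non-slab page */
                return page_memcg_check(folio_page(folio, 0));
        }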
      
      To avoid include cycles, move the inline definition of slab_objcgs()
      from memcontrol.h to mm/slab.h.
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <cgroups@vger.kernel.org>
    • mm: Convert struct page to struct slab in functions used by other subsystems · 40f3bf0c
      Vlastimil Babka authored
      
      KASAN, KFENCE and memcg interact with SLAB or SLUB internals through
      the functions nearest_obj(), obj_to_index() and objs_per_slab(),
      which take struct page as a parameter.  This patch converts them,
      including all callers, to struct slab, through a coccinelle semantic
      patch.
      
      // Options: --include-headers --no-includes --smpl-spacing include/linux/slab_def.h include/linux/slub_def.h mm/slab.h mm/kasan/*.c mm/kfence/kfence_test.c mm/memcontrol.c mm/slab.c mm/slub.c
      // Note: needs coccinelle 1.1.1 to avoid breaking whitespace
      
      @@
      @@
      
      -objs_per_slab_page(
      +objs_per_slab(
       ...
       )
       { ... }
      
      @@
      @@
      
      -objs_per_slab_page(
      +objs_per_slab(
       ...
       )
      
      @@
      identifier fn =~ "obj_to_index|objs_per_slab";
      @@
      
       fn(...,
      -   const struct page *page
      +   const struct slab *slab
          ,...)
       {
      <...
      (
      - page_address(page)
      + slab_address(slab)
      |
      - page
      + slab
      )
      ...>
       }
      
      @@
      identifier fn =~ "nearest_obj";
      @@
      
       fn(...,
      -   struct page *page
      +   const struct slab *slab
          ,...)
       {
      <...
      (
      - page_address(page)
      + slab_address(slab)
      |
      - page
      + slab
      )
      ...>
       }
      
      @@
      identifier fn =~ "nearest_obj|obj_to_index|objs_per_slab";
      expression E;
      @@
      
       fn(...,
      (
      - slab_page(E)
      + E
      |
      - virt_to_page(E)
      + virt_to_slab(E)
      |
      - virt_to_head_page(E)
      + virt_to_slab(E)
      |
      - page
      + page_slab(page)
      )
        ,...)
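      
      For illustration, the net effect on SLUB's nearest_obj() (a
      simplified sketch of include/linux/slub_def.h, in the same +/-
      notation as the semantic patch above):
      
         static inline void *nearest_obj(struct kmem_cache *cache,
        -                                struct page *page, void *x)
        +                                const struct slab *slab, void *x)
         {
        -        void *object = x - (x - page_address(page)) % cache->size;
        -        void *last_object = page_address(page) +
        -                (page->objects - 1) * cache->size;
        +        void *object = x - (x - slab_address(slab)) % cache->size;
        +        void *last_object = slab_address(slab) +
        +                (slab->objects - 1) * cache->size;
                 void *result = (unlikely(object > last_object)) ? last_object : object;
      
                 result = fixup_red_left(cache, result);
                 return result;
         }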
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Julia Lawall <julia.lawall@inria.fr>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <kasan-dev@googlegroups.com>
      Cc: <cgroups@vger.kernel.org>
  4. Sep 23, 2021
    • memcg: flush lruvec stats in the refault · 1f828223
      Shakeel Butt authored
      
      Prior to commit 7e1c0d6f ("memcg: switch lruvec stats to rstat")
      and commit aa48e47e ("memcg: infrastructure to flush memcg
      stats"), each lruvec memcg stat could be off by (nr_cgroups *
      nr_cpus * 32) at worst, for an unbounded amount of time.  Commit
      aa48e47e moved the lruvec stats to the rstat infrastructure and
      commit 7e1c0d6f bounded the error for all lruvec stats to
      (nr_cpus * 32) at worst, for at most 2 seconds.  More specifically,
      it decoupled the error bound from the number of stats and the number
      of cgroups.
      
      However, this reduction in error comes at the cost of triggering the
      stats-update slowpath more frequently.  Previously, the slowpath
      simply added the stats up the memcg tree.  After aa48e47e, the
      kernel triggers an async lruvec stats flush through queue_work().
      This caused regression reports from the 0day kernel bot [1] as well
      as from the phoronix test suite [2].
      
      We tried two options to fix the regression:
      
       1) Increase the threshold to trigger the slowpath in lruvec stats
          update codepath from 32 to 512.
      
       2) Remove the slowpath from the lruvec stats update codepath and
          instead flush the stats in the page refault codepath. The
          assumption is that the kernel flushes the stats in a timely
          manner, so the update tree seen in the refault codepath stays
          small enough not to cause a performance impact (sketched below).
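      
      A sketch of where option-2 lands in mm/workingset.c (surrounding
      refault accounting elided; names per the 5.15 tree):
      
        void workingset_refault(struct page *page, void *shadow)
        {
                /* ... unpack the shadow entry ... */
      
                /* Flush stats (and potentially sleep) before holding RCU read lock */
                mem_cgroup_flush_stats();
      
                rcu_read_lock();
                /* ... look up the eviction memcg and compute the refault
                 * distance against the now up-to-date lruvec stats ... */
                rcu_read_unlock();
        }
      
      The matching half of the change removes the queue_work()-based async
      flush from the lruvec stats update slowpath.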
      
      Following are the results of will-it-scale/page_fault[1|2|3] benchmark
      on four settings i.e.  (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
      aa48e47e and 7e1c0d6f reverted (3) 5.15-rc1 with option-1
      (4) 5.15-rc1 with option-2.
      
        test       (1)      (2)               (3)               (4)
        pg_f1   368563   406277 (10.23%)   399693  (8.44%)   416398 (12.97%)
        pg_f2   338399   372133  (9.96%)   369180  (9.09%)   381024 (12.59%)
        pg_f3   500853   575399 (14.88%)   570388 (13.88%)   576083 (15.02%)
      
      From the above results, option-2 not only resolves the regression
      but also improves performance, at least for these benchmarks.
      
      Feng Tang (Intel) ran the aim7 benchmark with these two options and
      confirmed that option-1 reduces the regression while option-2
      removes it.
      
      Michael Larabel (phoronix) ran multiple benchmarks with these
      options and reported the results at [3]; for most benchmarks they
      show that option-2 removes the regression introduced by commit
      aa48e47e ("memcg: infrastructure to flush memcg stats").
      
      Based on these experimental results, this patch adopts option-2 as
      the solution to resolve the regression.
      
      Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1]
      Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2]
      Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3]
      Fixes: aa48e47e ("memcg: infrastructure to flush memcg stats")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Tested-by: Michael Larabel <Michael@phoronix.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>