  1. Jul 04, 2021
  2. Jun 05, 2021
    • percpu: rework memcg accounting · faf65dde
      Roman Gushchin authored
      
      The current implementation of the memcg accounting of percpu
      memory is based on the idea of having two separate sets of chunks for
      accounted and non-accounted memory. This approach has the advantage
      of not wasting any extra memory on memcg data for non-accounted
      chunks; however, it complicates the code and leads to a higher number
      of chunks due to lower chunk utilization.
      
      Instead of having two chunk types, it's possible to declare all*
      chunks memcg-aware unless kernel memory accounting is disabled
      globally by a boot option. The size of the objcg array is usually
      small compared to the chunks themselves (it obviously depends on the
      number of CPUs), so even if a chunk has no accounted allocations, the
      memory waste isn't significant and will likely be compensated by
      higher chunk utilization. Also, over time more and more percpu
      allocations will likely become accounted.
      
      * The first chunk is initialized before the memory cgroup subsystem,
        so we don't know for sure whether we need to allocate obj_cgroups.
        Because it's small, let's make it free of charge; then we don't
        need to allocate obj_cgroups for it.
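
      A rough sketch of the idea (simplified and illustrative only: the
      struct layout, helper name and the mem_cgroup_kmem_disabled() check
      stand in for the actual mm/percpu.c code):
      --
      /* Every chunk carries an obj_cgroup array unless kernel memory
       * accounting was disabled at boot. */
      struct pcpu_chunk_sketch {
              struct obj_cgroup **obj_cgroups;   /* one slot per object */
              /* ... offsets, bitmaps, populated pages, ... */
      };

      static int pcpu_chunk_init_objcgs(struct pcpu_chunk_sketch *chunk,
                                        int nr_objects, gfp_t gfp)
      {
              if (mem_cgroup_kmem_disabled())    /* boot-time opt-out */
                      return 0;

              /* small compared to the chunk itself: one pointer per object */
              chunk->obj_cgroups = kcalloc(nr_objects,
                                           sizeof(*chunk->obj_cgroups), gfp);
              return chunk->obj_cgroups ? 0 : -ENOMEM;
      }
      --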
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      faf65dde
  3. Apr 30, 2021
  4. Apr 21, 2021
    • percpu: implement partial chunk depopulation · f1833241
      Roman Gushchin authored
      
      From Roman ("percpu: partial chunk depopulation"):
      In our [Facebook] production experience the percpu memory allocator
      sometimes struggles to return memory to the system. A typical example
      is the creation of several thousand memory cgroups (each holding
      several chunks of percpu data used for vmstats, vmevents,
      ref counters, etc.). Deleting and completely releasing these cgroups
      doesn't always shrink the percpu memory, so sometimes several GBs of
      memory are wasted.
      
      The underlying problem is fragmentation: to release an underlying
      chunk, all percpu allocations in it must be released first. The percpu
      allocator tends to top up chunks to improve utilization, which means
      new small-ish allocations (e.g. percpu ref counters) are placed onto
      almost-filled old-ish chunks, effectively pinning them in memory.
      
      This patchset solves this problem by implementing partial depopulation
      of percpu chunks: chunks with many empty pages are asynchronously
      depopulated and their pages are returned to the system.
      
      To illustrate the problem the following script can be used:
      --
      
      cd /sys/fs/cgroup
      
      mkdir percpu_test
      echo "+memory" > percpu_test/cgroup.subtree_control
      
      cat /proc/meminfo | grep Percpu
      
      for i in `seq 1 1000`; do
          mkdir percpu_test/cg_"${i}"
          for j in `seq 1 10`; do
      	mkdir percpu_test/cg_"${i}"_"${j}"
          done
      done
      
      cat /proc/meminfo | grep Percpu
      
      for i in `seq 1 1000`; do
          for j in `seq 1 10`; do
      	rmdir percpu_test/cg_"${i}"_"${j}"
          done
      done
      
      sleep 10
      
      cat /proc/meminfo | grep Percpu
      
      for i in `seq 1 1000`; do
          rmdir percpu_test/cg_"${i}"
      done
      
      rmdir percpu_test
      --
      
      It creates 11000 memory cgroups and removes 10 out of every 11.
      It prints the initial size of the percpu memory, the size after
      creating all the cgroups, and the size after deleting most of them.
      
      Results:
        vanilla:
          ./percpu_test.sh
          Percpu:             7488 kB
          Percpu:           481152 kB
          Percpu:           481152 kB
      
        with this patchset applied:
          ./percpu_test.sh
          Percpu:             7488 kB
          Percpu:           481408 kB
          Percpu:           135552 kB
      
      The total size of the percpu memory was reduced by more than 3.5 times.
      
      This patch:
      
      This patch implements partial depopulation of percpu chunks.
      
      As of now, a chunk can be depopulated only as part of its final
      destruction, when there are no more outstanding allocations. However,
      to minimize memory waste it might be useful to depopulate a
      partially filled chunk if a small number of outstanding allocations
      prevents the chunk from being fully reclaimed.
      
      This patch implements the following depopulation process: it scans
      over the chunk's pages, looks for ranges of empty, populated pages,
      and performs the depopulation. To avoid races with new allocations,
      the chunk is isolated beforehand. After the depopulation the chunk is
      either sidelined to a special list or freed. New allocations prefer
      active chunks over sidelined ones; if a sidelined chunk is used, it is
      reintegrated into the active lists.
      
      Depopulation is scheduled on the free path if the chunk meets all of
      the following:
        1) it has more than 1/4 of its total pages free and populated
        2) the system has enough free percpu pages aside from this chunk
        3) it isn't the reserved chunk
        4) it isn't the first chunk
      If a chunk has already been depopulated but has regained free
      populated pages, it's a good target too. The chunk is moved to a
      special slot, pcpu_to_depopulate_slot, chunk->isolated is set, and the
      balance work item is scheduled. On isolation, the chunk's empty
      populated pages are removed from pcpu_nr_empty_pop_pages. The chunk is
      moved back to the to_depopulate_slot whenever it meets these
      qualifications again.
      
      pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
      becomes empty. The depopulation is performed in the reverse direction
      to keep populated pages close to the beginning of the chunk.
      Depopulated chunks are sidelined so that new allocations
      preferentially avoid them. When no active chunk can satisfy a new
      allocation, the sidelined chunks are checked before a new chunk is
      created.
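
      A condensed sketch of that reclaim loop (hypothetical helper names;
      not the actual mm/percpu.c implementation):
      --
      static void pcpu_reclaim_populated_sketch(void)
      {
              struct pcpu_chunk *chunk;

              /* drain the isolated chunks until the slot is empty */
              while ((chunk = first_chunk_on_to_depopulate_slot())) {
                      int page;

                      /* walk backwards so the remaining populated pages
                       * stay packed toward the beginning of the chunk */
                      for (page = chunk->nr_pages - 1; page >= 0; page--) {
                              if (page_is_empty_and_populated(chunk, page))
                                      depopulate_page(chunk, page);
                      }

                      if (chunk_has_no_allocations(chunk))
                              free_chunk(chunk);
                      else
                              sideline_chunk(chunk); /* avoided by new allocs */
              }
      }
      --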
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Co-developed-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Tested-by: Pratik Sampat <psampat@linux.ibm.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      f1833241
  5. Aug 12, 2020
    • mm: memcg/percpu: account percpu memory to memory cgroups · 3c7be18a
      Roman Gushchin authored
      Percpu memory is becoming more and more widely used by various
      subsystems, and the total amount of memory controlled by the percpu
      allocator can make up a good part of the total memory.
      
      As an example, bpf maps can consume a lot of percpu memory, and they
      are created by users.  Also, some cgroup internals (e.g.  memory
      controller statistics) can be quite large.  On a machine with many
      CPUs and a large number of cgroups they can consume hundreds of
      megabytes.
      
      So the lack of memcg accounting creates a breach in memory isolation.
      Like slab memory, percpu memory should be accounted by default.
      
      To implement percpu accounting, it's possible to take slab memory
      accounting as a model to follow.  Let's introduce two types of percpu
      chunks: root and memcg.  What makes memcg chunks different is the
      additional space allocated to store memcg membership information.  If
      __GFP_ACCOUNT is passed on allocation, a memcg chunk is used.  If it's
      possible to charge the corresponding size to the target memory cgroup,
      the allocation is performed and the memcg ownership data is recorded.
      System-wide allocations are performed using root chunks, so there is
      no additional memory overhead.
      
      To implement fast reparenting of percpu memory on memcg removal, we
      don't store mem_cgroup pointers directly: instead we use the
      obj_cgroup API, introduced for slab accounting.
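
      As a hedged illustration of the allocation-time decision (the
      allocate_from_chunk_set() helper is made up; the obj_cgroup calls are
      the ones borrowed from slab accounting):
      --
      static void __percpu *pcpu_alloc_sketch(size_t size, size_t align, gfp_t gfp)
      {
              struct obj_cgroup *objcg = NULL;
              bool use_memcg_chunk = !!(gfp & __GFP_ACCOUNT);

              if (use_memcg_chunk) {
                      objcg = get_obj_cgroup_from_current();
                      if (objcg &&
                          obj_cgroup_charge(objcg, gfp, size * num_possible_cpus()))
                              return NULL;  /* charge failed: fail the allocation */
              }

              /* root chunks for system-wide allocations, memcg chunks otherwise */
              return allocate_from_chunk_set(use_memcg_chunk, size, align, gfp, objcg);
      }
      --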
      
      [akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
      [akpm@linux-foundation.org: move unreachable code, per Roman]
      [cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
        Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c7be18a
  6. Jun 05, 2019
  7. Feb 18, 2018
    • percpu: allow select gfp to be passed to underlying allocators · 554fef1c
      Dennis Zhou authored
      
      The prior patch added support for passing gfp flags through to the
      underlying allocators. This patch allows users to pass along gfp flags
      (currently only __GFP_NORETRY and __GFP_NOWARN) to the underlying
      allocators. This should allow users to decide whether they are OK with
      allocations failing and to recover in a more graceful way.
      
      Additionally, in the previous patch gfp passing was done as additional
      flags on top of the defaults. Instead, change this to caller-passed
      semantics. GFP_KERNEL is also removed as the default flag; it
      continues to be used for internally caused underlying percpu
      allocations.
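
      For example, with caller-passed semantics an allocation that should
      fail fast and quietly can be written like this (setup_counters() is
      just an illustrative caller):
      --
      #include <linux/percpu.h>

      static int __percpu *counters;

      static int setup_counters(void)
      {
              counters = __alloc_percpu_gfp(sizeof(int), __alignof__(int),
                                            GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
              return counters ? 0 : -ENOMEM;
      }
      --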
      
      V2:
      Removed gfp_percpu_mask in favor of doing it inline.
      Removed GFP_KERNEL as a default flag for __alloc_percpu_gfp.
      
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      554fef1c
    • percpu: add __GFP_NORETRY semantics to the percpu balancing path · 47504ee0
      Dennis Zhou authored
      Percpu memory using the vmalloc-area-based chunk allocator lazily
      populates chunks by first requesting the full virtual address space
      required for the chunk and subsequently adding pages as allocations
      come through. To ensure atomic allocations can succeed, a workqueue
      item is used to maintain a minimum number of empty pages. In certain
      scenarios, such as the one reported in [1], it is possible that
      physical memory becomes quite scarce, which can result in either a
      rather long time spent trying to find free pages or, worse, a kernel
      panic.
      
      This patch adds support for __GFP_NORETRY and __GFP_NOWARN, passing
      them through to the underlying allocators. This should prevent any
      unnecessary panics potentially caused by the workqueue item. For now,
      gfp flags are passed around as additional flags rather than as a full
      set of flags. The next patch will change these to caller-passed
      semantics.
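
      A sketch of how the balance path uses this (the function shape and the
      exact populate call are illustrative, not verbatim kernel code):
      --
      static int pcpu_balance_populate_sketch(struct pcpu_chunk *chunk,
                                              int page_start, int page_end)
      {
              /* background top-ups may fail fast instead of retrying or warning */
              const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;

              return pcpu_populate_chunk(chunk, page_start, page_end, gfp);
      }
      --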
      
      V2:
      Added const modifier to gfp flags in the balance path.
      Removed an extra whitespace.
      
      [1] https://lkml.org/lkml/2018/2/12/551
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Reported-by: <syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      47504ee0
  8. Nov 16, 2017
    • mm: remove __GFP_COLD · 453f85d4
      Mel Gorman authored
      As the page free path makes no distinction between cache-hot and cold
      pages, there is no real useful ordering of pages in the free list that
      allocation requests can take advantage of.  Judging from the users of
      __GFP_COLD, it is likely that a number of them are the result of
      copying other sites instead of actually measuring the impact.  Remove
      the __GFP_COLD parameter, which simplifies a number of paths in the
      page allocator.
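
      A minimal before/after illustration (grab_page() is a made-up caller):
      --
      static struct page *grab_page(void)
      {
              /* before: callers could hint for a cache-cold page
               *     return alloc_page(GFP_KERNEL | __GFP_COLD);
               * after this patch the hint is simply gone:
               */
              return alloc_page(GFP_KERNEL);
      }
      --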
      
      This is potentially controversial but bear in mind that the size of the
      per-cpu pagelists versus modern cache sizes means that the whole per-cpu
      list can often fit in the L3 cache.  Hence, there is only a potential
      benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
      even worse when THP is taken into account which has little or no chance
      of getting a cache-hot page as the per-cpu list is bypassed and the
      zeroing of multiple pages will thrash the cache anyway.
      
      The truncate microbenchmarks are not shown as this patch affects the
      allocation path and not the free path.  A page fault microbenchmark
      was tested but it showed no significant difference, which is not
      surprising given that the __GFP_COLD branches are a minuscule
      percentage of the fault path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      453f85d4
  9. Jun 29, 2017
  10. Jun 20, 2017
  11. Mar 06, 2017
  12. Sep 02, 2014
    • percpu: move region iterations out of pcpu_[de]populate_chunk() · a93ace48
      Tejun Heo authored
      
      Previously, pcpu_[de]populate_chunk() were called with a range which
      may contain multiple target regions, and pcpu_[de]populate_chunk()
      iterated over those regions.  This has the benefit of batching up
      cache flushes for all the regions; however, we're planning to add more
      bookkeeping logic around [de]population to support atomic allocations,
      and this delegation of the iteration gets in the way.
      
      This patch moves the region iterations out of
      pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and
      pcpu_reclaim() - so that we can later add logic to track more states
      around them.  This change may make cache and tlb flushes more
      frequent, but multi-region [de]populations are rare anyway, and if
      this actually becomes a problem, it's not difficult to factor out the
      cache flushes as separate callbacks which are directly invoked from
      percpu.c.
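
      Roughly, the callers now look like this (the region iterator and the
      pcpu_populate_chunk() argument list are simplified stand-ins):
      --
      static int populate_range_sketch(struct pcpu_chunk *chunk,
                                       int page_start, int page_end)
      {
              int rs, re, ret;

              /* the caller, not pcpu_populate_chunk(), walks the regions */
              for_each_unpopulated_region(chunk, rs, re, page_start, page_end) {
                      ret = pcpu_populate_chunk(chunk, rs, re);
                      if (ret)
                              return ret;
                      /* per-region bookkeeping can now be added right here */
              }
              return 0;
      }
      --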
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a93ace48
    • percpu: move common parts out of pcpu_[de]populate_chunk() · dca49645
      Tejun Heo authored
      
      percpu-vm and percpu-km implement separate versions of
      pcpu_[de]populate_chunk(), and some parts which are, or should be,
      common currently live in the specific implementations.  Make the
      following changes.
      
      * Allocated area clearing is moved from the pcpu_populate_chunk()
        implementations to pcpu_alloc() (see the sketch after this list).
        This makes percpu-km's version a noop.
      
      * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved
        to their respective callers so that they are applied to percpu-km
        too.  This doesn't make any meaningful difference as both functions
        are noops for percpu-km; however, this is more consistent and will
        help in implementing atomic allocation support.
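
      A sketch of the clearing now done by pcpu_alloc() itself, so percpu-km
      needs no copy of it (pcpu_chunk_addr_sketch() is a made-up stand-in
      for computing a CPU's copy of the chunk):
      --
      static void pcpu_clear_area_sketch(struct pcpu_chunk *chunk,
                                         int off, size_t size)
      {
              unsigned int cpu;

              /* zero the freshly allocated area in every CPU's copy */
              for_each_possible_cpu(cpu)
                      memset(pcpu_chunk_addr_sketch(chunk, cpu) + off, 0, size);
      }
      --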
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      dca49645
    • percpu: remove @may_alloc from pcpu_get_pages() · cdb4cba5
      Tejun Heo authored
      
      pcpu_get_pages() creates the temp pages array if not already allocated
      and returns the pointer to it.  As the function is called from both
      [de]population paths and depopulation can only happen after at least
      one successful population, the param doesn't make any difference - the
      allocation will always happen on the population path anyway.
      
      Remove @may_alloc from pcpu_get_pages().  Also, add a lockdep
      assertion on pcpu_alloc_mutex instead of vaguely stating that the
      exclusion is the caller's responsibility.
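
      Sketched, the locking contract becomes explicit (array sizing and the
      nr_temp_page_slots name are illustrative):
      --
      static struct page **pcpu_get_pages_sketch(void)
      {
              static struct page **pages;  /* shared temp array, allocated once */

              lockdep_assert_held(&pcpu_alloc_mutex);  /* caller must hold it */

              if (!pages)
                      pages = pcpu_mem_zalloc(nr_temp_page_slots * sizeof(pages[0]));
              return pages;
      }
      --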
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      cdb4cba5
    • percpu: remove the usage of separate populated bitmap in percpu-vm · fbbb7f4e
      Tejun Heo authored
      
      percpu-vm uses pcpu_get_pages_and_bitmap() to acquire temp pages array
      and populated bitmap and uses the two during [de]population.  The temp
      bitmap is used only to build the new bitmap that is copied to
      chunk->populated after the operation succeeds; however, the new bitmap
      can be trivially set after success without using the temp bitmap.
      
      This patch removes the temp populated bitmap usage from percpu-vm.c.
      
      * pcpu_get_pages_and_bitmap() is renamed to pcpu_get_pages() and no
        longer hands out the temp bitmap.
      
      * The @populated argument is dropped from all the related functions.
        @populated updates in pcpu_[un]map_pages() are dropped.
      
      * Two loops in pcpu_map_pages() are merged.
      
      * pcpu_[de]populate_chunk() modify the chunk->populated bitmap
        directly from @page_start and @page_end after success (see the
        sketch below).
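
      A sketch of that direct bitmap update (simplified; chunk->populated is
      the chunk's page-population bitmap):
      --
      static void pcpu_mark_populated_sketch(struct pcpu_chunk *chunk,
                                             int page_start, int page_end,
                                             bool populated)
      {
              int nr = page_end - page_start;

              if (populated)
                      bitmap_set(chunk->populated, page_start, nr);
              else
                      bitmap_clear(chunk->populated, page_start, nr);
      }
      --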
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      fbbb7f4e
  13. Aug 15, 2014
    • percpu: perform tlb flush after pcpu_map_pages() failure · 849f5169
      Tejun Heo authored
      
      If pcpu_map_pages() fails midway, it unmaps the already mapped pages.
      Currently, it doesn't flush the tlb after the partial unmapping.  This
      may be okay in most cases, as the established mapping hasn't been used
      at that point, but it can go wrong, and when it goes wrong it'd be
      extremely difficult to track down.
      
      Flush tlb after the partial unmapping.
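
      A sketch of the fixed failure path (map_one_page(), unmap_pages() and
      chunk_addr() are simplified stand-ins for the percpu-vm helpers):
      --
      static int pcpu_map_pages_sketch(struct pcpu_chunk *chunk,
                                       struct page **pages,
                                       int page_start, int page_end)
      {
              int i, err;

              for (i = page_start; i < page_end; i++) {
                      err = map_one_page(chunk, pages, i);
                      if (err)
                              goto err_unmap;
              }
              return 0;

      err_unmap:
              /* undo the mappings established so far ... */
              unmap_pages(chunk, page_start, i);
              /* ... and, with this fix, flush the TLB for that range too */
              flush_tlb_kernel_range(chunk_addr(chunk, page_start),
                                     chunk_addr(chunk, i));
              return err;
      }
      --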
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      849f5169
    • percpu: fix pcpu_alloc_pages() failure path · f0d27965
      Tejun Heo authored
      
      When pcpu_alloc_pages() fails midway, pcpu_free_pages() is invoked to
      free what has already been allocated.  The invocation is across the
      whole requested range, and pcpu_free_pages() will try to free all
      non-NULL pages; unfortunately, this is incorrect as
      pcpu_get_pages_and_bitmap(), unlike what its comment suggests, doesn't
      clear the pages array, and thus the array may have entries from
      previous invocations, making the partial failure path free incorrect
      pages.
      
      Fix it by open-coding the partial freeing of the already allocated
      pages.
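
      A sketch of the open-coded partial free (single-array version; the
      real code walks the pages per CPU as well):
      --
      static int pcpu_alloc_pages_sketch(struct page **pages, int nr, gfp_t gfp)
      {
              int i;

              for (i = 0; i < nr; i++) {
                      pages[i] = alloc_page(gfp);
                      if (!pages[i])
                              goto err;
              }
              return 0;

      err:
              /* free exactly what this invocation allocated: pages[0..i-1] */
              while (--i >= 0)
                      __free_page(pages[i]);
              return -ENOMEM;
      }
      --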
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      f0d27965
  14. Jun 20, 2012
  15. Jan 20, 2012
  16. Nov 22, 2011
    • percpu: fix chunk range calculation · a855b84c
      Tejun Heo authored
      
      The percpu allocator recorded the cpus which map to the first and last
      units in pcpu_first/last_unit_cpu respectively and used them to
      determine the address range of a chunk - i.e. it assumed that the
      first unit has the lowest address in a chunk while the last unit has
      the highest.
      
      This simply isn't true.  Groups in a chunk can have arbitrary positive
      or negative offsets from the previous one and there is no guarantee
      that the first unit occupies the lowest offset while the last one the
      highest.
      
      Fix it by actually comparing unit offsets to determine the cpus
      occupying the lowest and highest offsets.  Also, rename
      pcpu_first/last_unit_cpu to pcpu_low/high_unit_cpu to avoid confusion.
      
      The chunk address range is used to flush the cache on vmalloc area
      map/unmap and to decide whether a given address is in the first chunk
      via per_cpu_ptr_to_phys(); the bug was discovered through an invalid
      per_cpu_ptr_to_phys() translation for crash_note.
      
      Kudos to Dave Young for tracking down the problem.
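
      The fix boils down to a comparison loop along these lines (variable
      names follow the patch; treat this as a sketch, not the verbatim
      diff):
      --
      unsigned int cpu;
      unsigned int pcpu_low_unit_cpu = NR_CPUS;
      unsigned int pcpu_high_unit_cpu = NR_CPUS;

      for_each_possible_cpu(cpu) {
              if (pcpu_low_unit_cpu == NR_CPUS ||
                  unit_off[cpu] < unit_off[pcpu_low_unit_cpu])
                      pcpu_low_unit_cpu = cpu;
              if (pcpu_high_unit_cpu == NR_CPUS ||
                  unit_off[cpu] > unit_off[pcpu_high_unit_cpu])
                      pcpu_high_unit_cpu = cpu;
      }
      --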
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: WANG Cong <xiyou.wangcong@gmail.com>
      Reported-by: Dave Young <dyoung@redhat.com>
      Tested-by: Dave Young <dyoung@redhat.com>
      LKML-Reference: <4EC21F67.10905@redhat.com>
      Cc: stable@kernel.org
      a855b84c
    • percpu: rename pcpu_mem_alloc to pcpu_mem_zalloc · 90459ce0
      Bob Liu authored
      
      Currently pcpu_mem_alloc() is implemented to always return zeroed
      memory.  So rename it to pcpu_mem_zalloc() so that users like
      pcpu_get_pages_and_bitmap() know they don't need to reinitialize the
      returned memory.
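
      For reference, the helper's behaviour (always-zeroed memory, kzalloc()
      for small sizes, vzalloc() for large ones) can be sketched as:
      --
      static void *pcpu_mem_zalloc_sketch(size_t size)
      {
              if (size <= PAGE_SIZE)
                      return kzalloc(size, GFP_KERNEL);
              return vzalloc(size);
      }
      --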
      
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      90459ce0
  17. Jan 14, 2011
  18. May 01, 2010
    • percpu: move vmalloc based chunk management into percpu-vm.c · 9f645532
      Tejun Heo authored
      
      Separate out and move the chunk management (creation/destruction and
      [de]population) code into percpu-vm.c, which is included by percpu.c
      and compiled together with it.  The interface for chunk management is
      defined as follows.
      
       * pcpu_populate_chunk		- populate the specified range of a chunk
       * pcpu_depopulate_chunk	- depopulate the specified range of a chunk
       * pcpu_create_chunk		- create a new chunk
       * pcpu_destroy_chunk		- destroy a chunk, always preceded by full depop
       * pcpu_addr_to_page		- translate an address to its page
       * pcpu_verify_alloc_info	- check alloc_info is acceptable during init
      
      Other than wrapping vmalloc_to_page() inside pcpu_addr_to_page() and a
      dummy pcpu_verify_alloc_info() implementation, this patch only moves
      code around.  This separation is to allow an alternate chunk
      management implementation.
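
      For instance, the percpu-vm version of the address translation hook is
      essentially a thin wrapper (sketch):
      --
      static struct page *pcpu_addr_to_page(void *addr)
      {
              return vmalloc_to_page(addr);
      }
      --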
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Howells <dhowells@redhat.com>
      Cc: Graff Yang <graff.yang@gmail.com>
      Cc: Sonic Zhang <sonic.adi@gmail.com>
      9f645532