  1. Aug 17, 2018
  2. Jun 08, 2018
  3. May 26, 2018
    • mm/memory_hotplug: fix leftover use of struct page during hotplug · a2155861
      Jonathan Cameron authored
      The case of a new NUMA node was missed when we stopped using the node
      info from struct page during hotplug.  In this path
      register_mem_sect_under_node (which lets the caller say that this is
      hotplug, so the nid check is skipped) is reached via link_mem_sections,
      which unfortunately has no such option.
      
      Fix is to pass check_nid through link_mem_sections as well and disable
      it in the new numa node path.
      
      Note the bug only 'sometimes' manifests depending on what happens to be
      in the struct page structures - there are lots of them and it only needs
      to match one of them.
      
      The result of the bug is that (with a new memory only node) we never
      successfully call register_mem_sect_under_node so don't get the memory
      associated with the node in sysfs and meminfo for the node doesn't
      report it.
      
      It came up whilst testing some arm64 hotplug patches, but appears to be
      universal.  Whilst I'm triggering it by removing then reinserting memory
      to a node with no other elements (thus making the node disappear then
      appear again), it appears it would happen on hotplugging memory where
      there was none before, and it doesn't seem to be related to the arm64
      patches.
      
      These patches call __add_pages (where most of the issue was fixed by
      Pavel's patch).  If there is a node at the time of the __add_pages call
      then all is well as it calls register_mem_sect_under_node from there
      with check_nid set to false.  Without a node that function returns
      having not done the sysfs related stuff as there is no node to use.
      This is expected but it is the resulting path that fails...
      
      Exact path to the problem is as follows:
      
       mm/memory_hotplug.c: add_memory_resource()
      
         The node is not online so we enter the 'if (new_node)' twice, on the
         second such block there is a call to link_mem_sections which calls
         into
      
        drivers/node.c: link_mem_sections() which calls
      
        drivers/node.c: register_mem_sect_under_node() which calls
           get_nid_for_pfn and keeps trying until the output of that matches
           the expected node (passed all the way down from
           add_memory_resource)
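
The failure mode in the path above can be modeled with a small sketch. This is a toy, not the kernel code: `toy_page` and `toy_register_sect_under_node` are hypothetical names, and the loop only mirrors the described behavior (keep trying pfns until the nid read from the page matches, unless the check is disabled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the nid stored in a not yet initialized struct page. */
struct toy_page {
	int nid;	/* arbitrary garbage until the page is initialized */
};

/*
 * Simplified model of the loop described above: walk the pfns of a
 * section and register it under `nid` at the first pfn that passes the
 * check.  Returns true if the section got registered.
 */
static bool toy_register_sect_under_node(const struct toy_page *pages,
					 size_t nr_pages, int nid,
					 bool check_nid)
{
	assert(pages != NULL);

	for (size_t pfn = 0; pfn < nr_pages; pfn++) {
		/* Boot path trusts struct page; hotplug path skips the check. */
		if (check_nid && pages[pfn].nid != nid)
			continue;
		return true;	/* the sysfs link would be created here */
	}
	return false;		/* no pfn matched: memory is missing under the node */
}
```

With garbage nids and `check_nid` set to true the function only succeeds if one of the pages happens to hold the expected nid, which matches the "only sometimes manifests" observation; with `check_nid` set to false it always registers.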
      
      It is effectively the same fix as the one referred to in the Fixes tag,
      just in the code path for a new node, where the comments point out that
      we have to rerun the link creation because it will have failed in
      register_new_memory (as there was no node at the time).  (Actually, that
      comment is now wrong: register_new_memory no longer exists, as it was
      renamed to hotplug_memory_register in Pavel's patch.)
      
      Link: http://lkml.kernel.org/r/20180504085311.1240-1-Jonathan.Cameron@huawei.com
      
      
      Fixes: fc44f7f9 ("mm/memory_hotplug: don't read nid from struct page during hotplug")
      Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Apr 11, 2018
    • mm: unclutter THP migration · 94723aaf
      Michal Hocko authored
      THP migration is hacked into the generic migration with rather
      surprising semantics.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and, if that is not the
      case, it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by splitting the THP into small pages while moving the
      head page to the newly allocated order-0 page.  The remaining pages are
      moved to the LRU list by split_huge_page.  The same happens if the THP
      allocation fails.  This is really ugly and error prone [1].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because the tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g. memory hotplug).  There are other pfn
      walkers which are simply broken though.  E.g. madvise_inject_error will
      migrate the head and then advance the next pfn by the huge page size.
      do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
      will simply split the THP before migration if THP migration is not
      supported and then fall back to single page migration, but they don't
      handle tail pages if the THP migration path is not able to allocate a
      fresh THP, so we end up with ENOMEM and fail the whole migration, which
      is questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      This patch tries to unclutter the situation by moving the special THP
      handling up to the migrate_pages layer where it actually belongs.  If
      unmap_and_move fails with ENOMEM, we simply split the THP, leave its
      subpages on the existing list, and retry.  So we will _always_ migrate
      all THP subpages and specific migrate_pages users do not have to deal
      with this case in a special way.
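
The split-and-retry idea can be sketched as follows. This is a toy model under stated assumptions, not the migrate_pages implementation: all `toy_` names, the fixed-size list and the 4-subpage "huge page" are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_HPAGE_NR	4	/* toy value: base pages per huge page */
#define TOY_MAX_PAGES	64

struct toy_page {
	size_t nr_base;		/* 1 = base page, TOY_HPAGE_NR = huge page */
};

struct toy_list {
	struct toy_page pages[TOY_MAX_PAGES];
	size_t len;
};

/* Hypothetical mover: a huge page fails with -ENOMEM if no THP is available. */
static int toy_unmap_and_move(const struct toy_page *p, int can_alloc_huge)
{
	if (p->nr_base > 1 && !can_alloc_huge)
		return -ENOMEM;
	return 0;
}

/* Walk the list; on -ENOMEM for a huge page, split it onto the same list. */
static size_t toy_migrate_pages(struct toy_list *list, int can_alloc_huge)
{
	size_t migrated = 0;

	for (size_t i = 0; i < list->len; i++) {
		struct toy_page *p = &list->pages[i];
		int rc = toy_unmap_and_move(p, can_alloc_huge);

		if (rc == -ENOMEM && p->nr_base > 1) {
			size_t n = p->nr_base;

			/* Subpages are appended and visited later in this walk. */
			assert(list->len + n <= TOY_MAX_PAGES);
			for (size_t j = 0; j < n; j++)
				list->pages[list->len++].nr_base = 1;
			continue;
		}
		if (rc == 0)
			migrated += p->nr_base;
	}
	return migrated;	/* counted in base pages */
}
```

The point of the design is that the caller sees the same number of base pages migrated whether or not a fresh huge page could be allocated; only the list grows when a split happens.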
      
      [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, migrate: remove reason argument from new_page_t · 666feb21
      Michal Hocko authored
      No allocation callback is using this argument anymore.  new_page_node
      used to use this parameter to convey the node_id or the migration error
      up to the move_pages code (do_move_page_to_node_array).  The error status
      never made it into the final status field and we now have a better way
      to communicate the node id to the status field.  All other allocation
      callbacks simply ignored the argument, so we can finally drop it.
      
      [mhocko@suse.com: fix migration callback]
        Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
      [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
      [mhocko@kernel.org: fix build]
        Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Apr 06, 2018
    • mm/memory_hotplug: optimize memory hotplug · d0dc12e8
      Pavel Tatashin authored
      During memory hotplugging we traverse struct pages three times:
      
      1. memset(0) in sparse_add_one_section()
      2. loop in __add_section() to do: set_page_node(page, nid); and
         SetPageReserved(page);
      3. loop in memmap_init_zone() to call __init_single_pfn()
      
      This patch removes the first two loops, and leaves only loop 3.  All
      struct pages are initialized in one place, the same as it is done during
      boot.
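
As a rough illustration of why folding the three traversals into one helps, here is a toy sketch. The `toy_page` layout and both helper names are hypothetical, and the "visits" counter merely stands in for the cache traffic of touching every struct page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct page; the field names are illustrative only. */
struct toy_page {
	int nid;
	bool reserved;
	bool initialized;
};

/* Old scheme: three separate walks over the same array.  Returns visits. */
static size_t toy_init_three_passes(struct toy_page *pages, size_t n, int nid)
{
	size_t visits = 0;

	assert(pages != NULL);
	memset(pages, 0, n * sizeof(*pages));		/* pass 1: zero */
	visits += n;
	for (size_t i = 0; i < n; i++, visits++) {	/* pass 2: nid, reserved */
		pages[i].nid = nid;
		pages[i].reserved = true;
	}
	for (size_t i = 0; i < n; i++, visits++)	/* pass 3: per-page init */
		pages[i].initialized = true;
	return visits;
}

/* New scheme: each page is zeroed and fully set up in a single visit. */
static size_t toy_init_single_pass(struct toy_page *pages, size_t n, int nid)
{
	size_t visits = 0;

	assert(pages != NULL);
	for (size_t i = 0; i < n; i++, visits++) {
		memset(&pages[i], 0, sizeof(pages[i]));
		pages[i].nid = nid;
		pages[i].reserved = true;
		pages[i].initialized = true;
	}
	return visits;
}
```

Both helpers leave the array in the same state; the single-pass variant touches each element once instead of three times.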
      
      The benefits:
      
       - We improve memory hotplug performance because we are not evicting the
         cache several times and also reduce loop branching overhead.
      
       - Remove condition from hotpath in __init_single_pfn(), that was added
         in order to fix the problem that was reported by Bharata in the above
         email thread, thus also improve performance during normal boot.
      
       - Make memory hotplug more similar to the boot memory initialization
         path because we zero and initialize struct pages only in one
         function.
      
       - Simplifies memory hotplug struct page initialization code, and thus
         enables future improvements, such as multi-threading the
         initialization of struct pages in order to improve hotplug
         performance even further on larger machines.
      
      [pasha.tatashin@oracle.com: v5]
        Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.com
      
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: don't read nid from struct page during hotplug · fc44f7f9
      Pavel Tatashin authored
      During memory hotplugging the probe routine will leave struct pages
      uninitialized, the same as it is currently done during boot.  Therefore,
      we do not want to access the inside of struct pages before
      __init_single_page() is called during onlining.
      
      Because during hotplug we know that pages in one memory block belong to
      the same numa node, we can skip the checking.  We should keep checking
      for the boot case.
      
      [pasha.tatashin@oracle.com: s/register_new_memory()/hotplug_memory_register()]
        Link: http://lkml.kernel.org/r/20180228030308.1116-6-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180215165920.8570-6-pasha.tatashin@oracle.com
      
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: enforce block size aligned range check · ba325585
      Pavel Tatashin authored
      Patch series "optimize memory hotplug", v3.
      
      This patchset:
      
       - Improves hotplug performance by eliminating a number of struct page
         traverses during memory hotplug.
      
       - Fixes some issues with hotplugging, where boundaries were not
         properly checked.  Also, on x86 the block size was not properly
         aligned with the end of memory.

       - Also, potentially improves boot performance by eliminating condition
         from __init_single_page().

       - Adds robustness by verifying that struct pages are correctly
         poisoned when flags are accessed.
      
      The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
      2.60GHz with 1T RAM:
      
      Booting in qemu with 960G of memory, time to initialize struct pages:

      no-kvm:
                  TRY1          TRY2
      BEFORE:     39.433668     39.39705
      AFTER:      36.903781     36.989329

      with-kvm:
                  TRY1          TRY2
      BEFORE:     10.977447     11.103164
      AFTER:      10.929072     10.751885

      Hotplug 896G memory:

      no-kvm:
                  TRY1          TRY2
      BEFORE:     848.740000    846.910000
      AFTER:      783.070000    786.560000

      with-kvm:
                  TRY1          TRY2
      BEFORE:     34.410000     33.57
      AFTER:      29.810000     29.580000
      
      This patch (of 6):
      
      Start qemu with the following arguments:
      
        -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
      
      This boots the machine with 64G and adds a device mem1 with 2G, which
      can be hotplugged later.
      
      Also make sure that config has the following turned on:
        CONFIG_MEMORY_HOTPLUG
        CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
        CONFIG_ACPI_HOTPLUG_MEMORY
      
      Using the qemu monitor, hotplug the memory:

        (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
      
      The operation will fail with the following trace:
      
          WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
          pages_correctly_reserved+0xe6/0x110
          Modules linked in:
          CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
          BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
          RIP: 0010:pages_correctly_reserved+0xe6/0x110
          Call Trace:
           memory_subsys_online+0x44/0xa0
           device_online+0x51/0x80
           store_mem_state+0x5e/0xe0
           kernfs_fop_write+0xfa/0x170
           __vfs_write+0x2e/0x150
           vfs_write+0xa8/0x1a0
           SyS_write+0x4d/0xb0
           do_syscall_64+0x5d/0x110
           entry_SYSCALL_64_after_hwframe+0x21/0x86
          ---[ end trace 6203bc4f1a5d30e8 ]---
      
      The problem is detected in: drivers/base/memory.c
      
         static bool pages_correctly_reserved(unsigned long start_pfn)
         205                 if (WARN_ON_ONCE(!pfn_valid(pfn)))
      
      This function loops through every section in the newly added memory
      block and verifies that the first pfn is valid, meaning section exists,
      has mapping (struct page array), and is online.
      
      The block size on x86 is usually 128M, but when the machine is booted
      with more than 64G of memory, the block size is changed to 2G:

         $ cat /sys/devices/system/memory/block_size_bytes
         80000000
      
      or
      
         $ dmesg | grep "block size"
         [    0.086469] x86/mm: Memory block size: 2048MB
      
      During memory hotplug, and hotremove we verify that the range is section
      size aligned, but we actually must verify that it is block size aligned,
      because that is the proper unit for hotplug operations.  See:
      Documentation/memory-hotplug.txt
      
      So, when the start_pfn of newly added memory is not block size aligned,
      we can get a memory block that has only part of it with properly
      populated sections.
      
      In our case the start_pfn starts from the last_pfn (end of physical
      memory).
      
         $ dmesg | grep last_pfn
         [    0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
      
      pfn 0x1040000 corresponds to physical address 0x1040000000 == 65G (with
      4K pages), and so is not 2G aligned!
      
      The fix is to enforce that memory that is hotplugged and hotremoved is
      block size aligned.
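
The arithmetic behind this can be checked with a short sketch. The helper names are hypothetical and 4K pages are assumed; the actual kernel check may differ in detail, this only reproduces the alignment test on the values quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12	/* assumption: 4K pages, as on x86 */

/* Convert a page frame number to a physical address. */
static uint64_t toy_pfn_to_phys(uint64_t pfn)
{
	return pfn << TOY_PAGE_SHIFT;
}

/*
 * True if both the start and the size of the range are multiples of the
 * memory block size (which must be a power of two).
 */
static bool toy_range_block_aligned(uint64_t start, uint64_t size,
				    uint64_t block_size)
{
	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
	return ((start | size) & (block_size - 1)) == 0;
}
```

With last_pfn = 0x1040000 the range starts at 0x1040000000, which leaves a 1G remainder against the 2G (0x80000000) block size, so `toy_range_block_aligned(0x1040000000, 0x80000000, 0x80000000)` is false, while the same size at a start of 64G (0x1000000000) passes.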
      
      With this fix, running the above sequence yields the following result:
      
         (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
         Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
         							size 0x80000000
         acpi PNP0C80:00: add_memory failed
         acpi PNP0C80:00: acpi_memory_enable_device() error
         acpi PNP0C80:00: Enumeration failure
      
      Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
      
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. Feb 01, 2018
  7. Jan 08, 2018
  8. Nov 16, 2017
    • mm, memory_hotplug: remove timeout from __offline_memory · ecde0f3e
      Michal Hocko authored
      We have had a hardcoded 120s timeout, after which the memory offline
      fails, basically since hot remove was introduced.  This is essentially
      a policy implemented in the kernel.  Moreover, there is no way to adjust
      the timeout, so we sometimes face memory offline failures if the system
      is under heavy memory pressure or a very intensive CPU workload on
      large machines.
      
      It is not very clear what purpose the timeout actually serves.  The
      offline operation is interruptible by a signal so if userspace wants
      some timeout based termination this can be done trivially by sending a
      signal.
      
      If there is a strong usecase to do this from the kernel then we should
      do it properly and make it tunable from userspace, with the timeout
      disabled by default, along with an explanation of who uses it and for
      what purpose.
      
      Link: http://lkml.kernel.org/r/20170918070834.13083-3-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: do not fail offlining too early · 72b39cfc
      Michal Hocko authored
      Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2.
      
      While testing memory hotplug on a large 4TB machine we have noticed that
      memory offlining is just too eager to fail.  The primary reason is that
      the retry logic gives up too easily.  We have 4 ways out of the
      offline:
      
      	- we have a permanent failure (isolation or memory notifiers fail,
      	  or hugetlb pages cannot be dropped)
      	- userspace sends a signal
      	- a hardcoded 120s timeout expires
      	- page migration fails 5 times
      
      This is way too convoluted and it doesn't scale very well.  We have seen
      both temporary migration failures as well as 120s being triggered.
      After removing those restrictions we were able to pass stress testing
      during memory hot remove without any other negative side effects
      observed.  Therefore I suggest dropping both hard coded policies.  I
      couldn't find any specific reason for them in the changelog, nor did
      I get any response [1] from Kamezawa.  If we need some
      upper bound - e.g.  timeout based - then we should have a proper and
      user defined policy for that.  In any case there should be a clear use
      case when introducing it.
      
      This patch (of 2):
      
      Memory offlining can fail too eagerly under heavy memory pressure.
      
        page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
        flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
        page dumped because: isolation failed
        page->mem_cgroup:ffff8801cd662000
        memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
      
      Isolation has failed here because the page is not on the LRU, most
      probably because it was on the pcp LRU cache or it has been removed from
      the LRU already but hasn't been freed yet.  In both cases the page
      doesn't look non-migrable, so retrying more makes sense.
      
      __offline_pages seems rather cluttered when it comes to the retry logic.
      We have 5 retries at maximum and a timeout.  We could argue whether the
      timeout makes sense, but failing just because of a race when somebody
      isolates a page from the LRU or puts it on a pcp LRU list is just wrong.
      It only takes a race with a process which unmaps some pages and removes
      them from the LRU list, and we can fail the whole offline because of
      something that is a temporary condition and actually not harmful for the
      offline.
      
      Please note that unmovable pages should be already excluded during
      start_isolate_page_range.  We could argue that has_unmovable_pages is
      racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
      but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
      pages in most cases and movable zone shouldn't contain unmovable pages
      at all.  Some of those pages might be pinned but not for ever because
      that would be a bug on its own.  In any case the context is still
      interruptible and so the userspace can easily bail out when the
      operation takes too long.  This is certainly better behavior than a
      hardcoded retry loop which is racy.
      
      Fix this by removing the max retry count and relying only on the timeout
      or on interruption by a signal from userspace.  Also retry rather
      than fail when check_pages_isolated sees some !free pages because those
      could be a result of the race as well.
      
      Link: http://lkml.kernel.org/r/20170918070834.13083-2-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. Oct 04, 2017
  10. Sep 09, 2017
    • mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory · 5042db43
      Jérôme Glisse authored
      HMM (heterogeneous memory management) needs struct page to support
      migration from system main memory to device memory.  The reasons for HMM
      and migration to device memory are explained in the HMM core patch.

      This patch deals with device memory that is un-addressable (i.e. the CPU
      can not access it).  Hence we do not want those struct pages to be
      managed like regular memory.  That is why we extend ZONE_DEVICE to
      support different types of memory.

      A persistent memory type is defined for the existing user of ZONE_DEVICE
      and a new device un-addressable type is added for the un-addressable
      memory.  There is a clear separation between what is expected from each
      memory type, and existing users of ZONE_DEVICE are unaffected by the new
      requirements and the new use of the un-addressable type.  All specific
      code paths are protected by a test against the memory type.

      Because the memory is un-addressable we use a new special swap type for
      when a page is migrated to device memory (this reduces the maximum
      number of swap files).

      The two main additions to ZONE_DEVICE, besides the memory type, are two
      callbacks.  The first one, page_free(), is called whenever the page
      refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
      never reaches a refcount of 0).  This allows the device driver to manage
      its memory and the associated struct pages.

      The second callback, page_fault(), is called when there is a CPU access
      to an address that is backed by a device page (which is un-addressable
      by the CPU).  This callback is responsible for migrating the page back
      to system main memory.  The device driver can not block migration back
      to system memory; HMM makes sure that such pages can not be pinned into
      device memory.

      If the device is in some error condition and can not migrate memory back
      then a CPU page fault to device memory should end with SIGBUS.
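
The two callbacks can be illustrated with a toy model. Every name below (`toy_pagemap`, `toy_put_page`, `toy_cpu_access`, the driver callbacks) is hypothetical and only mirrors the behavior described above; it is not the kernel's dev_pagemap interface.

```c
#include <assert.h>
#include <stddef.h>

enum toy_memory_type {
	TOY_DEVICE_PERSISTENT,
	TOY_DEVICE_UNADDRESSABLE,
};

struct toy_page;

/* Per-device description: memory type plus the two driver callbacks. */
struct toy_pagemap {
	enum toy_memory_type type;
	void (*page_free)(struct toy_page *page);
	int (*page_fault)(struct toy_page *page);
};

struct toy_page {
	int refcount;		/* a device page is free at 1, never 0 */
	int on_device;		/* 1 while the data lives in device memory */
	int free_calls;		/* how often page_free() ran, for the demo */
	const struct toy_pagemap *pgmap;
};

/* Drop a reference; hand the page back to the driver when it hits 1. */
static void toy_put_page(struct toy_page *page)
{
	assert(page != NULL && page->refcount > 1);
	if (--page->refcount == 1 && page->pgmap->page_free)
		page->pgmap->page_free(page);
}

/* CPU access: returns 0 if it can proceed, -1 where SIGBUS would be raised. */
static int toy_cpu_access(struct toy_page *page)
{
	assert(page != NULL);
	if (page->pgmap->type != TOY_DEVICE_UNADDRESSABLE || !page->on_device)
		return 0;
	if (!page->pgmap->page_fault)
		return -1;
	return page->pgmap->page_fault(page);
}

/* Example driver callbacks. */
static void toy_driver_page_free(struct toy_page *page)
{
	page->free_calls++;
}

static int toy_driver_page_fault(struct toy_page *page)
{
	page->on_device = 0;	/* pretend the page migrated to system memory */
	return 0;
}
```

Dropping a page from a refcount of 3 to 1 triggers the driver's free callback exactly once, and a CPU access to a page still on the device goes through the fault callback, which moves it back before the access proceeds.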
      
      [arnd@arndb.de: fix warning]
        Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
      
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memory_hotplug: memory hotremove supports thp migration · 8135d892
      Naoya Horiguchi authored
      This patch enables thp migration for memory hotremove.
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com
      
      
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. Sep 07, 2017
    • mm, memory_hotplug: get rid of zonelists_mutex · b93e0f32
      Michal Hocko authored
      zonelists_mutex was introduced by commit 4eaf3f64 ("mem-hotplug: fix
      potential race while building zonelist for new populated zone") to
      protect zonelist building from races.  This is no longer needed though
      because both memory online and offline are fully serialized.  New users
      have grown since then.
      
      Notably setup_per_zone_wmarks wants to prevent races between memory
      hotplug, khugepaged setup and manual min_free_kbytes update via sysctl
      (see cfd3da1e ("mm: Serialize access to min_free_kbytes")).  Let's
      add a private lock for that purpose.  This will not prevent seeing a
      memory hotplug operation halfway through, but that shouldn't be a big
      deal because memory hotplug will update watermarks explicitly so we will
      eventually get a full picture.  The lock just makes sure we won't race
      when updating watermarks leading to weird results.
      
      Also __build_all_zonelists manipulates global data so add a private lock
      for it as well.  This doesn't seem to be necessary today but it is more
      robust to have a lock there.
      
      While we are at it make sure we document that memory online/offline
      depends on a full serialization either via mem_hotplug_begin() or
      device_lock.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Haicheng Li <haicheng.li@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node · 34ad1296
      Michal Hocko authored
      try_online_node calls hotadd_new_pgdat which already calls
      build_all_zonelists.  So the additional call is redundant.  Even though
      hotadd_new_pgdat will only initialize zonelists of the new node this is
      the right thing to do because such a node doesn't have any memory so
      other zonelists would ignore all the zones from this node anyway.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34ad1296
    • Michal Hocko's avatar
      mm, memory_hotplug: drop zone from build_all_zonelists · 72675e13
      Michal Hocko authored
      build_all_zonelists gets a zone parameter to initialize zone's pagesets.
      There is only a single user which gives a non-NULL zone parameter and
      that one doesn't really need the rest of the build_all_zonelists (see
      commit 6dcd73d7 ("memory-hotplug: allocate zone's pcp before
      onlining pages")).
      
      Therefore remove setup_zone_pageset from build_all_zonelists and call it
      from its only user directly.  This will also remove a pointless zonelists
      rebuilding which is always good.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72675e13
    • Michal Hocko's avatar
      mm, memory_hotplug: remove zone restrictions · c6f03e29
      Michal Hocko authored
      Historically we have enforced that any kernel zone (e.g. ZONE_NORMAL) has
      to precede the Movable zone in the physical memory range.  The purpose
      of the movable zone is, however, not bound to any physical memory
      restriction.  It merely defines a class of migratable and reclaimable
      memory.
      
      There are users (e.g.  CMA) who might want to reserve specific physical
      memory ranges for their own purpose.  Moreover our pfn walkers have to
      be prepared for zones overlapping in the physical range already because
      we do support interleaving NUMA nodes and therefore zones can interleave
      as well.  This means we can allow each memory block to be associated
      with a different zone.
      
      Loosen the current onlining semantic and allow explicit onlining type on
      any memblock.  That means that online_{kernel,movable} will be allowed
      regardless of the physical address of the memblock, as long as it is
      offline of course.  This might result in the movable zone overlapping with
      other kernel zones.  Default onlining then becomes a bit tricky but
      still sensible.  echo online > memoryXY/state will online the given
      block to
      
      	1) the default zone if the given range is outside of any zone
      	2) the enclosing zone if such a zone doesn't interleave with
      	   any other zone
      	3) the default zone if more zones interleave for this range
      
      where the default zone is the movable zone only if movable_node is
      enabled, otherwise it is a kernel zone.
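The three rules can be modelled in a few lines. This is a hedged user-space sketch in Python, not the kernel implementation: zone spans are assumed to be half-open ranges of memory block numbers, and `intersects` and `default_zone` are hypothetical helper names.

```python
def intersects(span, start, end):
    """True if the half-open block range [start, end) overlaps the zone span.

    `span` is a (first_block, last_block + 1) tuple, or None for an
    empty zone.
    """
    return span is not None and start < span[1] and span[0] < end


def default_zone(kernel_span, movable_span, start, end, movable_node=False):
    """Pick the zone that a plain 'echo online' would use for [start, end)."""
    in_kernel = intersects(kernel_span, start, end)
    in_movable = intersects(movable_span, start, end)

    # Rule 2: exactly one zone covers the range, so inherit that zone.
    if in_kernel != in_movable:
        return "Normal" if in_kernel else "Movable"

    # Rules 1 and 3: the range is in no zone, or in both.  Use the default
    # zone, which is Movable only when movable_node is enabled.
    return "Movable" if movable_node else "Normal"
```

For instance, with Normal spanning blocks 34-39 and Movable spanning blocks 37-41, `default_zone((34, 40), (37, 42), 38, 39)` falls under rule 3 and `default_zone((34, 40), (37, 42), 40, 41)` falls under rule 2.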
      
      Here is an example of the semantic (movable_node is not present, but
      it works in an analogous way).  We start with the following memblocks,
      all of them offline:
      
        memory34/valid_zones:Normal Movable
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Normal Movable
        memory38/valid_zones:Normal Movable
        memory39/valid_zones:Normal Movable
        memory40/valid_zones:Normal Movable
        memory41/valid_zones:Normal Movable
      
      Now, we online block 34 in default mode and block 37 as movable
      
        root@test1:/sys/devices/system/node/node1# echo online > memory34/state
        root@test1:/sys/devices/system/node/node1# echo online_movable > memory37/state
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Movable
        memory38/valid_zones:Normal Movable
        memory39/valid_zones:Normal Movable
        memory40/valid_zones:Normal Movable
        memory41/valid_zones:Normal Movable
      
      As we can see all other blocks can still be onlined both into Normal and
      Movable zones and the Normal is default because the Movable zone spans
      only block37 now.
      
        root@test1:/sys/devices/system/node/node1# echo online_movable > memory41/state
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Movable
        memory38/valid_zones:Movable Normal
        memory39/valid_zones:Movable Normal
        memory40/valid_zones:Movable Normal
        memory41/valid_zones:Movable
      
      Now the default zone for blocks 37-41 has changed because movable zone
      spans that range.
      
        root@test1:/sys/devices/system/node/node1# echo online_kernel > memory39/state
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Movable
        memory38/valid_zones:Normal Movable
        memory39/valid_zones:Normal
        memory40/valid_zones:Movable Normal
        memory41/valid_zones:Movable
      
      Note that the block 39 now belongs to the zone Normal and so block38
      falls into Normal by default as well.
      
      For completeness
      
        root@test1:/sys/devices/system/node/node1# for i in memory[34]?
        do
      	echo online > $i/state 2>/dev/null
        done
      
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal
        memory36/valid_zones:Normal
        memory37/valid_zones:Movable
        memory38/valid_zones:Normal
        memory39/valid_zones:Normal
        memory40/valid_zones:Movable
        memory41/valid_zones:Movable
      
      Implementation wise the change is quite straightforward.  We can get rid
      of allow_online_pfn_range altogether.  online_pages allows only offline
      nodes already.  The original default_zone_for_pfn will become
      default_kernel_zone_for_pfn.  New default_zone_for_pfn implements the
      above semantic.  zone_for_pfn_range is slightly reorganized to implement
      kernel and movable online type explicitly and MMOP_ONLINE_KEEP becomes a
      catch all default behavior.
      
      Link: http://lkml.kernel.org/r/20170714121233.16861-3-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6f03e29
    • Michal Hocko's avatar
      mm, memory_hotplug: display allowed zones in the preferred ordering · e5e68930
      Michal Hocko authored
      Prior to commit f1dd2cd1 ("mm, memory_hotplug: do not associate
      hotadded memory to zones until online") we used to allow changing the
      valid zone types of a memory block if it was adjacent to a different
      zone type.
      
      This fact was reflected in memoryNN/valid_zones by the ordering of
      printed zones.  The first one was default (echo online > memoryNN/state)
      and the other one could be onlined explicitly by online_{movable,kernel}.
      
      This behavior was removed by the said patch and as such the ordering was
      not all that important.  In most cases a kernel zone would be default
      anyway.  The only exception is movable_node handled by "mm,
      memory_hotplug: support movable_node for hotpluggable nodes".
      
      Let's reintroduce this behavior because a later patch will remove
      the zone overlap restriction and so the user will be allowed to online
      a kernel resp. movable block regardless of its placement.  The original
      behavior will then become significant again because it would be
      non-trivial for users to see what the default zone to online into is.
      
      Implementation is really simple.  Pull zone selection out of
      move_pfn_range into a zone_for_pfn_range helper and use it in
      show_valid_zones to display the zone for default onlining and then both
      kernel and movable if they are allowed.  The default online zone is not
      duplicated.
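The ordering described above can be sketched as follows. This is an illustrative Python model, not the kernel's show_valid_zones; the function name `valid_zones` and its boolean parameters are assumptions made for the example.

```python
def valid_zones(default, kernel_allowed=True, movable_allowed=True):
    """Build the valid_zones string: default zone first, no duplicates.

    `default` is the zone a plain 'echo online' would pick
    ("Normal" or "Movable" in this toy model).
    """
    zones = [default]
    if kernel_allowed and "Normal" not in zones:
        zones.append("Normal")
    if movable_allowed and "Movable" not in zones:
        zones.append("Movable")
    return " ".join(zones)
```

A reader of memoryNN/valid_zones can then treat the first word as the zone used by default onlining and the rest as zones reachable only through online_kernel or online_movable.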
      
      Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e5e68930
  12. Jul 10, 2017
    • Thomas Gleixner's avatar
      mm/memory-hotplug: switch locking to a percpu rwsem · 3f906ba2
      Thomas Gleixner authored
      Andrey reported a potential deadlock with the memory hotplug lock and
      the cpu hotplug lock.
      
      The reason is that memory hotplug takes the memory hotplug lock and then
      calls stop_machine() which calls get_online_cpus().  That's the reverse
      lock order to get_online_cpus(); get_online_mems(); in mm/slab_common.c
      
      The problem has been there forever.  The reason why this was never
      reported is that the cpu hotplug locking had this homebrewed recursive
      reader writer semaphore construct which, due to the recursion, evaded the
      full lockdep coverage.  The memory hotplug code copied that construct
      verbatim and therefore has similar issues.
      
      Three steps to fix this:
      
      1) Convert the memory hotplug locking to a per cpu rwsem so the
         potential issues get reported properly by lockdep.
      
      2) Lock the online cpus in mem_hotplug_begin() before taking the memory
         hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
         code to avoid recursive locking.
      
      3) The cpu hotplug locking in #2 causes a recursive locking of the cpu
         hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
         by invoking lru_add_drain_all_cpuslocked() instead.
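To illustrate why a consistent lock order matters here, the following is a toy order checker in Python. It is only a sketch in the spirit of what lockdep reports, not how lockdep is implemented; `LockOrderChecker` and the lock names are hypothetical.

```python
class LockOrderChecker:
    """Record the order in which named locks are nested and flag inversions."""

    def __init__(self):
        self.edges = set()  # (held, acquired) pairs observed so far
        self.held = []      # locks currently held by the simulated context

    def acquire(self, name):
        # An inversion exists if `name` was previously taken *before* a
        # lock that is held right now.
        inversions = [(name, h) for h in self.held if (name, h) in self.edges]
        for h in self.held:
            self.edges.add((h, name))
        self.held.append(name)
        return inversions

    def release(self, name):
        self.held.remove(name)
```

Replaying the two paths from the description, one taking the cpu hotplug lock and then the memory hotplug lock, the other taking them in the opposite order, makes the second path's inner `acquire` return a non-empty inversion list. Taking the cpu hotplug lock first in both paths, as step 2 does, keeps the list empty.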
      
      Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de
      
      
      Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f906ba2
    • John Hubbard's avatar
      mm/memory_hotplug.c: remove unused local zone_type from __remove_zone() · a52149f1
      John Hubbard authored
      __remove_zone() sets up zone_type, but never uses it for anything.
      This does not cause a warning, due to the (necessary) use of
      -Wno-unused-but-set-variable.  However, it's noise, so just delete it.
      
      Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com
      
      
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a52149f1
    • Michal Hocko's avatar
      mm: unify new_node_page and alloc_migrate_target · 8b913238
      Michal Hocko authored
      Commit 394e31d2 ("mem-hotplug: alloc new page from a nearest
      neighbor node when mem-offline") has duplicated a large part of
      alloc_migrate_target with some hotplug specific special casing.
      
      To be more precise it tried to enforce the allocation from a different
      node than the original page.  As a result the two functions diverged in
      their shared logic, e.g. the hugetlb allocation strategy.
      
      Let's unify the two and express different NUMA requirements by the given
      nodemask.  new_node_page will simply exclude the node it doesn't care
      about and alloc_migrate_target will use all the available nodes.
      alloc_migrate_target will then learn to migrate hugetlb pages more
      sanely and use preallocated pool when possible.
      
      Please note that alloc_migrate_target used to call alloc_page resp.
      alloc_pages_current, so it used the memory policy of the current context,
      which is quite strange when we consider that it is used in the context of
      alloc_contig_range which just tries to migrate pages which stand in the
      way.
      
      Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b913238
    • Michal Hocko's avatar
      hugetlb, memory_hotplug: prefer to use reserved pages for migration · 4db9b2ef
      Michal Hocko authored
      new_node_page will try to use the origin's next NUMA node as the
      migration destination for hugetlb pages.  If such a node doesn't have
      any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
      to allocate a surplus page instead.  This is quite suboptimal for any
      configuration where hugetlb pages are not distributed to all NUMA nodes
      evenly.  Say we have a hotpluggable node 4 and spare hugetlb pages are
      on node 0
      
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
      
      Now we consume the whole pool on node 4 and try to offline this node.
      All the allocated pages should be moved to node0 which has enough
      preallocated pages to hold them.  With the current implementation
      offlining very likely fails because hugetlb allocations during runtime
      are much less reliable.
      
      Fix this by reusing the nodemask which excludes the migration source and
      try to find the first node which has a page in the preallocated pool,
      falling back to __alloc_buddy_huge_page_no_mpol only when the whole pool
      is consumed.
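A rough Python model of that selection order, assuming the per-node preallocated pool is just a dictionary of free page counts. `pick_hugepage_node` is a hypothetical name, and the "buddy" result only marks where the runtime allocation fallback would be attempted.

```python
def pick_hugepage_node(free_pool, source_node):
    """Choose where a migrated hugetlb page should come from.

    `free_pool` maps node id -> number of free preallocated huge pages.
    Returns (node, "pool") when a preallocated page is taken, or
    (None, "buddy") when every pool outside the source node is empty and
    a runtime allocation would have to be attempted instead.
    """
    for node in sorted(free_pool):
        if node == source_node:
            continue  # the nodemask excludes the migration source
        if free_pool[node] > 0:
            free_pool[node] -= 1  # consume one preallocated page
            return node, "pool"
    return None, "buddy"
```

With the pool layout from the example above, offlining node 4 would keep drawing pages from node 0 until its pool is exhausted, instead of going straight to the less reliable runtime allocation.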
      
      [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
      Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4db9b2ef
    • Michal Hocko's avatar
      mm, memory_hotplug: simplify empty node mask handling in new_node_page · 7f252f27
      Michal Hocko authored
      new_node_page tries to allocate the target page on a different NUMA node
      than the source page.  This makes sense in most cases during the hotplug
      because we are likely to offline the whole numa node.  But there are
      cases where there are no other nodes to fall back to (e.g. when offlining
      parts of the only existing node) and we have to fall back to allocating
      from the source node.  The current code does that but it can be
      simplified by checking the nmask and updating it before we even try to
      allocate rather than special casing it.
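The simplification amounts to fixing up the mask before allocating. A minimal Python sketch, with node masks modelled as sets and `migration_nodemask` as a hypothetical helper name:

```python
def migration_nodemask(nodes_with_memory, source_node):
    """Return the set of nodes a migration target may be allocated from."""
    # Prefer any node other than the one being offlined.
    nmask = set(nodes_with_memory) - {source_node}
    # If nothing is left (e.g. a single-node system), fall back to the
    # source node up front instead of special casing a failed allocation.
    if not nmask:
        nmask = {source_node}
    return nmask
```

The allocation code that follows can then use the mask unconditionally.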
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f252f27
    • Michal Hocko's avatar
      mm, memory_hotplug: support movable_node for hotpluggable nodes · 9f123ab5
      Michal Hocko authored
      The movable_node kernel parameter makes hotpluggable NUMA nodes put all
      the hotpluggable memory into the movable zone, which allows more or
      less reliable memory hotremove.  At least this is the case for the NUMA
      nodes present during the boot (see find_zone_movable_pfns_for_nodes).
      
      This is not the case for the memory hotplug, though.
      
      	echo online > /sys/devices/system/memory/memoryXYZ/state
      
      will default to a kernel zone (usually ZONE_NORMAL) unless the
      particular memblock is already in the movable zone range which is not
      the case normally when onlining the memory from the udev rule context
      for a freshly hotadded NUMA node.  The only option currently is to have
      a special udev rule to echo online_movable to all memblocks belonging to
      such a node which is rather clumsy.  Not to mention this is inconsistent
      as well because what ended up in the movable zone during the boot will
      end up in a kernel zone after hotremove & hotadd without special care.
      
      It would be nice to reuse memblock_is_hotpluggable but the runtime
      hotplug doesn't have that information available because the boot and
      hotplug paths are not shared and it would be really non-trivial to make
      them use the same code path because the runtime hotplug doesn't play
      with the memblock allocator at all.
      
      Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
      movable_node is enabled and the range doesn't overlap with the existing
      normal zone.  This should provide a reasonable default onlining
      strategy.
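Read literally, the default onlining rule described in this message reduces to a small decision function. The Python below is a sketch of that reading only, not the kernel's move_pfn_range; the function name and its three boolean inputs are assumptions.

```python
def online_keep_zone(movable_node, overlaps_normal, inside_movable):
    """Zone chosen for a plain 'echo online' (MMOP_ONLINE_KEEP) in this model.

    movable_node    -- the movable_node kernel parameter is enabled
    overlaps_normal -- the range overlaps the existing normal zone
    inside_movable  -- the range already lies in the movable zone range
    """
    if inside_movable:
        return "Movable"  # behavior that already existed before the change
    if movable_node and not overlaps_normal:
        return "Movable"  # the new default for freshly hotadded nodes
    return "Normal"
```

Without movable_node the result for a fresh range stays a kernel zone, so existing setups that rely on the old default are not affected in this model.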
      
      Strictly speaking the semantic is not identical with the boot time
      initialization because find_zone_movable_pfns_for_nodes covers only the
      hotpluggable range as described by the BIOS/FW.  From my experience this
      is usually a full node though (except for Node0 which is special and
      never goes away completely).  If this turns out to be a problem in
      real life we can tweak the code to store a hotplug flag into memblocks
      but let's keep this simple now.
      
      Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f123ab5
    • Gustavo A. R. Silva's avatar
      mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereference · dbac61a3
      Gustavo A. R. Silva authored
      The NULL check at line 1226: if (!pgdat), implies that pointer pgdat
      might be NULL.
      
      rollback_node_hotadd() dereferences this pointer.  Add NULL check to
      avoid a potential NULL pointer dereference.
      
      Addresses-Coverity-ID: 1369133
      Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus
      
      
      Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbac61a3