Skip to content
Snippets Groups Projects
  1. Feb 04, 2020
    • David Hildenbrand's avatar
      mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone() · d33695b1
      David Hildenbrand authored
      Let's poison the pages similar to when adding new memory in
      sparse_add_section().  Also call remove_pfn_range_from_zone() from
      memunmap_pages(), so we can poison the memmap from there as well.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-7-david@redhat.com
      
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d33695b1
    • Aneesh Kumar K.V's avatar
      mm/memmap_init: update variable name in memmap_init_zone · 1f8d75c1
      Aneesh Kumar K.V authored
      Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6.
      
      This series fixes the access of uninitialized memmaps when shrinking
      zones/nodes and when removing memory.  Also, it contains all fixes for
      crashes that can be triggered when removing certain namespace using
      memunmap_pages() - ZONE_DEVICE, reported by Aneesh.
      
      We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
      more involved (we don't have SECTION_IS_ONLINE as an indicator), and
      shrinking is only of limited use (set_zone_contiguous() cannot detect the
      ZONE_DEVICE as contiguous).
      
      We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of
      code to a minimum.  Shrinking is especially necessary to keep
      zone->contiguous set where possible, especially, on memory unplug of DIMMs
      at zone boundaries.
      
      --------------------------------------------------------------------------
      
      Zones are now properly shrunk when offlining memory blocks or when
      onlining failed.  This allows to properly shrink zones on memory unplug
      even if the separate memory blocks of a DIMM were onlined to different
      zones or re-onlined to a different zone after offlining.
      
      Example:
      
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  0
              present  0
              managed  0
      :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
      :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  98304
              present  65536
              managed  65536
      :/# echo 0 > /sys/devices/system/memory/memory43/online
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  32768
              present  32768
              managed  32768
      :/# echo 0 > /sys/devices/system/memory/memory41/online
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  0
              present  0
              managed  0
      
      This patch (of 6):
      
      The third argument is actually number of pages.  Change the variable name
      from size to nr_pages to indicate this better.
      
      No functional change in this patch.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-3-david@redhat.com
      
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarPankaj Gupta <pagupta@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f8d75c1
    • David Hildenbrand's avatar
      mm: factor out next_present_section_nr() · 4c605881
      David Hildenbrand authored
      Let's move it to the header and use the shorter variant from
      mm/page_alloc.c (the original one will also check
      "__highest_present_section_nr + 1", which is not necessary).  While at
      it, make the section_nr in next_pfn() const.
      
      In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
      exceed __highest_present_section_nr, which doesn't make a difference in
      the caller as it is big enough (>= all sane end_pfn).
      
      Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
      
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Jin, Zhi" <zhi.jin@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4c605881
    • David Hildenbrand's avatar
      mm/page_alloc: fix and rework pfn handling in memmap_init_zone() · 948c436e
      David Hildenbrand authored
      Let's update the pfn manually whenever we continue the loop.  This makes
      the code easier to read but also less error prone (and we can directly fix
      one issue).
      
      When overlap_memmap_init() returns true, pfn is updated to
      "memblock_region_memory_end_pfn(r)".  So it already points at the *next*
      pfn to process.  Incrementing the pfn another time is wrong, we might
      leave one uninitialized.  I spotted this by inspecting the code, so I have
      no idea if this is relevant in practise (with kernelcore=mirror).
      
      Link: http://lkml.kernel.org/r/20200113144035.10848-2-david@redhat.com
      
      
      Fixes: a9a9e77f ("mm: move mirrored memory specific code outside of memmap_init_zone")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Jin, Zhi" <zhi.jin@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      948c436e
    • David Hildenbrand's avatar
      mm/page_alloc.c: initialize memmap of unavailable memory directly · 4b094b78
      David Hildenbrand authored
      Let's make sure that all memory holes are actually marked PageReserved(),
      that page_to_pfn() produces reliable results, and that these pages are not
      detected as "mmap" pages due to the mapcount.
      
      E.g., booting a x86-64 QEMU guest with 4160 MB:
      
      [    0.010585] Early memory node ranges
      [    0.010586]   node   0: [mem 0x0000000000001000-0x000000000009efff]
      [    0.010588]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
      [    0.010589]   node   0: [mem 0x0000000100000000-0x0000000143ffffff]
      
      max_pfn is 0x144000.
      
      Before this change:
      
      [root@localhost ~]# ./page-types -r -a 0x144000,
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000800           16384       64  ___________M_______________________________        mmap
                   total           16384       64
      
      After this change:
      
      [root@localhost ~]# ./page-types -r -a 0x144000,
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000100000000           16384       64  ___________________________r_______________        reserved
                   total           16384       64
      
      IOW, especially the unavailable physical memory ("memory hole") in the
      last section would not get properly marked PageReserved() and is indicated
      to be "mmap" memory.
      
      Drop the trace of that function from include/linux/mm.h - nobody else
      needs it, and rename it accordingly.
      
      Note: The fake zone/node might not be covered by the zone/node span.  This
      is not an urgent issue (for now, we had the same node/zone due to the
      zeroing).  We'll need a clean way to mark memory holes (e.g., using a page
      type PageHole() if possible or a fake ZONE_INVALID) and eventually stop
      marking these memory holes PageReserved().
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.com
      
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b094b78
    • David Hildenbrand's avatar
      mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section · e822969c
      David Hildenbrand authored
      Patch series "mm: fix max_pfn not falling on section boundary", v2.
      
      Playing with different memory sizes for a x86-64 guest, I discovered that
      some memmaps (highest section if max_mem does not fall on the section
      boundary) are marked as being valid and online, but contain garbage.  We
      have to properly initialize these memmaps.
      
      Looking at /proc/kpageflags and friends, I found some more issues,
      partially related to this.
      
      This patch (of 3):
      
      If max_pfn is not aligned to a section boundary, we can easily run into
      BUGs.  This can e.g., be triggered on x86-64 under QEMU by specifying a
      memory size that is not a multiple of 128MB (e.g., 4097MB, but also
      4160MB).  I was told that on real HW, we can easily have this scenario
      (esp., one of the main reasons sub-section hotadd of devmem was added).
      
      The issue is, that we have a valid memmap (pfn_valid()) for the whole
      section, and the whole section will be marked "online".
      pfn_to_online_page() will succeed, but the memmap contains garbage.
      
      E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
      4160M" - (see tools/vm/page-types.c):
      
      [  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
      [  200.477500] #PF: supervisor read access in kernel mode
      [  200.478334] #PF: error_code(0x0000) - not-present page
      [  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
      [  200.479557] Oops: 0000 [#4] SMP NOPTI
      [  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
      [  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
      [  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
      [  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
      [  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
      [  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
      [  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
      [  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
      [  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
      [  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
      [  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
      [  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
      [  200.488897] Call Trace:
      [  200.489115]  kpageflags_read+0xe9/0x140
      [  200.489447]  proc_reg_read+0x3c/0x60
      [  200.489755]  vfs_read+0xc2/0x170
      [  200.490037]  ksys_pread64+0x65/0xa0
      [  200.490352]  do_syscall_64+0x5c/0xa0
      [  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
      after cold/hot plugging a DIMM to such a system:
      
      [root@localhost ~]# cat /proc/kpageflags > /dev/null
      [  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
      [  111.517907] #PF: supervisor read access in kernel mode
      [  111.518333] #PF: error_code(0x0000) - not-present page
      [  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
      
      This patch fixes that by at least zero-ing out that memmap (so e.g.,
      page_to_pfn() will not crash).  Commit 907ec5fc ("mm: zero remaining
      unavailable struct pages") tried to fix a similar issue, but forgot to
      consider this special case.
      
      After this patch, there are still problems to solve.  E.g., not all of
      these pages falling into a memory hole will actually get initialized later
      and set PageReserved - they are only zeroed out - but at least the
      immediate crashes are gone.  A follow-up patch will take care of this.
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
      
      
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Tested-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: <stable@vger.kernel.org>	[4.15+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e822969c
  2. Jan 31, 2020
Loading