Skip to content
Snippets Groups Projects
  1. Oct 14, 2020
    • Dan Williams's avatar
      mm/memremap_pages: convert to 'struct range' · a4574f63
      Dan Williams authored
      The 'struct resource' in 'struct dev_pagemap' is only used for holding
      resource span information.  The other fields, 'name', 'flags', 'desc',
      'parent', 'sibling', and 'child' are all unused wasted space.
      
      This is in preparation for introducing a multi-range extension of
      devm_memremap_pages().
      
      The bulk of this change is unwinding all the places internal to libnvdimm
      that used 'struct resource' unnecessarily, and replacing instances of
      'struct dev_pagemap'.res with 'struct dev_pagemap'.range.
      
      P2PDMA had a minor usage of the resource flags field, but only to report
      failures with "%pR".  That is replaced with an open coded print of the
      range.
      
      [dan.carpenter@oracle.com: mm/hmm/test: use after free in dmirror_allocate_chunk()]
        Link: https://lkml.kernel.org/r/20200926121402.GA7467@kadam
      
      
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>	[xen]
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hulk Robot <hulkci@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Yan <yanaijie@huawei.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lkml.kernel.org/r/159643103173.4062302.768998885691711532.stgit@dwillia2-desk3.amr.corp.intel.com
      Link: https://lkml.kernel.org/r/160106115761.30709.13539840236873663620.stgit@dwillia2-desk3.amr.corp.intel.com
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4574f63
  2. Sep 04, 2020
  3. Apr 10, 2020
    • Logan Gunthorpe's avatar
      mm/memremap: set caching mode for PCI P2PDMA memory to WC · a50d8d98
      Logan Gunthorpe authored
      
      PCI BAR IO memory should never be mapped as WB, however prior to this
      the PAT bits were set WB and it was typically overridden by MTRR
      registers set by the firmware.
      
      Set PCI P2PDMA memory to be UC as this is what it currently, typically,
      ends up being mapped as on x86 after the MTRR registers override the
      cache setting.
      
      Future use-cases may need to generalize this by adding flags to select
      the caching type, as some P2PDMA cases may not want UC.  However, those
      use-cases are not upstream yet and this can be changed when they arrive.
      
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-8-logang@deltatee.com
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a50d8d98
    • Logan Gunthorpe's avatar
      mm/memory_hotplug: add pgprot_t to mhp_params · bfeb022f
      Logan Gunthorpe authored
      
      devm_memremap_pages() is currently used by the PCI P2PDMA code to create
      struct page mappings for IO memory.  At present, these mappings are
      created with PAGE_KERNEL which implies setting the PAT bits to be WB.
      However, on x86, an mtrr register will typically override this and force
      the cache type to be UC-.  In the case firmware doesn't set this
      register it is effectively WB and will typically result in a machine
      check exception when it's accessed.
      
      Other arches are not currently likely to function correctly seeing they
      don't have any MTRR registers to fall back on.
      
      To solve this, provide a way to specify the pgprot value explicitly to
      arch_add_memory().
      
      Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a
      simple change to pass the pgprot_t down to their respective functions
      which set up the page tables.  For x86_32, set the page tables
      explicitly using _set_memory_prot() (seeing they are already mapped).
      
      For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
      should be fine, for now, seeing these architectures don't support
      ZONE_DEVICE.
      
      A check in __add_pages() is also added to ensure the pgprot parameter
      was set for all arches.
      
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bfeb022f
    • Logan Gunthorpe's avatar
      mm/memory_hotplug: rename mhp_restrictions to mhp_params · f5637d3b
      Logan Gunthorpe authored
      
      The mhp_restrictions struct really doesn't specify anything resembling a
      restriction anymore so rename it to be mhp_params as it is a list of
      extended parameters.
      
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f5637d3b
  4. Mar 26, 2020
  5. Feb 21, 2020
    • Dan Williams's avatar
      mm/memremap_pages: Introduce memremap_compat_align() · 9ffc1d19
      Dan Williams authored
      The "sub-section memory hotplug" facility allows memremap_pages() users
      like libnvdimm to compensate for hardware platforms like x86 that have a
      section size larger than their hardware memory mapping granularity.  The
      compensation that sub-section support affords is being tolerant of
      physical memory resources shifting by units smaller (64MiB on x86) than
      the memory-hotplug section size (128 MiB). Where the platform
      physical-memory mapping granularity is limited by the number and
      capability of address-decode-registers in the memory controller.
      
      While the sub-section support allows memremap_pages() to operate on
      sub-section (2MiB) granularity, the Power architecture may still
      require 16MiB alignment on "!radix_enabled()" platforms.
      
      In order for libnvdimm to be able to detect and manage this per-arch
      limitation, introduce memremap_compat_align() as a common minimum
      alignment across all driver-facing memory-mapping interfaces, and let
      Power override it to 16MiB in the "!radix_enabled()" case.
      
      The assumption / requirement for 16MiB to be a viable
      memremap_compat_align() value is that Power does not have platforms
      where its equivalent of address-decode-registers never hardware remaps a
      persistent memory resource on smaller than 16MiB boundaries. Note that I
      tried my best to not add a new Kconfig symbol, but header include
      entanglements defeated the #ifndef memremap_compat_align design pattern
      and the need to export it defeats the __weak design pattern for arch
      overrides.
      
      Based on an initial patch by Aneesh.
      
      Link: http://lore.kernel.org/r/CAPcyv4gBGNP95APYaBcsocEa50tQj9b5h__83vgngjq3ouGX_Q@mail.gmail.com
      
      
      Reported-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      9ffc1d19
  6. Feb 04, 2020
  7. Jan 31, 2020
    • John Hubbard's avatar
      mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages · 07d80269
      John Hubbard authored
      An upcoming patch changes and complicates the refcounting and especially
      the "put page" aspects of it.  In order to keep everything clean,
      refactor the devmap page release routines:
      
      * Rename put_devmap_managed_page() to page_is_devmap_managed(), and
        limit the functionality to "read only": return a bool, with no side
        effects.
      
      * Add a new routine, put_devmap_managed_page(), to handle decrementing
        the refcount for ZONE_DEVICE pages.
      
      * Change callers (just release_pages() and put_page()) to check
        page_is_devmap_managed() before calling the new
        put_devmap_managed_page() routine.  This is a performance point:
        put_page() is a hot path, so we need to avoid non- inline function calls
        where possible.
      
      * Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
        limit the functionality to unconditionally freeing a devmap page.
      
      This is originally based on a separate patch by Ira Weiny, which applied
      to an early version of the put_user_page() experiments.  Since then,
      Jérôme Glisse suggested the refactoring described above.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.com
      
      
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Suggested-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      07d80269
    • Dan Williams's avatar
      mm: Cleanup __put_devmap_managed_page() vs ->page_free() · 429589d6
      Dan Williams authored
      After the removal of the device-public infrastructure there are only 2
      ->page_free() call backs in the kernel.  One of those is a
      device-private callback in the nouveau driver, the other is a generic
      wakeup needed in the DAX case.  In the hopes that all ->page_free()
      callbacks can be migrated to common core kernel functionality, move the
      device-private specific actions in __put_devmap_managed_page() under the
      is_device_private_page() conditional, including the ->page_free()
      callback.  For the other page types just open-code the generic wakeup.
      
      Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
      does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
      case.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-4-jhubbard@nvidia.com
      
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      429589d6
  8. Jan 04, 2020
    • David Hildenbrand's avatar
      mm/memory_hotplug: shrink zones when offlining memory · feee6b29
      David Hildenbrand authored
      We currently try to shrink a single zone when removing memory.  We use
      the zone of the first page of the memory we are removing.  If that
      memmap was never initialized (e.g., memory was never onlined), we will
      read garbage and can trigger kernel BUGs (due to a stale pointer):
      
          BUG: unable to handle page fault for address: 000000000000353d
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:clear_zone_contiguous+0x5/0x10
          Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
          RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
          RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
          RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
          R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
          FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           __remove_pages+0x4b/0x640
           arch_remove_memory+0x63/0x8d
           try_remove_memory+0xdb/0x130
           __remove_memory+0xa/0x11
           acpi_memory_device_remove+0x70/0x100
           acpi_bus_trim+0x55/0x90
           acpi_device_hotplug+0x227/0x3a0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x221/0x550
           worker_thread+0x50/0x3b0
           kthread+0x105/0x140
           ret_from_fork+0x3a/0x50
          Modules linked in:
          CR2: 000000000000353d
      
      Instead, shrink the zones when offlining memory or when onlining failed.
      Introduce and use remove_pfn_range_from_zone(() for that.  We now
      properly shrink the zones, even if we have DIMMs whereby
      
       - Some memory blocks fall into no zone (never onlined)
      
       - Some memory blocks fall into multiple zones (offlined+re-onlined)
      
       - Multiple memory blocks that fall into different zones
      
      Drop the zone parameter (with a potential dubious value) from
      __remove_pages() and __remove_section().
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
      
      
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      feee6b29
  9. Oct 19, 2019
    • Aneesh Kumar K.V's avatar
      mm/memunmap: don't access uninitialized memmap in memunmap_pages() · 77e080e7
      Aneesh Kumar K.V authored
      Patch series "mm/memory_hotplug: Shrink zones before removing memory",
      v6.
      
      This series fixes the access of uninitialized memmaps when shrinking
      zones/nodes and when removing memory.  Also, it contains all fixes for
      crashes that can be triggered when removing certain namespace using
      memunmap_pages() - ZONE_DEVICE, reported by Aneesh.
      
      We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
      more involved (we don't have SECTION_IS_ONLINE as an indicator), and
      shrinking is only of limited use (set_zone_contiguous() cannot detect
      the ZONE_DEVICE as contiguous).
      
      We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount
      of code to a minimum.  Shrinking is especially necessary to keep
      zone->contiguous set where possible, especially, on memory unplug of
      DIMMs at zone boundaries.
      
      --------------------------------------------------------------------------
      
      Zones are now properly shrunk when offlining memory blocks or when
      onlining failed.  This allows to properly shrink zones on memory unplug
      even if the separate memory blocks of a DIMM were onlined to different
      zones or re-onlined to a different zone after offlining.
      
      Example:
      
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  0
                present  0
                managed  0
        :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
        :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  98304
                present  65536
                managed  65536
        :/# echo 0 > /sys/devices/system/memory/memory43/online
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  32768
                present  32768
                managed  32768
        :/# echo 0 > /sys/devices/system/memory/memory41/online
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  0
                present  0
                managed  0
      
      This patch (of 10):
      
      With an altmap, the memmap falling into the reserved altmap space are not
      initialized and, therefore, contain a garbage NID and a garbage zone.
      Make sure to read the NID/zone from a memmap that was initialized.
      
      This fixes a kernel crash that is observed when destroying a namespace:
      
        kernel BUG at include/linux/mm.h:1107!
        cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
            pc: c0000000004b9728: memunmap_pages+0x238/0x340
            lr: c0000000004b9724: memunmap_pages+0x234/0x340
        ...
            pid   = 3669, comm = ndctl
        kernel BUG at include/linux/mm.h:1107!
          devm_action_release+0x30/0x50
          release_nodes+0x268/0x2d0
          device_release_driver_internal+0x174/0x240
          unbind_store+0x13c/0x190
          drv_attr_store+0x44/0x60
          sysfs_kf_write+0x70/0xa0
          kernfs_fop_write+0x1ac/0x290
          __vfs_write+0x3c/0x70
          vfs_write+0xe4/0x200
          ksys_write+0x7c/0x140
          system_call+0x5c/0x68
      
      The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f ("mm,
      devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
      think we will never have driver reserved memory with
      MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).
      
      [david@redhat.com: minimze code changes, rephrase description]
      Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
      
      
      Fixes: 2c2a5af6 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      77e080e7
  10. Oct 07, 2019
  11. Aug 13, 2019
    • Ralph Campbell's avatar
      mm/hmm: fix ZONE_DEVICE anon page mapping reuse · 7ab0ad0e
      Ralph Campbell authored
      When a ZONE_DEVICE private page is freed, the page->mapping field can be
      set.  If this page is reused as an anonymous page, the previous value
      can prevent the page from being inserted into the CPU's anon rmap table.
      For example, when migrating a pte_none() page to device memory:
      
        migrate_vma(ops, vma, start, end, src, dst, private)
          migrate_vma_collect()
            src[] = MIGRATE_PFN_MIGRATE
          migrate_vma_prepare()
            /* no page to lock or isolate so OK */
          migrate_vma_unmap()
            /* no page to unmap so OK */
          ops->alloc_and_copy()
            /* driver allocates ZONE_DEVICE page for dst[] */
          migrate_vma_pages()
            migrate_vma_insert_page()
              page_add_new_anon_rmap()
                __page_set_anon_rmap()
                  /* This check sees the page's stale mapping field */
                  if (PageAnon(page))
                    return
                  /* page->mapping is not updated */
      
      The result is that the migration appears to succeed but a subsequent CPU
      fault will be unable to migrate the page back to system memory or worse.
      
      Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
      pointer data doesn't affect future page use.
      
      Link: http://lkml.kernel.org/r/20190719192955.30462-3-rcampbell@nvidia.com
      
      
      Fixes: b7a52310 ("mm: don't clear ->mapping in hmm_devmem_free")
      Signed-off-by: default avatarRalph Campbell <rcampbell@nvidia.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ab0ad0e
  12. Aug 09, 2019
  13. Aug 03, 2019
  14. Jul 19, 2019
  15. Jul 02, 2019
  16. Jun 14, 2019
  17. May 14, 2019
  18. Dec 28, 2018
  19. Oct 26, 2018
  20. Oct 21, 2018
  21. Aug 24, 2018
  22. Aug 17, 2018
    • Andrey Ryabinin's avatar
      kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN · 0207df4f
      Andrey Ryabinin authored
      KASAN learns about hotadded memory via the memory hotplug notifier.
      devm_memremap_pages() intentionally skips calling memory hotplug
      notifiers.  So KASAN doesn't know anything about new memory added by
      devm_memremap_pages().  This causes a crash when KASAN tries to access
      non-existent shadow memory:
      
       BUG: unable to handle kernel paging request at ffffed0078000000
       RIP: 0010:check_memory_region+0x82/0x1e0
       Call Trace:
        memcpy+0x1f/0x50
        pmem_do_bvec+0x163/0x720
        pmem_make_request+0x305/0xac0
        generic_make_request+0x54f/0xcf0
        submit_bio+0x9c/0x370
        submit_bh_wbc+0x4c7/0x700
        block_read_full_page+0x5ef/0x870
        do_read_cache_page+0x2b8/0xb30
        read_dev_sector+0xbd/0x3f0
        read_lba.isra.0+0x277/0x670
        efi_partition+0x41a/0x18f0
        check_partition+0x30d/0x5e9
        rescan_partitions+0x18c/0x840
        __blkdev_get+0x859/0x1060
        blkdev_get+0x23f/0x810
        __device_add_disk+0x9c8/0xde0
        pmem_attach_disk+0x9a8/0xf50
        nvdimm_bus_probe+0xf3/0x3c0
        driver_probe_device+0x493/0xbd0
        bus_for_each_drv+0x118/0x1b0
        __device_attach+0x1cd/0x2b0
        bus_probe_device+0x1ac/0x260
        device_add+0x90d/0x1380
        nd_async_device_register+0xe/0x50
        async_run_entry_fn+0xc3/0x5d0
        process_one_work+0xa0a/0x1810
        worker_thread+0x87/0xe80
        kthread+0x2d7/0x390
        ret_from_fork+0x3a/0x50
      
      Add kasan_add_zero_shadow()/kasan_remove_zero_shadow() - post mm_init()
      interface to map/unmap kasan_zero_page at requested virtual addresses.
      And use it to add/remove the shadow memory for hotplugged/unplugged
      device memory.
      
      Link: http://lkml.kernel.org/r/20180629164932.740-1-aryabinin@virtuozzo.com
      
      
      Fixes: 41e94a85 ("add devm_memremap_pages")
      Signed-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Tested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0207df4f
  23. Jul 27, 2018
Loading