Skip to content
Snippets Groups Projects
  1. Apr 07, 2020
  2. Apr 02, 2020
    • Mike Kravetz's avatar
      hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization · c0d0381a
      Mike Kravetz authored
      Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
      
      While discussing the issue with huge_pte_offset [1], I remembered that
      there were more outstanding hugetlb races.  These issues are:
      
      1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
         invalid via a call to huge_pmd_unshare by another thread.
      2) hugetlbfs page faults can race with truncation causing invalid global
         reserve counts and state.
      
      A previous attempt was made to use i_mmap_rwsem in this manner as
      described at [2].  However, those patches were reverted starting with [3]
      due to locking issues.
      
      To effectively use i_mmap_rwsem to address the above issues it needs to be
      held (in read mode) during page fault processing.  However, during fault
      processing we need to lock the page we will be adding.  Lock ordering
      requires we take page lock before i_mmap_rwsem.  Waiting until after
      taking the page lock is too late in the fault process for the
      synchronization we want to do.
      
      To address this lock ordering issue, the following patches change the lock
      ordering for hugetlb pages.  This is not too invasive as hugetlbfs
      processing is done separate from core mm in many places.  However, I don't
      really like this idea.  Much ugliness is contained in the new routine
      hugetlb_page_mapping_lock_write() of patch 1.
      
      The only other way I can think of to address these issues is by catching
      all the races.  After catching a race, cleanup, backout, retry ...  etc,
      as needed.  This can get really ugly, especially for huge page
      reservations.  At one time, I started writing some of the reservation
      backout code for page faults and it got so ugly and complicated I went
      down the path of adding synchronization to avoid the races.  Any other
      suggestions would be welcome.
      
      [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
      [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
      [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
      [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
      [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
      
      
      
      This patch (of 2):
      
      While looking at BUGs associated with invalid huge page map counts, it was
      discovered and observed that a huge pte pointer could become 'invalid' and
      point to another task's page table.  Consider the following:
      
      A task takes a page fault on a shared hugetlbfs file and calls
      huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
      shared pmd.
      
      Now, another task truncates the hugetlbfs file.  As part of truncation, it
      unmaps everyone who has the file mapped.  If the range being truncated is
      covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
      last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
      to the pmd.  If the task in the middle of the page fault is not the last
      user, the ptep returned by huge_pte_alloc now points to another task's
      page table or worse.  This leads to bad things such as incorrect page
      map/reference counts or invalid memory references.
      
      To fix, expand the use of i_mmap_rwsem as follows:
      - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
        huge_pmd_share is only called via huge_pte_alloc, so callers of
        huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
        of huge_pte_alloc continue to hold the semaphore until finished with
        the ptep.
      - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
      
      One problem with this scheme is that it requires taking i_mmap_rwsem
      before taking the page lock during page faults.  This is not the order
      specified in the rest of mm code.  Handling of hugetlbfs pages is mostly
      isolated today.  Therefore, we use this alternative locking order for
      PageHuge() pages.
      
               mapping->i_mmap_rwsem
                 hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
                   page->flags PG_locked (lock_page)
      
      To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
      introduced to write lock the i_mmap_rwsem associated with a page.
      
      In most cases it is easy to get address_space via vma->vm_file->f_mapping.
      However, in the case of migration or memory errors for anon pages we do
      not have an associated vma.  A new routine _get_hugetlb_page_mapping()
      will use anon_vma to get address_space in these cases.
      
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c0d0381a
  3. Mar 26, 2020
  4. Feb 04, 2020
    • Steven Price's avatar
      mm: pagewalk: add 'depth' parameter to pte_hole · b7a16c7a
      Steven Price authored
      The pte_hole() callback is called at multiple levels of the page tables.
      Code dumping the kernel page tables needs to know what at what depth the
      missing entry is.  Add this is an extra parameter to pte_hole().  When the
      depth isn't know (e.g.  processing a vma) then -1 is passed.
      
      The depth that is reported is the actual level where the entry is missing
      (ignoring any folding that is in place), i.e.  any levels where
      PTRS_PER_P?D is set to 1 are ignored.
      
      Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
      natural numbers as levels 2/3/4.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
      
      
      Signed-off-by: default avatarSteven Price <steven.price@arm.com>
      Tested-by: default avatarZong Li <zong.li@sifive.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b7a16c7a
  5. Jan 31, 2020
  6. Jan 04, 2020
  7. Dec 01, 2019
  8. Sep 26, 2019
  9. Sep 24, 2019
  10. Sep 07, 2019
  11. Aug 20, 2019
  12. Aug 03, 2019
  13. Jul 19, 2019
  14. Jul 06, 2019
    • Linus Torvalds's avatar
      Revert "mm: page cache: store only head pages in i_pages" · 69bf4b6b
      Linus Torvalds authored
      This reverts commit 5fd4ca2d.
      
      Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
      __delete_from_swap_cache() to trigger:
      
         page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
         anon
         flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
         raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
         raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
         page dumped because: VM_BUG_ON_PAGE(entry != page)
         page->mem_cgroup:ffff978433ace000
         ------------[ cut here ]------------
         kernel BUG at mm/swap_state.c:170!
         invalid opcode: 0000 [#1] SMP NOPTI
         CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
         Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
         RIP: 0010:__delete_from_swap_cache+0x20d/0x240
         Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
         RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
         RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
         RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
         RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
         R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
         R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
         FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
         Call Trace:
          delete_from_swap_cache+0x46/0xa0
          try_to_free_swap+0xbc/0x110
          swap_writepage+0x13/0x70
          pageout.isra.0+0x13c/0x350
          shrink_page_list+0xc14/0xdf0
          shrink_inactive_list+0x1e5/0x3c0
          shrink_node_memcg+0x202/0x760
          shrink_node+0xe0/0x470
          balance_pgdat+0x2d1/0x510
          kswapd+0x220/0x420
          kthread+0xfb/0x130
          ret_from_fork+0x22/0x40
      
      and it's not immediately obvious why it happens.  It's too late in the
      rc cycle to do anything but revert for now.
      
      Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
      
      
      Reported-and-bisected-by: default avatarMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Suggested-by: default avatarJan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69bf4b6b
  15. Jul 02, 2019
  16. May 14, 2019
  17. Mar 29, 2019
    • Lars Persson's avatar
      mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate · d2b2c6dd
      Lars Persson authored
      Our MIPS 1004Kc SoCs were seeing random userspace crashes with SIGILL
      and SIGSEGV that could not be traced back to a userspace code bug.  They
      had all the magic signs of an I/D cache coherency issue.
      
      Now recently we noticed that the /proc/sys/vm/compact_memory interface
      was quite efficient at provoking this class of userspace crashes.
      
      Studying the code in mm/migrate.c there is a distinction made between
      migrating a page that is mapped at the instant of migration and one that
      is not mapped.  Our problem turned out to be the non-mapped pages.
      
      For the non-mapped page the code performs a copy of the page content and
      all relevant meta-data of the page without doing the required D-cache
      maintenance.  This leaves dirty data in the D-cache of the CPU and on
      the 1004K cores this data is not visible to the I-cache.  A subsequent
      page-fault that triggers a mapping of the page will happily serve the
      process with potentially stale code.
      
      What about ARM then, this bug should have seen greater exposure? Well
      ARM became immune to this flaw back in 2010, see commit c0177800
      ("ARM: 6379/1: Assume new page cache pages have dirty D-cache").
      
      My proposed fix moves the D-cache maintenance inside move_to_new_page to
      make it common for both cases.
      
      Link: http://lkml.kernel.org/r/20190315083502.11849-1-larper@axis.com
      
      
      Fixes: 97ee0524 ("flush cache before installing new page at migraton")
      Signed-off-by: default avatarLars Persson <larper@axis.com>
      Reviewed-by: default avatarPaul Burton <paul.burton@mips.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2b2c6dd
  18. Mar 06, 2019
  19. Mar 01, 2019
    • Mike Kravetz's avatar
      hugetlbfs: fix races and page leaks during migration · cb6acd01
      Mike Kravetz authored
      hugetlb pages should only be migrated if they are 'active'.  The
      routines set/clear_page_huge_active() modify the active state of hugetlb
      pages.
      
      When a new hugetlb page is allocated at fault time, set_page_huge_active
      is called before the page is locked.  Therefore, another thread could
      race and migrate the page while it is being added to page table by the
      fault code.  This race is somewhat hard to trigger, but can be seen by
      strategically adding udelay to simulate worst case scheduling behavior.
      Depending on 'how' the code races, various BUG()s could be triggered.
      
      To address this issue, simply delay the set_page_huge_active call until
      after the page is successfully added to the page table.
      
      Hugetlb pages can also be leaked at migration time if the pages are
      associated with a file in an explicitly mounted hugetlbfs filesystem.
      For example, consider a two node system with 4GB worth of huge pages
      available.  A program mmaps a 2G file in a hugetlbfs filesystem.  It
      then migrates the pages associated with the file from one node to
      another.  When the program exits, huge page counts are as follows:
      
        node0
        1024    free_hugepages
        1024    nr_hugepages
      
        node1
        0       free_hugepages
        1024    nr_hugepages
      
        Filesystem                         Size  Used Avail Use% Mounted on
        nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool
      
      That is as expected.  2G of huge pages are taken from the free_hugepages
      counts, and 2G is the size of the file in the explicitly mounted
      filesystem.  If the file is then removed, the counts become:
      
        node0
        1024    free_hugepages
        1024    nr_hugepages
      
        node1
        1024    free_hugepages
        1024    nr_hugepages
      
        Filesystem                         Size  Used Avail Use% Mounted on
        nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool
      
      Note that the filesystem still shows 2G of pages used, while there
      actually are no huge pages in use.  The only way to 'fix' the filesystem
      accounting is to unmount the filesystem
      
      If a hugetlb page is associated with an explicitly mounted filesystem,
      this information in contained in the page_private field.  At migration
      time, this information is not preserved.  To fix, simply transfer
      page_private from old to new page at migration time if necessary.
      
      There is a related race with removing a huge page from a file and
      migration.  When a huge page is removed from the pagecache, the
      page_mapping() field is cleared, yet page_private remains set until the
      page is actually freed by free_huge_page().  A page could be migrated
      while in this state.  However, since page_mapping() is not set the
      hugetlbfs specific routine to transfer page_private is not called and we
      leak the page count in the filesystem.
      
      To fix that, check for this condition before migrating a huge page.  If
      the condition is detected, return EBUSY for the page.
      
      Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
      Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
      
      
      Fixes: bcc54222 ("mm: hugetlb: introduce page_huge_active")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>
      [mike.kravetz@oracle.com: v2]
        Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
      [mike.kravetz@oracle.com: update comment and changelog]
        Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
      
      
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cb6acd01
  20. Feb 01, 2019
    • David Hildenbrand's avatar
      mm: migrate: don't rely on __PageMovable() of newpage after unlocking it · e0a352fa
      David Hildenbrand authored
      We had a race in the old balloon compaction code before b1123ea6
      ("mm: balloon: use general non-lru movable page feature") refactored it
      that became visible after backporting 195a8c43 ("virtio-balloon:
      deflate via a page list") without the refactoring.
      
      The bug existed from commit d6d86c0a ("mm/balloon_compaction:
      redesign ballooned pages management") till b1123ea6 ("mm: balloon:
      use general non-lru movable page feature").  d6d86c0a
      ("mm/balloon_compaction: redesign ballooned pages management") was
      backported to 3.12, so the broken kernels are stable kernels [3.12 -
      4.7].
      
      There was a subtle race between dropping the page lock of the newpage in
      __unmap_and_move() and checking for __is_movable_balloon_page(newpage).
      
      Just after dropping this page lock, virtio-balloon could go ahead and
      deflate the newpage, effectively dequeueing it and clearing PageBalloon,
      in turn making __is_movable_balloon_page(newpage) fail.
      
      This resulted in dropping the reference of the newpage via
      putback_lru_page(newpage) instead of put_page(newpage), leading to
      page->lru getting modified and a !LRU page ending up in the LRU lists.
      With 195a8c43 ("virtio-balloon: deflate via a page list")
      backported, one would suddenly get corrupted lists in
      release_pages_balloon():
      
      - WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
      - list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100
      
      Nowadays this race is no longer possible, but it is hidden behind very
      ugly handling of __ClearPageMovable() and __PageMovable().
      
      __ClearPageMovable() will not make __PageMovable() fail, only
      PageMovable().  So the new check (__PageMovable(newpage)) will still
      hold even after newpage was dequeued by virtio-balloon.
      
      If anybody would ever change that special handling, the BUG would be
      introduced again.  So instead, make it explicit and use the information
      of the original isolated page before migration.
      
      This patch can be backported fairly easy to stable kernels (in contrast
      to the refactoring).
      
      Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com
      
      
      Fixes: d6d86c0a ("mm/balloon_compaction: redesign ballooned pages management")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reported-by: default avatarVratislav Bendel <vbendel@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vratislav Bendel <vbendel@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>	[3.12 - 4.7]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0a352fa
Loading