  1. Aug 26, 2020
    • Huacai Chen's avatar
      rtc: goldfish: Enable interrupt in set_alarm() when necessary · 59e8bcc1
      Huacai Chen authored
      
      [ Upstream commit 22f8d5a1 ]
      
      When using goldfish rtc, the "hwclock" command fails with "select() to
      /dev/rtc to wait for clock tick timed out". This is because "hwclock"
      needs the set_alarm() hook to enable the interrupt when alrm->enabled is
      true. This operation is missing in goldfish rtc (other rtc drivers, such
      as cmos rtc, enable the interrupt here), so add it.
      
      Signed-off-by: Huacai Chen <chenhc@lemote.com>
      Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/1592654683-31314-1-git-send-email-chenhc@lemote.com
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      59e8bcc1
    • Chuhong Yuan's avatar
      media: budget-core: Improve exception handling in budget_register() · 87f1b49e
      Chuhong Yuan authored
      
      [ Upstream commit fc045645 ]
      
      budget_register() lacks error handling when a later registration step
      fails. Add the missing undo functions to its error path to fix this.
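
      A sketch of the goto-unwind idiom the fix applies; the init/release
      pairs shown are illustrative uses of the DVB core API, not the exact
      sequence in budget-core:

      static int budget_register(struct budget *budget)
      {
      	int ret;

      	ret = dvb_dmx_init(&budget->demux);
      	if (ret < 0)
      		return ret;

      	ret = dvb_dmxdev_init(&budget->dmxdev, &budget->dvb_adapter);
      	if (ret < 0)
      		goto err_release_dmx;

      	ret = dvb_net_init(&budget->dvb_adapter, &budget->dvb_net,
      			   &budget->demux.dmx);
      	if (ret < 0)
      		goto err_release_dmxdev;

      	return 0;

      err_release_dmxdev:
      	dvb_dmxdev_release(&budget->dmxdev);
      err_release_dmx:
      	dvb_dmx_release(&budget->demux);
      	return ret;
      }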
      
      Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
      Signed-off-by: Sean Young <sean@mess.org>
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      87f1b49e
    • Stanley Chu's avatar
      scsi: ufs: Add DELAY_BEFORE_LPM quirk for Micron devices · 046f3765
      Stanley Chu authored
      [ Upstream commit c0a18ee0 ]
      
      It is confirmed that Micron devices need the DELAY_BEFORE_LPM quirk to
      have a delay before VCC is powered off. Add the Micron vendor ID and
      this quirk for Micron devices.
      
      Link: https://lore.kernel.org/r/20200612012625.6615-2-stanley.chu@mediatek.com
      
      
      Reviewed-by: Bean Huo <beanhuo@micron.com>
      Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com>
      Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      046f3765
    • Lukas Wunner's avatar
      spi: Prevent adding devices below an unregistering controller · 413a0634
      Lukas Wunner authored
      
      [ Upstream commit ddf75be4 ]
      
      CONFIG_OF_DYNAMIC and CONFIG_ACPI allow adding SPI devices at runtime
      using a DeviceTree overlay or DSDT patch.  CONFIG_SPI_SLAVE allows the
      same via sysfs.
      
      But there are no precautions to prevent adding a device below a
      controller that's being removed.  Such a device is unusable and may not
      even be able to unbind cleanly as it becomes inaccessible once the
      controller has been torn down.  E.g. it is then impossible to quiesce
      the device's interrupt.
      
      of_spi_notify() and acpi_spi_notify() do hold a ref on the controller,
      but otherwise run lockless against spi_unregister_controller().
      
      Fix by holding the spi_add_lock in spi_unregister_controller() and
      bailing out of spi_add_device() if the controller has been unregistered
      concurrently.
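
      A sketch of both sides of the fix, assuming the global spi_add_lock
      mutex in drivers/spi/spi.c (surrounding code abbreviated):

      /* spi_add_device() */
      	mutex_lock(&spi_add_lock);
      	...
      	/* the controller may be unregistering concurrently */
      	if (!device_is_registered(&ctlr->dev)) {
      		status = -ENODEV;
      		goto done;
      	}
      	...
      done:
      	mutex_unlock(&spi_add_lock);

      /* spi_unregister_controller() */
      	/* prevent addition of new devices below this controller */
      	mutex_lock(&spi_add_lock);
      	device_unregister(&ctlr->dev);
      	mutex_unlock(&spi_add_lock);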
      
      Fixes: ce79d54a ("spi/of: Add OF notifier handler")
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: stable@vger.kernel.org # v3.19+
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Octavian Purdila <octavian.purdila@intel.com>
      Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
      Link: https://lore.kernel.org/r/a8c3205088a969dc8410eec1eba9aface60f36af.1596451035.git.lukas@wunner.de
      
      
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      413a0634
    • zhangyi (F)'s avatar
      jbd2: add the missing unlock_buffer() in the error path of jbd2_write_superblock() · 81545a2a
      zhangyi (F) authored
      
      commit ef3f5830 upstream.
      
      jbd2_write_superblock() holds the buffer lock of the journal superblock
      until that superblock write completes, so add the missing unlock_buffer()
      in the error path taken before the buffer is submitted.
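
      A minimal sketch of the fixed check inside jbd2_write_superblock():

      	if (!buffer_mapped(bh)) {
      		/* bail out, but drop the buffer lock taken above */
      		unlock_buffer(bh);
      		return -EIO;
      	}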
      
      Fixes: 742b06b5 ("jbd2: check superblock mapped prior to committing")
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/r/20200620061948.2049579-1-yi.zhang@huawei.com
      
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      81545a2a
    • Jan Kara's avatar
      ext4: fix checking of directory entry validity for inline directories · f9d723d0
      Jan Kara authored
      
      commit 7303cb5b upstream.
      
      ext4_search_dir() and ext4_generic_delete_entry() can be called both for
      standard directory blocks and for inline directories stored inside the
      inode or inline xattr space. In the second case we didn't call
      ext4_check_dir_entry() with the proper constraints, which could result in
      accepting a corrupted directory entry as well as false-positive filesystem
      errors like:
      
      EXT4-fs error (device dm-0): ext4_search_dir:1395: inode #28320400:
      block 113246792: comm dockerd: bad entry in directory: directory entry too
      close to block end - offset=0, inode=28320403, rec_len=32, name_len=8,
      size=4096
      
      Fix the arguments passed to ext4_check_dir_entry().
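
      A sketch of the shape of the fix in ext4_search_dir(), where search_buf
      and buf_size describe the buffer actually being searched (a directory
      block, or an inline directory in the inode body or xattr space):

      	/* before: always assumed a standard directory block */
      	if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data,
      				 dir->i_sb->s_blocksize, offset))
      		...

      	/* after: constrain the check to the real buffer */
      	if (ext4_check_dir_entry(dir, NULL, de, bh, search_buf,
      				 buf_size, offset))
      		...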
      
      Fixes: 109ba779 ("ext4: check for directory entries too close to block end")
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20200731162135.8080-1-jack@suse.cz
      
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f9d723d0
    • Charan Teja Reddy's avatar
      mm, page_alloc: fix core hung in free_pcppages_bulk() · 0063bb82
      Charan Teja Reddy authored
      commit 88e8ac11 upstream.
      
      The following race is observed with repeated online, offline, and a
      delay between two successive onlines of memory blocks of the movable
      zone.
      
      P1						P2
      
      Online the first memory block in
      the movable zone. The pcp struct
      values are initialized to default
      values, i.e., pcp->high = 0 &
      pcp->batch = 1.
      
      					Allocate the pages from the
      					movable zone.
      
      Try to online the second memory
      block in the movable zone; it has
      entered online_pages() but has yet
      to call zone_pcp_update().
      					This process enters the exit
      					path and tries to release its
      					order-0 pages to the pcp lists
      					through
      					free_unref_page_commit().
      					As pcp->high = 0 and
      					pcp->count = 1, it proceeds
      					to call free_pcppages_bulk().
      Update the pcp values thus the
      new pcp values are like, say,
      pcp->high = 378, pcp->batch = 63.
      					Read the pcp's batch value using
      					READ_ONCE() and pass the same to
      					free_pcppages_bulk(), pcp values
      					passed here are, batch = 63,
      					count = 1.
      
      					Since the number of pages in
      					the pcp lists is less than
      					->batch, it gets stuck in the
      					while (list_empty(list)) loop
      					with interrupts disabled, thus
      					hanging the core.
      
      Avoid this by ensuring free_pcppages_bulk() is called with the proper
      count of pcp list pages.
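
      A minimal rendering of the fix: clamp the requested count to what the
      pcp lists actually hold before entering the freeing loop:

      static void free_pcppages_bulk(struct zone *zone, int count,
      			       struct per_cpu_pages *pcp)
      {
      	/*
      	 * Ensure proper count is passed which otherwise would stick in the
      	 * below while (list_empty(list)) loop.
      	 */
      	count = min(pcp->count, count);
      	while (count) {
      		...
      	}
      }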
      
      The mentioned race is somewhat easily reproducible without [1] because
      the pcp's are not updated for the first memory block online, thus there
      is a wide enough race window for P2 between alloc+free and the pcp
      struct values update through the onlining of the second memory block.
      
      With [1], the race still exists but it is very narrow as we update the pcp
      struct values for the first memory block online itself.
      
      This is not limited to the movable zone, it could also happen in cases
      with the normal zone (e.g., hotplug to a node that only has DMA memory, or
      no other memory yet).
      
      [1]: https://patchwork.kernel.org/patch/11696389/
      
      
      
      Fixes: 5f8dcc21 ("page-allocator: split per-cpu list into one-list-per-migrate-type")
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: <stable@vger.kernel.org> [2.6+]
      Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0063bb82
    • Doug Berger's avatar
      mm: include CMA pages in lowmem_reserve at boot · d93b51bc
      Doug Berger authored
      
      commit e08d3fdf upstream.
      
      The lowmem_reserve arrays provide a means of applying pressure against
      allocations from lower zones that were targeted at higher zones.  Its
      values are a function of the number of pages managed by higher zones and
      are assigned by a call to the setup_per_zone_lowmem_reserve() function.
      
      The function is initially called at boot time by the function
      init_per_zone_wmark_min() and may be called later by accesses of the
      /proc/sys/vm/lowmem_reserve_ratio sysctl file.
      
      The function init_per_zone_wmark_min() was moved up from a module_init to
      a core_initcall to resolve a sequencing issue with khugepaged.
      Unfortunately this created a sequencing issue with CMA page accounting.
      
      The CMA pages are added to the managed page count of a zone when
      cma_init_reserved_areas() is called at boot also as a core_initcall.  This
      makes it uncertain whether the CMA pages will be added to the managed page
      counts of their zones before or after the call to
      init_per_zone_wmark_min() as it becomes dependent on link order.  With the
      current link order the pages are added to the managed count after the
      lowmem_reserve arrays are initialized at boot.
      
      This means the lowmem_reserve values at boot may be lower than the values
      used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
      ratio values are unchanged.
      
      In many cases the difference is not significant, but for example
      an ARM platform with 1GB of memory and the following memory layout
      
        cma: Reserved 256 MiB at 0x0000000030000000
        Zone ranges:
          DMA      [mem 0x0000000000000000-0x000000002fffffff]
          Normal   empty
          HighMem  [mem 0x0000000030000000-0x000000003fffffff]
      
      would result in 0 lowmem_reserve for the DMA zone.  This would allow
      userspace to deplete the DMA zone easily.
      
      Funnily enough
      
        $ cat /proc/sys/vm/lowmem_reserve_ratio
      
      would fix up the situation because, as a side effect, it forces a call
      to setup_per_zone_lowmem_reserve().
      
      This commit breaks the link order dependency by invoking
      init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
      have the chance to be properly accounted in their zone(s) and allowing
      the lowmem_reserve arrays to receive consistent values.
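
      The change itself is a one-line bump of the initcall level in
      mm/page_alloc.c (sketched as a diff):

      -core_initcall(init_per_zone_wmark_min)
      +postcore_initcall(init_per_zone_wmark_min)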
      
      Fixes: bc22af74 ("mm: update min_free_kbytes from khugepaged after core initialization")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d93b51bc
    • Wei Yongjun's avatar
      kernel/relay.c: fix memleak on destroy relay channel · 6b7be0bc
      Wei Yongjun authored
      
      commit 71e84329 upstream.
      
      kmemleak reports a memory leak as follows:
      
        unreferenced object 0x607ee4e5f948 (size 8):
        comm "syz-executor.1", pid 2098, jiffies 4295031601 (age 288.468s)
        hex dump (first 8 bytes):
        00 00 00 00 00 00 00 00 ........
        backtrace:
           relay_open kernel/relay.c:583 [inline]
           relay_open+0xb6/0x970 kernel/relay.c:563
           do_blk_trace_setup+0x4a8/0xb20 kernel/trace/blktrace.c:557
           __blk_trace_setup+0xb6/0x150 kernel/trace/blktrace.c:597
           blk_trace_ioctl+0x146/0x280 kernel/trace/blktrace.c:738
           blkdev_ioctl+0xb2/0x6a0 block/ioctl.c:613
           block_ioctl+0xe5/0x120 fs/block_dev.c:1871
           vfs_ioctl fs/ioctl.c:48 [inline]
           __do_sys_ioctl fs/ioctl.c:753 [inline]
           __se_sys_ioctl fs/ioctl.c:739 [inline]
           __x64_sys_ioctl+0x170/0x1ce fs/ioctl.c:739
           do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      'chan->buf' is allocated in relay_open() by alloc_percpu() but never
      freed when the relay channel is destroyed. Fix it by adding free_percpu()
      before returning from relay_destroy_channel().
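
      A minimal sketch of the fixed function:

      static void relay_destroy_channel(struct kref *kref)
      {
      	struct rchan *chan = container_of(kref, struct rchan, kref);

      	free_percpu(chan->buf);		/* the missing free */
      	kfree(chan);
      }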
      
      Fixes: 017c59c0 ("relay: Use per CPU constructs for the relay channel buffer pointers")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Akash Goel <akash.goel@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200817122826.48518-1-weiyongjun1@huawei.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6b7be0bc
    • Jann Horn's avatar
      romfs: fix uninitialized memory leak in romfs_dev_read() · 89346bc3
      Jann Horn authored
      
      commit bcf85fce upstream.
      
      romfs has a superblock field that limits the size of the filesystem; data
      beyond that limit is never accessed.
      
      romfs_dev_read() fetches a caller-supplied number of bytes from the
      backing device.  It returns 0 on success or an error code on failure;
      therefore, its API can't represent short reads, it's all-or-nothing.
      
      However, when romfs_dev_read() detects that the requested operation would
      cross the filesystem size limit, it currently silently truncates the
      requested number of bytes.  This e.g.  means that when the content of a
      file with size 0x1000 starts one byte before the filesystem size limit,
      ->readpage() will only fill a single byte of the supplied page while
      leaving the rest uninitialized, leaking that uninitialized memory to
      userspace.
      
      Fix it by returning an error code instead of truncating the read when the
      requested read operation would go beyond the end of the filesystem.
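
      A sketch of the fixed bounds check in romfs_dev_read(), with names as
      in fs/romfs/storage.c:

      	size_t limit = romfs_maxsize(sb);

      	/* reject instead of silently truncating the read */
      	if (pos >= limit || buflen > limit - pos)
      		return -EIO;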
      
      Fixes: da4458bd ("NOMMU: Make it possible for RomFS to use MTD devices directly")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200818013202.2246365-1-jannh@google.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      89346bc3
    • Josef Bacik's avatar
      btrfs: sysfs: use NOFS for device creation · 58253e22
      Josef Bacik authored
      
      [ Upstream commit a47bd78d ]
      
      Dave hit this splat during testing btrfs/078:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.8.0-rc6-default+ #1191 Not tainted
        ------------------------------------------------------
        kswapd0/75 is trying to acquire lock:
        ffffa040e9d04ff8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
      
        but task is already holding lock:
        ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (fs_reclaim){+.+.}-{0:0}:
      	 __lock_acquire+0x56f/0xaa0
      	 lock_acquire+0xa3/0x440
      	 fs_reclaim_acquire.part.0+0x25/0x30
      	 __kmalloc_track_caller+0x49/0x330
      	 kstrdup+0x2e/0x60
      	 __kernfs_new_node.constprop.0+0x44/0x250
      	 kernfs_new_node+0x25/0x50
      	 kernfs_create_link+0x34/0xa0
      	 sysfs_do_create_link_sd+0x5e/0xd0
      	 btrfs_sysfs_add_devices_dir+0x65/0x100 [btrfs]
      	 btrfs_init_new_device+0x44c/0x12b0 [btrfs]
      	 btrfs_ioctl+0xc3c/0x25c0 [btrfs]
      	 ksys_ioctl+0x68/0xa0
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x50/0xe0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x56f/0xaa0
      	 lock_acquire+0xa3/0x440
      	 __mutex_lock+0xa0/0xaf0
      	 btrfs_chunk_alloc+0x137/0x3e0 [btrfs]
      	 find_free_extent+0xb44/0xfb0 [btrfs]
      	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
      	 btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
      	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
      	 __btrfs_cow_block+0x143/0x7a0 [btrfs]
      	 btrfs_cow_block+0x15f/0x310 [btrfs]
      	 push_leaf_right+0x150/0x240 [btrfs]
      	 split_leaf+0x3cd/0x6d0 [btrfs]
      	 btrfs_search_slot+0xd14/0xf70 [btrfs]
      	 btrfs_insert_empty_items+0x64/0xc0 [btrfs]
      	 __btrfs_commit_inode_delayed_items+0xb2/0x840 [btrfs]
      	 btrfs_async_run_delayed_root+0x10e/0x1d0 [btrfs]
      	 btrfs_work_helper+0x2f9/0x650 [btrfs]
      	 process_one_work+0x22c/0x600
      	 worker_thread+0x50/0x3b0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 check_prev_add+0x98/0xa20
      	 validate_chain+0xa8c/0x2a00
      	 __lock_acquire+0x56f/0xaa0
      	 lock_acquire+0xa3/0x440
      	 __mutex_lock+0xa0/0xaf0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
      	 btrfs_evict_inode+0x3bf/0x560 [btrfs]
      	 evict+0xd6/0x1c0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x54/0x80
      	 super_cache_scan+0x121/0x1a0
      	 do_shrink_slab+0x175/0x420
      	 shrink_slab+0xb1/0x2e0
      	 shrink_node+0x192/0x600
      	 balance_pgdat+0x31f/0x750
      	 kswapd+0x206/0x510
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(&fs_info->chunk_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/75:
         #0: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff8b0b50b8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
         #2: ffffa040e057c0e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 2 PID: 75 Comm: kswapd0 Not tainted 5.8.0-rc6-default+ #1191
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         dump_stack+0x78/0xa0
         check_noncircular+0x16f/0x190
         check_prev_add+0x98/0xa20
         validate_chain+0xa8c/0x2a00
         __lock_acquire+0x56f/0xaa0
         lock_acquire+0xa3/0x440
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         __mutex_lock+0xa0/0xaf0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         ? __lock_acquire+0x56f/0xaa0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         ? lock_acquire+0xa3/0x440
         ? btrfs_evict_inode+0x138/0x560 [btrfs]
         ? btrfs_evict_inode+0x2fe/0x560 [btrfs]
         ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
         btrfs_evict_inode+0x3bf/0x560 [btrfs]
         evict+0xd6/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x121/0x1a0
         do_shrink_slab+0x175/0x420
         shrink_slab+0xb1/0x2e0
         shrink_node+0x192/0x600
         balance_pgdat+0x31f/0x750
         kswapd+0x206/0x510
         ? _raw_spin_unlock_irqrestore+0x3e/0x50
         ? finish_wait+0x90/0x90
         ? balance_pgdat+0x750/0x750
         kthread+0x137/0x150
         ? kthread_stop+0x2a0/0x2a0
         ret_from_fork+0x1f/0x30
      
      This is because we're holding the chunk_mutex while adding this device
      and adding its sysfs entries.  We actually hold different locks in
      different places when calling this function, for instance the dev_replace
      semaphore during dev replace, so instead of moving this call around,
      simply wrap its operations in NOFS.
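
      A sketch of the NOFS scoping, using the memalloc_nofs_save()/restore()
      API; devices_kobj and disk_kobj are illustrative names, not the exact
      ones in the btrfs sysfs code:

      	unsigned int nofs_flag;
      	int ret;

      	/*
      	 * Callers may hold locks that reclaim could recurse on (e.g. the
      	 * chunk_mutex), so forbid fs reclaim for every allocation made
      	 * while creating the sysfs entries.
      	 */
      	nofs_flag = memalloc_nofs_save();
      	ret = sysfs_create_link(devices_kobj, disk_kobj, disk_kobj->name);
      	memalloc_nofs_restore(nofs_flag);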
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      58253e22
    • Qu Wenruo's avatar
      btrfs: inode: fix NULL pointer dereference if inode doesn't need compression · e0b8bbf2
      Qu Wenruo authored
      
      [ Upstream commit 1e6e238c ]
      
      [BUG]
      There is a bug report of a NULL pointer dereference caused in
      compress_file_range():
      
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
        NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs]
        LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs]
        Call Trace:
        [c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable)
        [c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs]
        [c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs]
        [c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0
        [c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660
        [c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0
        [c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c
        ---[ end trace f16954aa20d822f6 ]---
      
      [CAUSE]
      For the following execution route of compress_file_range(), it's
      possible to hit NULL pointer dereference:
      
       compress_file_range()
       |- pages = NULL;
       |- start = async_chunk->start = 0;
       |- end = async_chunk->end = 4095;
       |- nr_pages = 1;
       |- inode_need_compress() == false; <<< Possible, see later explanation
       |  Now, we have nr_pages = 1, pages = NULL
       |- cont:
       |- 		ret = cow_file_range_inline();
       |- 		if (ret <= 0) {
       |-		for (i = 0; i < nr_pages; i++) {
       |-			WARN_ON(pages[i]->mapping);	<<< Crash
      
      To enter above call execution branch, we need the following race:
      
          Thread 1 (chattr)     |            Thread 2 (writeback)
      --------------------------+------------------------------
                                | btrfs_run_delalloc_range
                                | |- inode_need_compress = true
                                | |- cow_file_range_async()
      btrfs_ioctl_set_flag()    |
      |- binode_flags |=        |
         BTRFS_INODE_NOCOMPRESS |
                                | compress_file_range()
                                | |- inode_need_compress = false
                                | |- nr_pages = 1 while pages = NULL
                                | |  Then hit the crash
      
      [FIX]
      This patch fixes it by checking @pages before accessing it.
      It is only intended as a hot fix that is easy to backport.

      A more elegant fix may make btrfs check inode_need_compress() only once
      to avoid such a race, but that would be another story.
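
      A sketch of the guard described in [FIX]:

      	/* pages may still be NULL if inode_need_compress() flipped
      	 * to false after cow_file_range_async() saw it as true */
      	if (pages) {
      		for (i = 0; i < nr_pages; i++) {
      			WARN_ON(pages[i]->mapping);
      			put_page(pages[i]);
      		}
      	}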
      
      Reported-by: Luciano Chavez <chavez@us.ibm.com>
      Fixes: 4d3a800e ("btrfs: merge nr_pages input and output parameter in compress_pages")
      CC: stable@vger.kernel.org # 4.14.x: cecc8d90: btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e0b8bbf2
    • Nikolay Borisov's avatar
      btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range · c05c73db
      Nikolay Borisov authored
      
      [ Upstream commit cecc8d90 ]
      
      This label is only executed if compress_file_range() fails to create an
      inline extent. So move its code into the semantically related inline
      extent handling branch. No functional changes.
      
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c05c73db
    • Josef Bacik's avatar
      btrfs: don't show full path of bind mounts in subvol= · 82ba99cf
      Josef Bacik authored
      
      [ Upstream commit 3ef3959b ]
      
      Chris Murphy reported a problem where rpm ostree will bind mount a bunch
      of things for whatever voodoo it's doing.  But when it does this,
      /proc/mounts shows something like
      
        /dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
        /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo/bar 0 0
      
      Despite subvolid=256 being subvol=/foo.  This is because we're just
      spitting out the dentry of the mount point, which in the case of bind
      mounts is the source path for the mountpoint.  Instead we should spit
      out the path to the actual subvol.  Fix this by looking up the name for
      the subvolid we have mounted.  With this fix the same test looks like
      this
      
        /dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
        /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo 0 0
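
      A sketch of the show-options side, using the
      btrfs_get_subvol_name_from_objectid() helper exported by the
      "btrfs: export helpers for subvolume name/id resolution" patch below
      (error handling abbreviated):

      	seq_puts(seq, ",subvol=");
      	subvol_name = btrfs_get_subvol_name_from_objectid(info,
      			BTRFS_I(d_inode(dentry))->root->root_key.objectid);
      	if (!IS_ERR(subvol_name)) {
      		/* print the real subvolume path, not the mount dentry */
      		seq_escape(seq, subvol_name, " \t\n\\");
      		kfree(subvol_name);
      	}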
      
      Reported-by: Chris Murphy <chris@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      82ba99cf
    • Marcos Paulo de Souza's avatar
      btrfs: export helpers for subvolume name/id resolution · 84774515
      Marcos Paulo de Souza authored
      
      [ Upstream commit c0c907a4 ]
      
      The functions will be used outside of export.c and super.c to allow
      resolving subvolume name from a given id, eg. for subvolume deletion by
      id ioctl.
      
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ split from the next patch ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      84774515
    • Michael Ellerman's avatar
      powerpc: Allow 4224 bytes of stack expansion for the signal frame · 8af91971
      Michael Ellerman authored
      
      [ Upstream commit 63dee5df ]
      
      We have powerpc specific logic in our page fault handling to decide if
      an access to an unmapped address below the stack pointer should expand
      the stack VMA.
      
      The code was originally added in 2004 "ported from 2.4". The rough
      logic is that the stack is allowed to grow to 1MB with no extra
      checking. Over 1MB the access must be within 2048 bytes of the stack
      pointer, or be from a user instruction that updates the stack pointer.
      
      The 2048 byte allowance below the stack pointer is there to cover the
      288 byte "red zone" as well as the "about 1.5kB" needed by the signal
      delivery code.
      
      Unfortunately since then the signal frame has expanded, and is now
      4224 bytes on 64-bit kernels with transactional memory enabled. This
      means if a process has consumed more than 1MB of stack, and its stack
      pointer lies less than 4224 bytes from the next page boundary, signal
      delivery will fault when trying to expand the stack and the process
      will see a SEGV.
      
      The total size of the signal frame is the size of struct rt_sigframe
      (which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on
      64-bit).
      
      The 2048 byte allowance was correct until 2008 as the signal frame
      was:
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1440 */
              /* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
              long unsigned int          _unused[2];           /*  1440    16 */
              unsigned int               tramp[6];             /*  1456    24 */
              struct siginfo *           pinfo;                /*  1480     8 */
              void *                     puc;                  /*  1488     8 */
              struct siginfo     info;                         /*  1496   128 */
              /* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
              char                       abigap[288];          /*  1624   288 */
      
              /* size: 1920, cachelines: 15, members: 7 */
              /* padding: 8 */
      };
      
      1920 + 128 = 2048
      
      Then in commit ce48b210 ("powerpc: Add VSX context save/restore,
      ptrace and signal support") (Jul 2008) the signal frame expanded to
      2304 bytes:
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */	<--
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              long unsigned int          _unused[2];           /*  1696    16 */
              unsigned int               tramp[6];             /*  1712    24 */
              struct siginfo *           pinfo;                /*  1736     8 */
              void *                     puc;                  /*  1744     8 */
              struct siginfo     info;                         /*  1752   128 */
              /* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
              char                       abigap[288];          /*  1880   288 */
      
              /* size: 2176, cachelines: 17, members: 7 */
              /* padding: 8 */
      };
      
      2176 + 128 = 2304
      
      At this point we should have been exposed to the bug, though as far as
      I know it was never reported. I no longer have a system old enough to
      easily test on.
      
      Then in 2010 commit 320b2b8d ("mm: keep a guard page below a
      grow-down stack segment") caused our stack expansion code to never
      trigger, as there was always a VMA found for a write up to PAGE_SIZE
      below r1.
      
      That meant the bug was hidden as we continued to expand the signal
      frame in commit 2b0a576d ("powerpc: Add new transactional memory
      state to the signal context") (Feb 2013):
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              struct ucontext    uc_transact;                  /*  1696  1696 */	<--
              /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
              long unsigned int          _unused[2];           /*  3392    16 */
              unsigned int               tramp[6];             /*  3408    24 */
              struct siginfo *           pinfo;                /*  3432     8 */
              void *                     puc;                  /*  3440     8 */
              struct siginfo     info;                         /*  3448   128 */
              /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
              char                       abigap[288];          /*  3576   288 */
      
              /* size: 3872, cachelines: 31, members: 8 */
              /* padding: 8 */
              /* last cacheline: 32 bytes */
      };
      
      3872 + 128 = 4000
      
      And commit 573ebfa6 ("powerpc: Increase stack redzone for 64-bit
      userspace to 512 bytes") (Feb 2014):
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              struct ucontext    uc_transact;                  /*  1696  1696 */
              /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
              long unsigned int          _unused[2];           /*  3392    16 */
              unsigned int               tramp[6];             /*  3408    24 */
              struct siginfo *           pinfo;                /*  3432     8 */
              void *                     puc;                  /*  3440     8 */
              struct siginfo     info;                         /*  3448   128 */
              /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
              char                       abigap[512];          /*  3576   512 */	<--
      
              /* size: 4096, cachelines: 32, members: 8 */
              /* padding: 8 */
      };
      
      4096 + 128 = 4224
      
      Then finally in 2017, commit 1be7107f ("mm: larger stack guard
      gap, between vmas") exposed us to the existing bug, because it changed
      the stack VMA to be the correct/real size, meaning our stack expansion
      code is now triggered.
      
      Fix it by increasing the allowance to 4224 bytes.
      
      Hard-coding 4224 is obviously unsafe against future expansions of the
      signal frame in the same way as the existing code. We can't easily use
      sizeof() because the signal frame structure is not in a header. We
      will either fix that, or rip out all the custom stack expansion
      checking logic entirely.
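
      A sketch of the resulting check (variable names follow
      arch/powerpc/mm/fault.c; the surrounding expansion logic is unchanged):

      	/*
      	 * 4224 = 4096 byte rt_sigframe (which already contains the
      	 * 512 byte red zone) + 128 bytes of __SIGNAL_FRAMESIZE.
      	 */
      	if (address + 4224 >= uregs->gpr[1])
      		return false;	/* within the allowance: expand the stack */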
      
      Fixes: ce48b210 ("powerpc: Add VSX context save/restore, ptrace and signal support")
      Cc: stable@vger.kernel.org # v2.6.27+
      Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
      Tested-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200724092528.1578671-2-mpe@ellerman.id.au
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8af91971
    • Christophe Leroy's avatar
      powerpc/mm: Only read faulting instruction when necessary in do_page_fault() · 55069b03
      Christophe Leroy authored
      
      [ Upstream commit 0e36b0d1 ]
      
      Commit a7a9dcd8 ("powerpc: Avoid taking a data miss on every
      userspace instruction miss") has shown that limiting the read of the
      faulting instruction to likely cases improves performance.

      This patch goes further in this direction by limiting the read of the
      faulting instruction to the only cases where it is likely needed.
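
      A rough sketch of the idea (not the exact upstream diff): the faulting
      instruction is fetched from user memory only on the one path that needs
      it, the validation of a large stack expansion:

      	if (unlikely(address + 0x100000 < vma->vm_end) &&
      	    address + 2048 < regs->gpr[1]) {
      		unsigned int inst;

      		/* only now read the instruction that caused the fault */
      		if (__get_user(inst, (unsigned int __user *)regs->nip) ||
      		    !store_updates_sp(inst))
      			return true;	/* bad stack expansion */
      	}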
      
      On an MPC885, with the same benchmark app as in the commit referred
      above, we see a reduction of about 3900 dTLB misses (approx 3%):
      
      Before the patch:
       Performance counter stats for './fault 500' (10 runs):
      
               683033312      cpu-cycles                                                    ( +-  0.03% )
                  134538      dTLB-load-misses                                              ( +-  0.03% )
                   46099      iTLB-load-misses                                              ( +-  0.02% )
                   19681      faults                                                        ( +-  0.02% )
      
             5.389747878 seconds time elapsed                                          ( +-  0.06% )
      
      With the patch:
      
       Performance counter stats for './fault 500' (10 runs):
      
               682112862      cpu-cycles                                                    ( +-  0.03% )
                  130619      dTLB-load-misses                                              ( +-  0.03% )
                   46073      iTLB-load-misses                                              ( +-  0.05% )
                   19681      faults                                                        ( +-  0.01% )
      
             5.381342641 seconds time elapsed                                          ( +-  0.07% )
      
      Correct operation of the huge stack expansion was tested with the
      following app:
      
      #include <stdio.h>
      #include <stdlib.h>

      int main(int argc, char **argv)
      {
      	/* more than 1MB of stack, forcing a huge stack expansion */
      	char buf[1024 * 1025];

      	sprintf(buf, "Hello world !\n");
      	printf(buf);

      	exit(0);
      }
      
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Add include of pagemap.h to fix build errors]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      55069b03
    • Hugh Dickins's avatar
      khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter() · d45ce5e4
      Hugh Dickins authored
      
      [ Upstream commit f3f99d63 ]
      
      syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in
      __khugepaged_enter(): yes, when one thread is about to dump core, has set
      core_state, and is waiting for others, another might do something calling
      __khugepaged_enter(), which now crashes because I lumped the core_state
      test (known as "mmget_still_valid") into khugepaged_test_exit().  I still
      think it's best to lump them together, so just in this exceptional case,
      check mm->mm_users directly instead of khugepaged_test_exit().
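
      A sketch of the adjusted assertion in __khugepaged_enter():

      	/*
      	 * khugepaged_test_exit() now also covers the core-dump case
      	 * (mmget_still_valid), which may legitimately be true here, so
      	 * assert only on the plain mm_users check.
      	 */
      	VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);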
      
      Fixes: bbe98f9c ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008141503370.18085@eggly.anvils
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d45ce5e4
    • Hugh Dickins's avatar
      khugepaged: khugepaged_test_exit() check mmget_still_valid() · 0b383dae
      Hugh Dickins authored
      
      [ Upstream commit bbe98f9c ]
      
      Move collapse_huge_page()'s mmget_still_valid() check into
      khugepaged_test_exit() itself.  collapse_huge_page() is used for anon THP
      only, and earned its mmget_still_valid() check because it inserts a huge
      pmd entry in place of the page table's pmd entry; whereas
      collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp()
      merely clears the page table's pmd entry.  But core dumping without mmap
      lock must have been as open to mistaking a racily cleared pmd entry for a
      page table at physical page 0, as exit_mmap() was.  And we certainly have
      no interest in mapping as a THP once dumping core.
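
      A sketch of the combined check:

      static bool khugepaged_test_exit(struct mm_struct *mm)
      {
      	/* mm is exiting, or a core dump has invalidated its mappings */
      	return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
      }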
      
      Fixes: 59ea6d06 ("coredump: fix race condition between collapse_huge_page() and core dumping")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0b383dae
    • Masami Hiramatsu's avatar
      perf probe: Fix memory leakage when the probe point is not found · 61c4decf
      Masami Hiramatsu authored
      
      [ Upstream commit 12d572e7 ]
      
      Fix the memory leakage in debuginfo__find_trace_events() when the probe
      point is not found in the debuginfo. If there is no probe point found in
      the debuginfo, debuginfo__find_probes() will NOT return -ENOENT, but 0.
      
      Thus the caller of debuginfo__find_probes() must check the tf.ntevs and
      release the allocated memory for the array of struct probe_trace_event.
      
      The current code releases the memory only if debuginfo__find_probes()
      hits an error, but does not check tf.ntevs. As a result, the memory
      allocated for *tevs is not released if tf.ntevs == 0.
      
      This fixes the memory leakage by checking tf.ntevs == 0 in addition to
      ret < 0.
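
      A sketch of the fixed cleanup in debuginfo__find_trace_events():

      	ret = debuginfo__find_probes(dbg, &tf.pf);
      	/* release the allocated tevs also when nothing was found */
      	if (ret < 0 || tf.ntevs == 0) {
      		for (i = 0; i < tf.ntevs; i++)
      			clear_probe_trace_event(&tf.tevs[i]);
      		zfree(tevs);
      		tf.ntevs = 0;
      	}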
      
      Fixes: ff741783 ("perf probe: Introduce debuginfo to encapsulate dwarf information")
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@vger.kernel.org
      Link: http://lore.kernel.org/lkml/159438668346.62703.10887420400718492503.stgit@devnote2
      
      
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      61c4decf
    • Chris Wilson's avatar
      drm/vgem: Replace opencoded version of drm_gem_dumb_map_offset() · 5a9ed4e6
      Chris Wilson authored
      
      [ Upstream commit 119c53d2 ]
      
      drm_gem_dumb_map_offset() now exists and does everything
      vgem_gem_dumb_map() does and *ought* to do.
      
      In particular, vgem_gem_dumb_map() was trying to reject mmapping an
      imported dmabuf by checking the existence of obj->filp. Unfortunately,
      we always allocated an obj->filp, even if unused for an imported dmabuf.
      Instead, the drm_gem_dumb_map_offset(), since commit 90378e58
      ("drm/gem: drm_gem_dumb_map_offset(): reject dma-buf"), uses the
      obj->import_attach to reject such invalid mmaps.
      
      This prevents vgem from allowing userspace mmapping the dumb handle and
      attempting to incorrectly fault in remote pages belonging to another
      device, where there may not even be a struct page.
      
      v2: Use the default drm_gem_dumb_map_offset() callback
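
      For reference, the rejection in the core helper that vgem now relies
      on, as introduced by commit 90378e58 in drm_gem_dumb_map_offset():

      	/* Don't allow imported objects to be mapped */
      	if (obj->import_attach)
      		return -EINVAL;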
      
      Fixes: af33a919 ("drm/vgem: Enable dmabuf import interfaces")
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: <stable@vger.kernel.org> # v4.13+
      Link: https://patchwork.freedesktop.org/patch/msgid/20200708154911.21236-1-chris@chris-wilson.co.uk
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5a9ed4e6