  1. May 19, 2021
    • kyber: fix out of bounds access when preempted · 54dbe2d2
      Omar Sandoval authored
      
      [ Upstream commit efed9a33 ]
      
      __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
      passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
      for the current CPU again and uses that to get the corresponding Kyber
      context in the passed hctx. However, the thread may be preempted between
      the two calls to blk_mq_get_ctx(), and the ctx returned the second time
      may no longer correspond to the passed hctx. This "works" accidentally
      most of the time, but it can cause us to read garbage if the second ctx
      came from an hctx with more ctx's than the first one (i.e., if
      ctx->index_hw[hctx->type] > hctx->nr_ctx).
      
      This manifested as this UBSAN array index out of bounds error reported
      by Jakub:
      
      UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
      index 13106 is out of range for type 'long unsigned int [128]'
      Call Trace:
       dump_stack+0xa4/0xe5
       ubsan_epilogue+0x5/0x40
       __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
       queued_spin_lock_slowpath+0x476/0x480
       do_raw_spin_lock+0x1c2/0x1d0
       kyber_bio_merge+0x112/0x180
       blk_mq_submit_bio+0x1f5/0x1100
       submit_bio_noacct+0x7b0/0x870
       submit_bio+0xc2/0x3a0
       btrfs_map_bio+0x4f0/0x9d0
       btrfs_submit_data_bio+0x24e/0x310
       submit_one_bio+0x7f/0xb0
       submit_extent_page+0xc4/0x440
       __extent_writepage_io+0x2b8/0x5e0
       __extent_writepage+0x28d/0x6e0
       extent_write_cache_pages+0x4d7/0x7a0
       extent_writepages+0xa2/0x110
       do_writepages+0x8f/0x180
       __writeback_single_inode+0x99/0x7f0
       writeback_sb_inodes+0x34e/0x790
       __writeback_inodes_wb+0x9e/0x120
       wb_writeback+0x4d2/0x660
       wb_workfn+0x64d/0xa10
       process_one_work+0x53a/0xa80
       worker_thread+0x69/0x5b0
       kthread+0x20b/0x240
       ret_from_fork+0x1f/0x30
      
      Only Kyber uses the hctx, so fix it by passing the request_queue to
      ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
      map the queues itself to avoid the mismatch.
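      For illustration, here is a small user-space C sketch of the bug and of the
      fixed pattern. It is not kernel code and not the actual fix: the struct
      layout, the queue map, and the numbers are invented. It shows why pairing a
      stale hctx with a freshly looked-up ctx can index out of bounds, while
      deriving the hctx from the same ctx lookup (which is what Kyber does once it
      maps the queues itself) keeps the pair consistent.

      #include <assert.h>
      #include <stdio.h>

      /*
       * Hypothetical miniature of the blk-mq queue map: each "CPU" has a
       * software context (ctx), and each hardware context (hctx) owns a
       * group of ctxs. index_hw is the ctx's position within its own hctx,
       * so it is only meaningful when paired with that hctx.
       */
      struct ctx  { int index_hw; int hctx_id; };
      struct hctx { int nr_ctx; };

      static struct hctx hctxs[2] = { { .nr_ctx = 1 }, { .nr_ctx = 3 } };
      static struct ctx ctxs[4] = {
          { .index_hw = 0, .hctx_id = 0 },   /* CPU 0 -> hctx 0 */
          { .index_hw = 0, .hctx_id = 1 },   /* CPU 1 -> hctx 1 */
          { .index_hw = 1, .hctx_id = 1 },   /* CPU 2 -> hctx 1 */
          { .index_hw = 2, .hctx_id = 1 },   /* CPU 3 -> hctx 1 */
      };

      int main(void)
      {
          /*
           * Buggy pattern: the hctx was chosen for CPU 0, then the thread
           * was "preempted" and migrated to CPU 3 before the ctx was read
           * a second time.
           */
          struct hctx *h = &hctxs[ctxs[0].hctx_id];
          struct ctx *c = &ctxs[3];
          printf("stale pair: index_hw=%d vs nr_ctx=%d -> %s\n",
                 c->index_hw, h->nr_ctx,
                 c->index_hw < h->nr_ctx ? "in bounds" : "OUT OF BOUNDS");

          /*
           * Fixed pattern: read the ctx once and derive the hctx from it,
           * so the per-hctx index always matches the hctx actually used.
           */
          c = &ctxs[3];
          h = &hctxs[c->hctx_id];
          assert(c->index_hw < h->nr_ctx);
          printf("consistent pair: index_hw=%d vs nr_ctx=%d -> in bounds\n",
                 c->index_hw, h->nr_ctx);
          return 0;
      }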
      
      Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. Jun 20, 2019
    • block: remove the bi_phys_segments field in struct bio · 14ccb66b
      Christoph Hellwig authored
      
      We only need the number of segments in the blk-mq submission path.
      Remove the field from struct bio, and return it from a variant of
      blk_queue_split instead, so that it can be passed as an argument to
      those functions that need the value.
      
      This also means we stop recounting segments except for cloning
      and partial segments.
      
      To keep the number of arguments in this hot path down, remove the
      pointless struct request_queue argument from any of the functions
      that had it and grew a nr_segs argument.
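      A minimal sketch of that pattern in plain C, with a hypothetical bio type
      and a made-up segment rule rather than the real block-layer API: the split
      step reports the segment count to its caller, and the value then travels as
      a function argument instead of living in struct bio.

      #include <stdio.h>

      struct bio { unsigned int size; };   /* hypothetical stand-in */

      /*
       * Variant of the split step that returns the segment count to the
       * caller instead of stashing it in a struct bio field.
       */
      static struct bio *split_bio(struct bio *bio, unsigned int *nr_segs)
      {
          *nr_segs = (bio->size + 4095) / 4096;   /* made-up segment rule */
          return bio;                             /* no actual split here */
      }

      /* Downstream consumers simply take nr_segs as an argument. */
      static void submit_bio_checked(struct bio *bio, unsigned int nr_segs)
      {
          printf("submitting bio of %u bytes in %u segment(s)\n",
                 bio->size, nr_segs);
      }

      int main(void)
      {
          struct bio b = { .size = 12288 };
          unsigned int nr_segs;

          split_bio(&b, &nr_segs);
          submit_bio_checked(&b, nr_segs);
          return 0;
      }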
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. Sep 27, 2018
    • kyber: add tracepoints · 6c3b7af1
      Omar Sandoval authored
      
      When debugging Kyber, it's really useful to know what latencies we've
      been having, how the domain depths have been adjusted, and if we've
      actually been throttling. Add three tracepoints, kyber_latency,
      kyber_adjust, and kyber_throttled, to record that.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • kyber: implement improved heuristics · 6e25cb01
      Omar Sandoval authored
      
      Kyber's current heuristics have a few flaws:
      
      - It's based on the mean latency, but p99 latency tends to be more
        meaningful to anyone who cares about latency. The mean can also be
        skewed by rare outliers that the scheduler can't do anything about.
      - The statistics calculations are purely time-based with a short window.
        This works for steady, high load, but is more sensitive to outliers
        with bursty workloads.
      - It only considers the latency once an I/O has been submitted to the
        device, but the user cares about the time spent in the kernel, as
        well.
      
      These are shortcomings of the generic blk-stat code which doesn't quite
      fit the ideal use case for Kyber. So, this replaces the statistics with
      a histogram used to calculate percentiles of total latency and I/O
      latency, which we then use to adjust depths in a slightly more
      intelligent manner:
      
      - Sync and async writes are now the same domain.
      - Discards are a separate domain.
      - Domain queue depths are scaled by the ratio of the p99 total latency
        to the target latency (e.g., if the p99 latency is double the target
        latency, we will double the queue depth; if the p99 latency is half of
        the target latency, we can halve the queue depth); see the sketch
        below.
      - We use the I/O latency to determine whether we should scale queue
        depths down: we will only scale down if any domain's I/O latency
        exceeds the target latency, which is an indicator of congestion in the
        device.
      
      These new heuristics are just as scalable as the heuristics they
      replace.
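
      As a minimal user-space sketch of the depth-scaling rule described above
      (the bucket layout, sample counts, and clamping below are invented; this is
      not Kyber's histogram code): find the bucket that holds the 99th percentile
      of total latency, then scale the domain depth linearly by the ratio of that
      latency to the target.

      #include <stdio.h>

      #define NR_BUCKETS 8

      /* Upper bound of each latency bucket, in microseconds (made-up layout). */
      static const unsigned int bucket_us[NR_BUCKETS] = {
          250, 500, 1000, 2000, 4000, 8000, 16000, 32000
      };

      /* Upper bound of the bucket that contains the requested percentile. */
      static unsigned int percentile_us(const unsigned int counts[NR_BUCKETS],
                                        unsigned int percentile)
      {
          unsigned long total = 0, running = 0;
          int i;

          for (i = 0; i < NR_BUCKETS; i++)
              total += counts[i];
          for (i = 0; i < NR_BUCKETS; i++) {
              running += counts[i];
              if (running * 100 >= total * percentile)
                  return bucket_us[i];
          }
          return bucket_us[NR_BUCKETS - 1];
      }

      int main(void)
      {
          /* Sample histogram of total (queueing + device) latencies. */
          unsigned int counts[NR_BUCKETS] = { 50, 200, 400, 250, 60, 25, 10, 5 };
          unsigned int target_us = 2000, depth = 64, max_depth = 256;
          unsigned int p99 = percentile_us(counts, 99);

          /* Scale the depth by the p99:target ratio, clamped to [1, max]. */
          depth = depth * p99 / target_us;
          if (depth < 1)
              depth = 1;
          if (depth > max_depth)
              depth = max_depth;

          printf("p99 = %u us, target = %u us, new depth = %u\n",
                 p99, target_us, depth);
          return 0;
      }

      The last bullet above sits on top of this as a gate: depths are only scaled
      down when some domain's device-side I/O latency exceeds the target.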
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • kyber: don't make domain token sbitmap larger than necessary · fa2a1f60
      Omar Sandoval authored
      
      The domain token sbitmaps are currently initialized to the device queue
      depth or 256, whichever is larger, and immediately resized to the
      maximum depth for that domain (256, 128, or 64 for read, write, and
      other, respectively). The sbitmap is never resized larger than that, so
      it's unnecessary to allocate a bitmap larger than the maximum depth.
      Let's just allocate it to the maximum depth to begin with. This will use
      marginally less memory, and more importantly, give us a more appropriate
      number of bits per sbitmap word.
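
      To see the "bits per sbitmap word" point with a toy model (the sizing rule
      below is a simplified stand-in, not sbitmap's actual code): the bits-per-word
      choice is made from the depth at allocation time, so allocating for a large
      device queue depth and then resizing down to a small domain depth packs all
      of the usable bits into one word, while allocating at the domain's maximum
      depth directly spreads them across several words.

      #include <stdio.h>

      /*
       * Toy rule for picking bits per word from the allocation depth: use
       * full 64-bit words for large depths, but spread small depths over at
       * least four words so a single word isn't a contention hot spot.
       */
      static unsigned int bits_per_word(unsigned int depth)
      {
          unsigned int bits = 64;

          while (bits > 1 && bits * 4 > depth)
              bits /= 2;
          return bits;
      }

      static void show(unsigned int alloc_depth, unsigned int used_depth)
      {
          unsigned int bits = bits_per_word(alloc_depth);
          unsigned int words = (used_depth + bits - 1) / bits;

          printf("allocated for %4u, used at %2u: %2u bits/word, %u word(s)\n",
                 alloc_depth, used_depth, bits, words);
      }

      int main(void)
      {
          show(1024, 64);   /* sized for the device queue depth, resized down */
          show(64, 64);     /* sized for the domain's maximum depth directly  */
          return 0;
      }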
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: move call of scheduler's ->completed_request() hook · ed88660a
      Omar Sandoval authored
      
      Commit 4bc6339a ("block: move blk_stat_add() to
      __blk_mq_end_request()") consolidated some calls using ktime_get() so
      we'd only need to call it once. Kyber's ->completed_request() hook also
      calls ktime_get(), so let's move it to the same place, too.
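
      The consolidation pattern, sketched in user-space C with hypothetical hook
      names: the completion time is read once and handed to every hook that wants
      it, rather than each hook calling the clock itself.

      #include <stdint.h>
      #include <stdio.h>
      #include <time.h>

      static uint64_t now_ns(void)
      {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
      }

      /*
       * Hypothetical per-request accounting hooks; both take the completion
       * timestamp as a parameter instead of reading the clock themselves.
       */
      static void stat_add(uint64_t now, uint64_t start)
      {
          printf("stats sample: %llu ns\n", (unsigned long long)(now - start));
      }

      static void sched_completed_request(uint64_t now, uint64_t start)
      {
          printf("sched sample: %llu ns\n", (unsigned long long)(now - start));
      }

      int main(void)
      {
          uint64_t start = now_ns(), now;

          /* ... the request would be processed here ... */

          now = now_ns();   /* single clock read at completion time */
          stat_add(now, start);
          sched_completed_request(now, start);
          return 0;
      }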
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. May 30, 2018
    • block: kyber: make kyber more friendly with merging · a6088845
      Jianchao Wang authored
      
      Currently, kyber is very unfriendly with merging. kyber depends on the
      ctx rq_list to do merging; however, most of the time it will not leave
      any requests in the ctx rq_list. This is because even if the tokens of
      one domain are used up, kyber will try to dispatch requests from the
      other domains and flush the rq_list there.
      
      To improve this, we set up a kyber_ctx_queue (kcq) which is similar to
      ctx, but has rq_lists for the different domains, and we build the same
      mapping between kcq and khd as between ctx and hctx. Then we can merge,
      insert and dispatch for the different domains separately. At the same
      time, the rq_list of a kcq is only flushed when the domain token is
      obtained successfully. So if one domain's tokens are used up, the
      requests can be left in that domain's rq_list and maybe merged with
      following I/O.
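
      A compact user-space sketch of the idea; the domain names, token counts,
      and request counts are invented. Each kcq keeps a pending count per domain,
      and a domain's list is only drained when a token for that domain is
      available, so a token-starved domain holds on to its requests where later
      bios could still merge with them.

      #include <stdio.h>

      enum domain { DOM_READ, DOM_WRITE, NR_DOMAINS };

      struct kcq_sim {
          unsigned int pending[NR_DOMAINS];   /* requests left on the rq_list */
      };

      static unsigned int tokens[NR_DOMAINS] = { 2, 0 };   /* writes starved */

      /*
       * Drain a domain's list only while we can actually get a domain token;
       * otherwise leave the requests in place so later bios can merge.
       */
      static void dispatch_domain(struct kcq_sim *kcq, enum domain d)
      {
          while (kcq->pending[d] && tokens[d]) {
              tokens[d]--;
              kcq->pending[d]--;
              printf("dispatched one request from domain %d\n", d);
          }
          if (kcq->pending[d])
              printf("domain %d: %u request(s) held back for merging\n",
                     d, kcq->pending[d]);
      }

      int main(void)
      {
          struct kcq_sim kcq = { .pending = { 3, 4 } };
          int d;

          for (d = 0; d < NR_DOMAINS; d++)
              dispatch_domain(&kcq, d);
          return 0;
      }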
      
      Following are my test results on a machine with 8 cores and an NVMe
      card (INTEL SSDPEKKR128G7):
      
      fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
      seq/random
      +------+----------+-----------+------------+-----------------+---------+
      |patch?| bw(MB/s) |   iops    | slat(usec) |    clat(usec)   |  merge  |
      +------+----------+-----------+------------+-----------------+---------+
      | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
      +------+----------+-----------+------------+-----------------+---------+
      | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
      +------+----------+-----------+------------+-----------------+---------+
      When numjobs is set to 16, the bw and iops could reach 1662MB/s and
      425k on my platform.
      
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. May 09, 2018
    • block: get rid of struct blk_issue_stat · 544ccc8d
      Omar Sandoval authored
      
      struct blk_issue_stat squashes three things into one u64:
      
      - The time the driver started working on a request
      - The original size of the request (for the io.low controller)
      - Flags for writeback throttling
      
      It turns out that on x86_64, we have a 4 byte hole in struct request
      which we can fill with the non-timestamp fields from blk_issue_stat,
      simplifying things quite a bit.
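
      The "fill the hole" part can be shown with a self-contained C program; the
      field names are illustrative, not the real struct request layout. On a
      typical LP64 ABI, a 4-byte member followed by an 8-byte-aligned u64 leaves
      4 bytes of padding that a new 32-bit field can occupy without growing the
      struct at all.

      #include <stddef.h>
      #include <stdint.h>
      #include <stdio.h>

      struct req_before {
          void *queue;             /* 8 bytes */
          uint32_t tag;            /* 4 bytes, then a 4-byte hole ...         */
          uint64_t start_time_ns;  /* ... because this needs 8-byte alignment */
      };

      struct req_after {
          void *queue;
          uint32_t tag;
          uint32_t io_issue_size;  /* non-timestamp field slotted into the hole */
          uint64_t start_time_ns;
      };

      int main(void)
      {
          size_t hole = offsetof(struct req_before, start_time_ns) -
                        (offsetof(struct req_before, tag) + sizeof(uint32_t));

          printf("before: size %zu, hole of %zu bytes after 'tag'\n",
                 sizeof(struct req_before), hole);
          printf("after:  size %zu (the new field fits in the hole)\n",
                 sizeof(struct req_after));
          return 0;
      }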
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. Dec 06, 2017
    • kyber: fix another domain token wait queue hang · fcf38cdf
      Omar Sandoval authored
      
      Commit 8cf46660 ("kyber: fix hang on domain token wait queue") fixed
      a hang caused by leaving wait entries on the domain token wait queue
      after the __sbitmap_queue_get() retry succeeded, making that wait entry
      a "dud" which won't in turn wake more entries up. However, we can also
      get a dud entry if kyber_get_domain_token() fails once but is then
      called again and succeeds. This can happen if the hardware queue is
      rerun for some other reason, or, more likely, kyber_dispatch_request()
      tries the same domain twice.
      
      The fix is to remove our entry from the wait queue whenever we
      successfully get a token. The only complication is that we might be on
      one of many wait queues in the struct sbitmap_queue, but that's easily
      fixed by remembering which wait queue we were put on.
      
      While we're here, only initialize the wait queue entry once instead of
      on every wait, and use spin_lock_irq() instead of spin_lock_irqsave(),
      since this is always called from process context with irqs enabled.
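
      The shape of the fix, sketched in user-space C with a toy intrusive list
      rather than the kernel's wait-queue and sbitmap_queue APIs: the waiter
      remembers which wait list it was added to, and whenever the token grab
      succeeds it unhooks itself from that list, so no "dud" entry is left behind.

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      /* Toy doubly linked wait list standing in for a wait queue. */
      struct waiter;
      struct wait_list { struct waiter *head; };
      struct waiter {
          struct waiter *next, *prev;
          struct wait_list *on;   /* remember which list we were put on */
      };

      static void wait_add(struct wait_list *wl, struct waiter *w)
      {
          w->next = wl->head;
          w->prev = NULL;
          if (wl->head)
              wl->head->prev = w;
          wl->head = w;
          w->on = wl;
      }

      static void wait_del(struct waiter *w)
      {
          if (!w->on)
              return;             /* never added, or already removed */
          if (w->prev)
              w->prev->next = w->next;
          else
              w->on->head = w->next;
          if (w->next)
              w->next->prev = w->prev;
          w->on = NULL;
      }

      static int free_tokens;     /* starts at 0: the domain is starved */

      static bool try_get_token(void)
      {
          if (free_tokens > 0) {
              free_tokens--;
              return true;
          }
          return false;
      }

      int main(void)
      {
          struct wait_list wl = { NULL };
          struct waiter w = { NULL, NULL, NULL };

          /* First attempt fails, so our entry goes on the wait list. */
          if (!try_get_token())
              wait_add(&wl, &w);

          /* A token is freed elsewhere before we actually go to sleep. */
          free_tokens = 1;

          /* The retry succeeds: remove our entry from whichever list it is on. */
          if (try_get_token())
              wait_del(&w);

          printf("token held, still on a wait list: %s\n",
                 w.on ? "yes (dud!)" : "no");
          return 0;
      }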
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. Oct 17, 2017
    • kyber: fix hang on domain token wait queue · 8cf46660
      Omar Sandoval authored
      
      When we're getting a domain token, if we fail to get a token on our
      first attempt, we put the current hardware queue on a wait queue and
      then try again just in case a token was freed after our initial attempt
      but before we got on the wait queue. If this second attempt succeeds, we
      currently leave the hardware queue on the wait queue. Usually this is
      okay; we'll just run the hardware queue one extra time when another
      token is freed. However, if the hardware queue doesn't have any other
      requests waiting, then when it gets the extra wakeup, it won't have
      anything to free and therefore won't wake up any other hardware queues.
      If tokens are limited, then we won't make forward progress and the
      device will hang.
      
      Reported-by: Bin Zha <zhabin.zb@alibaba-inc.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. Jun 20, 2017
    • sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar authored
      
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there are a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar authored
      
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. Apr 14, 2017
    • blk-mq: introduce Kyber multiqueue I/O scheduler · 00e04393
      Omar Sandoval authored
      
      The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
      scale to multiple queues. Users configure only two knobs, the target
      read and synchronous write latencies, and the scheduler tunes itself to
      achieve that latency goal.
      
      The implementation is based on "tokens", built on top of the scalable
      bitmap library. Tokens serve as a mechanism for limiting requests. There
      are two tiers of tokens: queueing tokens and dispatch tokens.
      
      A queueing token is required to allocate a request. In fact, these
      tokens are actually the blk-mq internal scheduler tags, but the
      scheduler manages the allocation directly in order to implement its
      policy.
      
      Dispatch tokens are device-wide and split up into two scheduling
      domains: reads vs. writes. Each hardware queue dispatches batches
      round-robin between the scheduling domains as long as tokens are
      available for that domain.
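
      A rough user-space sketch of the dispatch-token idea; the domains, token
      counts, and queue contents are invented. Each pass round-robins over the
      scheduling domains and dispatches from a domain only while that domain
      still has device-wide dispatch tokens (which the real scheduler hands back
      on completion).

      #include <stdio.h>

      enum domain { DOM_READ, DOM_WRITE, NR_DOMAINS };

      static const char *name[NR_DOMAINS]    = { "read", "write" };
      static unsigned int tokens[NR_DOMAINS] = { 4, 2 };  /* device-wide budget */
      static unsigned int queued[NR_DOMAINS] = { 5, 5 };  /* requests waiting   */

      int main(void)
      {
          unsigned int progress = 1;
          int d;

          /* Round-robin between the domains while any can make progress. */
          while (progress) {
              progress = 0;
              for (d = 0; d < NR_DOMAINS; d++) {
                  if (queued[d] && tokens[d]) {
                      tokens[d]--;   /* returned on completion in reality */
                      queued[d]--;
                      progress = 1;
                      printf("dispatch one %s request\n", name[d]);
                  }
              }
          }

          for (d = 0; d < NR_DOMAINS; d++)
              printf("%s: %u request(s) still queued, out of tokens\n",
                     name[d], queued[d]);
          return 0;
      }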
      
      These tokens can be used as the mechanism to enable various policies.
      The policy Kyber uses is inspired by active queue management techniques
      for network routing, similar to blk-wbt. The scheduler monitors
      latencies and scales the number of dispatch tokens accordingly. Queueing
      tokens are used to prevent starvation of synchronous requests by
      asynchronous requests.
      
      Various extensions are possible, including better heuristics and ionice
      support. The new scheduler isn't set as the default yet.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>