  1. Sep 15, 2021
  2. Apr 16, 2021
    • block: only update parent bi_status when bio fail · 1d2310d9
      Yufen Yu authored
      
      [ Upstream commit 3edf5346 ]
      
      For multiple split bios, if one of the bios fails, the whole I/O
      should return an error to the application. But we found a race
      between bio_integrity_verify_fn and bio completion that returns
      I/O success to the application after one of the bios has failed.
      The race is as follows:
      
      split bio(READ)          kworker
      
      nvme_complete_rq
      blk_update_request //split error=0
        bio_endio
          bio_integrity_endio
            queue_work(kintegrityd_wq, &bip->bip_work);
      
                               bio_integrity_verify_fn
                               bio_endio //split bio
                                __bio_chain_endio
                                   if (!parent->bi_status)
      
                                     <interrupt entry>
                                     nvme_irq
                                       blk_update_request //parent error=7
                                       req_bio_endio
                                          bio->bi_status = 7 //parent bio
                                     <interrupt exit>
      
                                     parent->bi_status = 0
                              parent->bi_end_io() // return bi_status=0
      
      The bio has been split in two: split and parent. When the split
      bio completes, it depends on a kworker to do endio, and
      bio_integrity_verify_fn is interrupted by the parent bio's
      completion irq handler. The parent bio->bi_status set in the
      irq handler is then overwritten by the kworker.
      
      In fact, even without the above race, we also need to consider
      concurrency between multiple split bios completing and updating
      the same parent bi_status. Normally, multiple split bios are
      issued to the same hctx and complete from the same irq
      vector. But if the queue map is updated between multiple split
      bios, these bios may complete on different hw queues and different
      irq vectors. The concurrent updates to the parent bi_status may
      then leave the final status wrong.
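
The interleaving in the diagram can be replayed in a single thread to show why the commit title's rule (only update the parent when the bio failed) is enough to keep the error. This is a minimal userspace sketch, not the kernel code: `fake_bio`, `chain_endio_old`, `chain_endio_new` and `replay_race` are hypothetical names, and the real race involves an interrupt rather than sequential statements.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct bio: only the status field matters here. */
struct fake_bio {
    int status;               /* 0 = success, non-zero = error code */
    struct fake_bio *parent;  /* chained parent, NULL if none */
};

/* Pre-fix behaviour: copy the child status whenever the parent looked clean
 * at the time it was checked, so a stale check can overwrite an error. */
static void chain_endio_old(struct fake_bio *child, int parent_status_seen)
{
    if (!parent_status_seen)
        child->parent->status = child->status;
}

/* Post-fix behaviour: a successful child never writes to the parent. */
static void chain_endio_new(struct fake_bio *child)
{
    if (child->status && !child->parent->status)
        child->parent->status = child->status;
}

/* Replay the diagram: the kworker checks the parent, the irq handler then
 * sets error 7 on the parent, and finally the split bio's endio continues. */
static int replay_race(int use_fix)
{
    struct fake_bio parent = { 0, NULL };
    struct fake_bio split = { 0, &parent };
    int seen = parent.status;   /* kworker reads parent status == 0 */

    parent.status = 7;          /* "interrupt": the parent bio fails */

    if (use_fix)
        chain_endio_new(&split);
    else
        chain_endio_old(&split, seen);
    return parent.status;
}
```

With these assumptions, `replay_race(0)` ends with the parent status reset to 0, while `replay_race(1)` keeps the error 7 that the interrupt handler stored.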
      
      Suggested-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20210331115359.1125679-1-yuyufen@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. Oct 28, 2020
  4. Oct 15, 2020
  5. Oct 05, 2020
  6. Sep 09, 2020
  7. Aug 18, 2020
  8. Jul 31, 2020
  9. Jul 01, 2020
  10. Jun 29, 2020
  11. Jun 24, 2020
  12. Jun 05, 2020
  13. May 27, 2020
  14. May 19, 2020
  15. May 14, 2020
    • block: Inline encryption support for blk-mq · a892c8d5
      Satya Tangirala authored
      
      We must have some way of letting a storage device driver know what
      encryption context it should use for en/decrypting a request. However,
      it's the upper layers (like the filesystem/fscrypt) that know about and
      manage encryption contexts. As such, when the upper layer submits a bio
      to the block layer, and this bio eventually reaches a device driver with
      support for inline encryption, the device driver will need to have been
      told the encryption context for that bio.
      
      We want to communicate the encryption context from the upper layer to the
      storage device along with the bio, when the bio is submitted to the block
      layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
      represent an encryption context (note that we can't use the bi_private
      field in struct bio to do this because that field does not function to pass
      information across layers in the storage stack). We also introduce various
      functions to manipulate the bio_crypt_ctx and make the bio/request merging
      logic aware of the bio_crypt_ctx.
      
      We also make changes to blk-mq to make it handle bios with encryption
      contexts. blk-mq can merge many bios into the same request. These bios need
      to have contiguous data unit numbers (the necessary changes to blk-merge
      are also made to ensure this) - as such, it suffices to keep the data unit
      number of just the first bio, since that's all a storage driver needs to
      infer the data unit number to use for each data block in each bio in a
      request. blk-mq keeps track of the encryption context to be used for all
      the bios in a request with the request's rq_crypt_ctx. When the first bio
      is added to an empty request, blk-mq will program the encryption context
      of that bio into the request_queue's keyslot manager, and store the
      returned keyslot in the request's rq_crypt_ctx. All the functions to
      operate on encryption contexts are in blk-crypto.c.
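
The reasoning about contiguous data unit numbers can be illustrated with a small userspace sketch. It assumes a data unit number (DUN) that fits in a single 64-bit integer and a payload that is a multiple of the data unit size; `dun_bio`, `dun_at` and `dun_contiguous` are hypothetical names used only for illustration, not the blk-crypto API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified bio: only what the DUN arithmetic needs. */
struct dun_bio {
    uint64_t first_dun;   /* data unit number of the first data unit */
    uint64_t nbytes;      /* payload size, a multiple of the data unit size */
};

/* DUN of the data unit that starts `offset` bytes into the request.
 * This is why keeping only the first bio's DUN is sufficient. */
static uint64_t dun_at(uint64_t first_dun, uint64_t offset,
                       uint32_t data_unit_size)
{
    return first_dun + offset / data_unit_size;
}

/* A bio may follow `prev` in one request only if its DUN continues
 * exactly where `prev` ends. */
static bool dun_contiguous(const struct dun_bio *prev,
                           const struct dun_bio *next,
                           uint32_t data_unit_size)
{
    return next->first_dun ==
           dun_at(prev->first_dun, prev->nbytes, data_unit_size);
}
```

For example, with 4096-byte data units, a bio starting at DUN 100 and covering 8192 bytes can be followed by a bio starting at DUN 102, but not by one starting at DUN 105.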
      
      Upper layers only need to call bio_crypt_set_ctx with the encryption key,
      algorithm and data_unit_num; they don't have to worry about getting a
      keyslot for each encryption context, as blk-mq/blk-crypto handles that.
      Blk-crypto also makes it possible for request-based layered devices like
      dm-rq to make use of inline encryption hardware by cloning the
      rq_crypt_ctx and programming a keyslot in the new request_queue when
      necessary.
      
      Note that any user of the block layer can submit bios with an
      encryption context, such as filesystems, device-mapper targets, etc.
      
      Signed-off-by: Satya Tangirala <satyat@google.com>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. May 13, 2020
  17. Mar 27, 2020
  18. Mar 25, 2020
    • block: move guard_bio_eod to bio.c · 29125ed6
      Christoph Hellwig authored
      
      This is bio layer functionality and not related to buffer heads.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block/diskstats: replace time_in_queue with sum of request times · 8cd5b8fc
      Konstantin Khlebnikov authored
      
      Column "time_in_queue" in diskstats is supposed to show the total waiting
      time of all requests, i.e. its value should equal the sum of the times in
      the other columns. But this is not the case, because "time_in_queue" is
      counted separately in jiffies rather than in nanoseconds like the other
      times.

      This patch removes the redundant counter for "time_in_queue" and shows
      the total time of read, write, discard and flush requests instead.
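
A minimal sketch of the derived value, assuming per-type totals kept in nanoseconds and a report in milliseconds. The struct and function names are hypothetical and only illustrate the arithmetic.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_MSEC 1000000ULL

/* Hypothetical per-disk totals, one per request type, in nanoseconds. */
struct sketch_stats {
    uint64_t read_ns;
    uint64_t write_ns;
    uint64_t discard_ns;
    uint64_t flush_ns;
};

/* time_in_queue in milliseconds, derived from the per-type totals
 * instead of being counted in a separate jiffies-based counter. */
static uint64_t time_in_queue_ms(const struct sketch_stats *s)
{
    uint64_t total_ns = s->read_ns + s->write_ns +
                        s->discard_ns + s->flush_ns;

    return total_ns / SKETCH_NSEC_PER_MSEC;
}
```

Because the sum is taken over the same nanosecond counters that feed the other columns, the derived column agrees with them up to rounding.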
      
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block/diskstats: more accurate approximation of io_ticks for slow disks · 2b8bd423
      Konstantin Khlebnikov authored
      
      Currently io_ticks is approximated by adding one at each start and end of
      a request if the jiffies counter has changed. This works perfectly for
      requests shorter than a jiffy, or if a request starts or ends at each jiffy.

      If the disk executes just one request at a time and each takes longer than
      two jiffies, then only the first and last jiffies are accounted.

      The fix is simple: at the end of a request, add to io_ticks the jiffies
      passed since the last update rather than just one jiffy.

      Example: a common HDD executes random 4k read requests in around 12ms.
      
      fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
      iostat -x 10 sdb
      
      Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
      
      Before:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
      
      After:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
      
      Now io_ticks does not lose time between the start and end of requests, but
      for queue-depth > 1 some I/O time between adjacent starts might be lost.

      For load estimation "%util" is not as useful as average queue length,
      but it clearly shows how often the disk queue is completely empty.
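
The accounting change described in this message can be sketched in userspace as follows. This is an illustration under simplifying assumptions: `sketch_part` and both helpers are hypothetical names, jiffies are passed in as plain integers, and any atomicity the kernel needs for concurrent updates is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-partition accounting state. */
struct sketch_part {
    unsigned long stamp;     /* jiffy of the last io_ticks update */
    unsigned long io_ticks;  /* approximate busy time in jiffies */
};

/* Old approximation: at most one jiffy per start or end event. */
static void update_io_ticks_old(struct sketch_part *p, unsigned long now)
{
    if (p->stamp != now) {
        p->stamp = now;
        p->io_ticks += 1;
    }
}

/* New approximation: at request end, add every jiffy that has passed
 * since the last update; at request start, still add just one. */
static void update_io_ticks_new(struct sketch_part *p, unsigned long now,
                                bool end)
{
    if (p->stamp != now) {
        p->io_ticks += end ? now - p->stamp : 1;
        p->stamp = now;
    }
}
```

For a single request that starts at jiffy 100 and ends at jiffy 112, the old helper accounts 2 jiffies and the new one 13, which mirrors the direction of the "%util" change shown in the iostat output.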
      
      Fixes: 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. Mar 24, 2020
  20. Mar 18, 2020
  21. Jan 09, 2020
    • fs: move guard_bio_eod() after bio_set_op_attrs · 83c9c547
      Ming Lei authored
      
      Commit 85a8ce62 ("block: add bio_truncate to fix guard_bio_eod")
      adds bio_truncate() for handling bio EOD. However, bio_truncate()
      doesn't use the passed 'op' parameter from guard_bio_eod's callers.
      
      So bio_truncate() may retrieve the wrong 'op', and zeroing pages may
      not be done for a READ bio.

      Fix this issue by moving guard_bio_eod() after bio_set_op_attrs()
      in submit_bh_wbc() so that bio_truncate() can always retrieve the
      correct op info.

      Meanwhile, remove the 'op' parameter from guard_bio_eod() because it
      isn't used any more.
      
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: linux-fsdevel@vger.kernel.org
      Fixes: 85a8ce62 ("block: add bio_truncate to fix guard_bio_eod")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>

      Fold in kerneldoc and bio_op() change.

      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  22. Dec 28, 2019
    • block: add bio_truncate to fix guard_bio_eod · 85a8ce62
      Ming Lei authored
      
      Some filesystems, such as vfat, may send a bio which crosses the device
      boundary, and worse, an IO request starting within the device boundary
      can contain more than one segment past EOD.

      Commit dce30ca9 ("fs: fix guard_bio_eod to check for real EOD errors")
      tries to fix this issue by returning -EIO for this situation. However,
      this way the fs user code loses the chance to handle -EIO, and
      sync_inodes_sb() may then hang forever.

      Also, the current truncation of the last segment is dangerous because it
      updates the last bvec: the bvec table is no longer immutable, and fs bio
      users may not be able to retrieve the truncated pages via
      bio_for_each_segment_all() in their .end_io callback.

      Fix this issue by supporting multi-segment truncation. The approach is
      also simpler:

      - just update the bio size, since the block layer can build the correct
      bvec from the updated bio size. The bvec table then becomes really
      immutable.

      - zero all truncated segments for a read bio
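
The two steps listed above can be sketched in userspace as follows. This is an illustration only: `trunc_seg`, `trunc_bio` and `sketch_truncate` are hypothetical names, segments are plain byte buffers rather than pages, and the segment table is deliberately left untouched to mirror the "immutable bvec table" point.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one bio segment. */
struct trunc_seg {
    unsigned char *data;
    size_t len;
};

/* Hypothetical stand-in for a bio: a size plus a fixed segment table. */
struct trunc_bio {
    size_t size;                  /* bytes the bio covers */
    const struct trunc_seg *segs; /* segment table, never modified */
    size_t nsegs;
};

/* Shrink the bio to new_size; for reads, zero every byte past the new end,
 * across as many segments as needed. */
static void sketch_truncate(struct trunc_bio *bio, size_t new_size,
                            bool is_read)
{
    size_t offset = 0;

    if (new_size >= bio->size)
        return;

    if (is_read) {
        for (size_t i = 0; i < bio->nsegs; i++) {
            const struct trunc_seg *seg = &bio->segs[i];
            size_t end = offset + seg->len;

            if (end > new_size) {
                /* Zero from the truncation point, or the whole segment
                 * if it lies entirely past the new end. */
                size_t from = new_size > offset ? new_size - offset : 0;

                memset(seg->data + from, 0, seg->len - from);
            }
            offset = end;
        }
    }

    /* Only the size changes; the segment table stays as it was. */
    bio->size = new_size;
}
```

Truncating a three-segment, 12-byte read to 6 bytes leaves the first segment intact, zeroes the second half of the middle segment, and zeroes the last segment completely.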
      
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: linux-fsdevel@vger.kernel.org
      Fixed-by: dce30ca9 ("fs: fix guard_bio_eod to check for real EOD errors")
      Reported-by: <syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  23. Dec 10, 2019
  24. Dec 05, 2019
    • block: fix memleak of bio integrity data · ece841ab
      Justin Tee authored
      
      7c20f116 ("bio-integrity: stop abusing bi_end_io") moves
      bio_integrity_free from bio_uninit() to bio_integrity_verify_fn()
      and bio_endio(). This looks wrong because a bio may be freed
      without calling bio_endio(); for example, blk_rq_unprep_clone() is
      called from dm_mq_queue_rq() when the underlying queue of dm-mpath
      is busy.

      So commit 7c20f116 causes a memory leak of bio integrity data.

      Fix this issue by re-adding bio_integrity_free() to bio_uninit().
      
      Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Justin Tee <justin.tee@broadcom.com>

      Add commit log, and simplify/fix the original patch written by Justin.

      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  25. Nov 12, 2019
    • block: check bi_size overflow before merge · e3a5d8e3
      Junichi Nomura authored
      
      __bio_try_merge_page() may merge a page into a bio without a bio_full()
      check and cause bi_size to overflow.

      The overflow typically ends up with sd_init_command() warning on a
      zero-segment request, with a call trace like this:
      
          ------------[ cut here ]------------
          WARNING: CPU: 2 PID: 1986 at drivers/scsi/scsi_lib.c:1025 scsi_init_io+0x156/0x180
          CPU: 2 PID: 1986 Comm: kworker/2:1H Kdump: loaded Not tainted 5.4.0-rc7 #1
          Workqueue: kblockd blk_mq_run_work_fn
          RIP: 0010:scsi_init_io+0x156/0x180
          RSP: 0018:ffffa11487663bf0 EFLAGS: 00010246
          RAX: 00000000002be0a0 RBX: ffff8e6e9ff30118 RCX: 0000000000000000
          RDX: 00000000ffffffe1 RSI: 0000000000000000 RDI: ffff8e6e9ff30118
          RBP: ffffa11487663c18 R08: ffffa11487663d28 R09: ffff8e6e9ff30150
          R10: 0000000000000001 R11: 0000000000000000 R12: ffff8e6e9ff30000
          R13: 0000000000000001 R14: ffff8e74a1cf1800 R15: ffff8e6e9ff30000
          FS:  0000000000000000(0000) GS:ffff8e6ea7680000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fff18cf0fe8 CR3: 0000000659f0a001 CR4: 00000000001606e0
          Call Trace:
           sd_init_command+0x326/0xb40 [sd_mod]
           scsi_queue_rq+0x502/0xaa0
           ? blk_mq_get_driver_tag+0xe7/0x120
           blk_mq_dispatch_rq_list+0x256/0x5a0
           ? elv_rb_del+0x24/0x30
           ? deadline_remove_request+0x7b/0xc0
           blk_mq_do_dispatch_sched+0xa3/0x140
           blk_mq_sched_dispatch_requests+0xfb/0x170
           __blk_mq_run_hw_queue+0x81/0x130
           blk_mq_run_work_fn+0x1b/0x20
           process_one_work+0x179/0x390
           worker_thread+0x4f/0x3e0
           kthread+0x105/0x140
           ? max_active_store+0x80/0x80
           ? kthread_bind+0x20/0x20
           ret_from_fork+0x35/0x40
          ---[ end trace f9036abf5af4a4d3 ]---
          blk_update_request: I/O error, dev sdd, sector 2875552 op 0x1:(WRITE) flags 0x0 phys_seg 0 prio class 0
          XFS (sdd1): writeback error on sector 2875552
      
      __bio_try_merge_page() should check for the overflow before actually
      doing the merge.
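
A check of this kind can be written without ever computing the wrapped sum. The sketch below assumes an unsigned int size counter; `merge_bio`, `size_would_overflow` and `try_merge` are hypothetical names for illustration, not the kernel functions.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a bio; only the byte counter matters here. */
struct merge_bio {
    unsigned int size;
};

/* True if adding len bytes would wrap the unsigned size counter.
 * The comparison is rearranged so that it cannot overflow itself. */
static bool size_would_overflow(const struct merge_bio *bio, unsigned int len)
{
    return bio->size > UINT_MAX - len;
}

/* Merge only after the overflow check has passed. */
static bool try_merge(struct merge_bio *bio, unsigned int len)
{
    if (size_would_overflow(bio, len))
        return false;
    bio->size += len;
    return true;
}
```

Writing the test as `size > UINT_MAX - len` rather than `size + len < size` keeps the intent explicit; both forms are well defined for unsigned arithmetic in C.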
      
      Fixes: 07173c3e ("block: enable multipage bvecs")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  26. Aug 22, 2019
  27. Aug 14, 2019
  28. Aug 06, 2019