2019-kvm-forum/notes-04-fast-zero

Case study: Adding fast zero to NBD
[About 5-6 mins]

- based heavily on https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html

* Heading: Case study baseline
- 4000- shell to pre-create source file
- baseline is about 8.5s

As Rich mentioned, qemu-img convert is a great tool for copying guest
images, with support for NBD on both source and destination.  However,
most guest images are sparse, and we want to avoid naively reading
lots of zeroes on the source then writing lots of zeroes on the
destination.  Here's a case study of our last three years of
optimizing that, starting with a baseline of straight-line copying,
which matches qemu 2.7 behavior (Sep 2016).  Let's convert a 100M
image, which alternates between data and holes at each megabyte.
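
[sketch] The slide uses shell to pre-create the source file; that
script is not reproduced in these notes.  As a rough stand-in, here is
a minimal Python sketch of the same layout.  The function name, the
fill byte, and the choice to start with a data megabyte are my own
assumptions, not taken from the demo.

```python
import os
import tempfile

MiB = 1024 * 1024


def make_alternating_image(path, size_mib=100, data_first=True):
    """Create a file that alternates 1 MiB of data with 1 MiB of hole.

    Hypothetical helper: the real demo uses a shell script instead.
    """
    with open(path, "wb") as f:
        for i in range(size_mib):
            is_data = (i % 2 == 0) == data_first
            if is_data:
                # Seeking past unwritten regions leaves them as holes
                # on filesystems that support sparse files.
                f.seek(i * MiB)
                f.write(b"\x01" * MiB)  # any non-zero pattern would do
        # Make sure a trailing hole still counts toward the file size.
        f.truncate(size_mib * MiB)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "source.img")
        make_alternating_image(path, size_mib=4)  # use 100 for the talk's layout
        print(os.path.getsize(path) // MiB, "MiB")
```

Whether the skipped megabytes really end up as holes depends on the
filesystem; the file contents read back the same either way.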

The ./convert command shown here is rather long; if you're interested
in its origins, my patch submission in Aug 2019 goes into more
detail.  But for now, just think of it as a fancy way to run
'qemu-img convert' against a server where I can tweak server behavior
to control which zeroing-related features are advertised or
implemented.

* Heading: Writing zeroes: much ado about nothing
- 4200- .term
  - ./convert zeromode=plugin fastzeromode=none for server A
  - ./convert zeromode=emulate fastzeromode=none for server B

In qemu 2.8 (Dec 2016), we implemented the NBD extension of
WRITE_ZEROES, with the initial goal of reducing network traffic (no
need to send an explicit payload of all zero bytes over the network).
However, the spec was intentionally loose on implementation, with two
common scenarios.  With server A, the act of writing zeroes is heavily
optimized - a simple constant-time metadata notation and the operation
is done regardless of the size of the zero request, and we see an
immediate benefit in execution time, even though the number of I/O
transactions did not drop.  With server B, writing zeroes populates a
buffer in the server, then goes through the same path as a normal
WRITE command, for no real difference from the baseline.  But can we
do better?
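
[sketch] To make the network saving concrete, here is a small Python
sketch that builds the two requests and compares their size on the
wire.  The header layout and constants are the classic NBD request
format as I understand it from the protocol document; the helper name
nbd_request is mine, and none of this is code from qemu or nbdkit.

```python
import struct

# Constants as I read them in the NBD protocol document (verify before reuse).
NBD_REQUEST_MAGIC = 0x25609513
NBD_CMD_WRITE = 1
NBD_CMD_WRITE_ZEROES = 6

MiB = 1024 * 1024


def nbd_request(cmd, offset, length, flags=0, cookie=0, payload=b""):
    """Build one request: 28-byte big-endian header plus optional payload."""
    header = struct.pack(">IHHQQI", NBD_REQUEST_MAGIC, flags, cmd,
                         cookie, offset, length)
    return header + payload


# Zeroing 1 MiB with a plain WRITE has to ship the zero bytes...
as_write = nbd_request(NBD_CMD_WRITE, 0, MiB, payload=bytes(MiB))
# ...while WRITE_ZEROES only names the range.
as_zeroes = nbd_request(NBD_CMD_WRITE_ZEROES, 0, MiB)

print(len(as_write), len(as_zeroes))  # 1048604 28
```

This only shows the traffic side; what the server does with the
request (server A versus server B) is invisible at this level.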

* Heading: What's the status?
- 4300- slide: why qemu-img convert wants at most one block status

Do we even have to worry whether WRITE_ZEROES will be fast or slow?
If we know that the destination already contains all zeroes, we could
entirely skip destination I/O for each hole in the source.  qemu 2.12
(Apr 2018) added support for NBD_CMD_BLOCK_STATUS to quickly learn
whether a portion of a disk is a hole.  But experiments with qemu-img
convert showed that using BLOCK_STATUS as a way to avoid WRITE_ZEROES
didn't really help, for a couple of reasons.  If writing zeroes is
fast, checking the destination first is either a mere tradeoff in
commands (BLOCK_STATUS replacing WRITE_ZEROES when the destination is
already zero) or a pessimization (BLOCK_STATUS still has to be
followed by WRITE_ZEROES).  And if writing zeroes is slow, we get a
speedup when BLOCK_STATUS itself is fast on pre-existing destination
holes, but we encountered situations such as tmpfs, which has a
linear rather than constant-time lseek(SEEK_HOLE) implementation,
where we ended up with quadratic behavior all due to BLOCK_STATUS
calls.  Thus, for now, qemu-img convert does not use BLOCK_STATUS,
and as mentioned earlier, I used the noextents filter in my test case
to ensure BLOCK_STATUS is not interfering with timing results.
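
[sketch] A toy Python model of why one status query per extent turns
quadratic when each lookup is linear.  The cost unit ("steps") and the
assumption that a linear lookup walks past every earlier extent are
simplifications of mine, not measurements of tmpfs.

```python
def block_status_cost(num_extents, linear_seek):
    """Total steps spent on status queries, one query per extent."""
    total = 0
    for i in range(num_extents):
        if linear_seek:
            # Toy assumption: the lookup walks past all earlier extents.
            total += i + 1
        else:
            # Constant-time lookup: one step per query.
            total += 1
    return total


for n in (100, 1000):
    print(n, block_status_cost(n, False), block_status_cost(n, True))
# 100 100 5050
# 1000 1000 500500
```

Ten times as many extents costs ten times as much with constant-time
lookups, but roughly a hundred times as much with linear ones.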

* Heading: Pre-zeroing: a tale of two servers
- 4400- .term
  - ./convert zeromode=plugin fastzeromode=ignore for server A
  - ./convert zeromode=emulate fastzeromode=ignore for server B

But do we really even need to use BLOCK_STATUS?  What if, instead, we
just guarantee that the destination image starts life with all zeroes?
After all, since WRITE_ZEROES has no network payload, we can just bulk
pre-zero the image, and then skip I/O for source holes without having
to do any further checks of destination status.  qemu 3.1 took this
approach, but quickly ran into a surprise.  For server A, we have a
speedup: fewer overall I/O transactions make us slightly faster than
one WRITE_ZEROES per hole.  But for server B, we actually have a
dramatic pessimization!  It turns out that when writing zeroes falls
back to a normal write path, pre-zeroing the image now forces twice
the I/O for any data portion of the image.
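
[sketch] The two-server surprise can be seen with simple counting.
This Python sketch tallies requests and MiB written to storage for the
100M test image; the function, the strategy names, and the idea that
server A's zeroing costs no storage writes at all are modeling
assumptions of mine, not numbers from the demo.

```python
def storage_cost(server, strategy, chunks, chunk_mib=1):
    """Return (requests sent, MiB written to storage) for one conversion.

    server:   "A" zeroes via metadata (modeled as free),
              "B" zeroes by writing a buffer of zero bytes.
    strategy: "per_hole" sends one WRITE_ZEROES per source hole,
              "prezero" zeroes the whole image once, then skips holes.
    chunks:   list of booleans, True for data, False for hole.
    """
    zero_cost = 0 if server == "A" else 1  # MiB stored per MiB zeroed
    data = sum(1 for c in chunks if c)
    holes = len(chunks) - data
    if strategy == "per_hole":
        requests = data + holes
        written = data * chunk_mib + holes * chunk_mib * zero_cost
    elif strategy == "prezero":
        requests = 1 + data
        written = len(chunks) * chunk_mib * zero_cost + data * chunk_mib
    else:
        raise ValueError("unknown strategy: %r" % (strategy,))
    return requests, written


image = [i % 2 == 0 for i in range(100)]  # alternate data and hole
for server in "AB":
    for strategy in ("per_hole", "prezero"):
        print(server, strategy, storage_cost(server, strategy, image))
# A per_hole (100, 50)
# A prezero (51, 50)
# B per_hole (100, 100)
# B prezero (51, 150)
```

Server A sends fewer requests for the same storage traffic, while
server B writes every data megabyte twice.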

* Heading: qemu's solution
- 4500- slide: graph of the scenarios

With the root cause for the pessimization understood, qemu folks
addressed the situation by adding a flag BDRV_REQ_NO_FALLBACK in qemu
4.0 (Apr 2019): when performing a pre-zeroing pass, we want zeroing to
be fast: if it cannot succeed quickly, then it must fail rather than
fall back to writes.  For server A, the pre-zero request succeeds, and
we've avoided all further hole I/O; while for server B, the pre-zero
request fails but we didn't lose any time to doubled-up I/O to data
segments.  This sounds fine on paper, but has one problem: it requires
server cooperation, and without that, the only sane default is to
assume that zeroes are not fast, so while we avoided hurting server B,
we ended up pessimizing server A back to one zero request per hole.
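
[sketch] The control flow described here, as a Python sketch.  This is
not qemu code: the Dest class, its fast_zero_known attribute, and the
no_fallback keyword are hypothetical stand-ins for a block driver and
for BDRV_REQ_NO_FALLBACK.

```python
import errno


class Dest:
    """Hypothetical destination that records the requests it receives."""

    def __init__(self, size, fast_zero_known):
        self.size = size
        self.fast_zero_known = fast_zero_known
        self.log = []

    def zero(self, offset, length, no_fallback=False):
        # With no_fallback set, fail fast unless a fast zero is guaranteed.
        if no_fallback and not self.fast_zero_known:
            raise OSError(errno.ENOTSUP, "cannot guarantee a fast zero")
        self.log.append(("zero", offset, length))

    def write(self, offset, length):
        self.log.append(("write", offset, length))


def convert(dest, chunks, chunk_size):
    """Copy a source described by chunks (True = data, False = hole)."""
    try:
        dest.zero(0, dest.size, no_fallback=True)  # pre-zero pass
        prezeroed = True
    except OSError as e:
        if e.errno != errno.ENOTSUP:
            raise
        prezeroed = False  # pre-zero refused quickly; nothing was written
    for i, is_data in enumerate(chunks):
        if is_data:
            dest.write(i * chunk_size, chunk_size)
        elif not prezeroed:
            dest.zero(i * chunk_size, chunk_size)  # one request per hole
    return prezeroed


layout = [True, False, True, False]
known = Dest(4, fast_zero_known=True)
unknown = Dest(4, fast_zero_known=False)
print(convert(known, layout, 1), known.log)
print(convert(unknown, layout, 1), unknown.log)
```

The second destination shows the problem in the last sentence above:
if nothing tells the client that zeroing is fast, even a server like A
lands in the per-hole path.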

* Heading: Protocol extension: time to pull a fast one
- 4600- .term with
  - ./convert zeromode=plugin fastzeromode=default for server A
  - ./convert zeromode=emulate fastzeromode=default for server B

So the solution is obvious: let's have nbdkit, as the server, perform
the necessary cooperation for qemu to request a fast zero.  The NBD
protocol added the extension NBD_CMD_FLAG_FAST_ZERO, taking care that
both server and client must support the extension before it can be
used, but that if it is not supported, we merely lose performance but
do not corrupt image contents.  qemu 4.2 will be the first release
that supports ideal performance for both server A and server B out of
the box.
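
[sketch] How a client might gate the new flag on what the server
advertised, in Python.  The numeric constants are as I recall them
from the NBD protocol document and should be checked against it; the
function names are hypothetical, and this is not the qemu or libnbd
implementation.

```python
import struct

# Values as I recall them from the NBD protocol document (verify before reuse).
NBD_REQUEST_MAGIC = 0x25609513
NBD_CMD_WRITE_ZEROES = 6
NBD_CMD_FLAG_FAST_ZERO = 1 << 4       # per-command flag
NBD_FLAG_SEND_WRITE_ZEROES = 1 << 6   # transmission flag from the server
NBD_FLAG_SEND_FAST_ZERO = 1 << 11     # transmission flag from the server
NBD_ENOTSUP = 95                      # reply when a fast zero is not possible


def zero_request(transmission_flags, offset, length, want_fast, cookie=0):
    """Build a WRITE_ZEROES request, or return None if fast zero is unusable."""
    if not transmission_flags & NBD_FLAG_SEND_WRITE_ZEROES:
        raise ValueError("server did not advertise WRITE_ZEROES")
    flags = 0
    if want_fast:
        if not transmission_flags & NBD_FLAG_SEND_FAST_ZERO:
            # No advertisement: assume zeroes are not fast and skip the
            # pre-zero pass. We lose performance, never data.
            return None
        flags |= NBD_CMD_FLAG_FAST_ZERO
    return struct.pack(">IHHQQI", NBD_REQUEST_MAGIC, flags,
                       NBD_CMD_WRITE_ZEROES, cookie, offset, length)


def fast_zero_refused(reply_error):
    """True when the server declined to zero quickly, so the caller falls back."""
    return reply_error == NBD_ENOTSUP


new_server = NBD_FLAG_SEND_WRITE_ZEROES | NBD_FLAG_SEND_FAST_ZERO
old_server = NBD_FLAG_SEND_WRITE_ZEROES
print(zero_request(new_server, 0, 100 * 1024 * 1024, want_fast=True).hex())
print(zero_request(old_server, 0, 100 * 1024 * 1024, want_fast=True))
```

Either outcome keeps the image correct: a None or an NBD_ENOTSUP reply
just sends the caller back to one zero request per hole.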

* Heading: Reference implementation
- 4700- nbdkit filters/plugins that were adjusted

The qemu implementation was quite trivial (map the new NBD flag to the
existing BDRV_REQ_NO_FALLBACK flag, in both client and server, due out
in qemu 4.2).  But to actually get the NBD extension into the
protocol, it's better to prove that the extension will be
interoperable with other NBD implementations.  So, the obvious second
implementation is libnbd for a client (adding a new flag to nbd_zero,
and support for mapping a new errno value, due out in 1.2), and nbdkit
for a server (adding a new .can_fast_zero callback for plugins and
filters, then methodically patching all in-tree files where it can be
reasonably implemented, due out in 1.16).  Among other things, the
nbdkit changes to the nozero filter added the parameters I used in my
demo for controlling whether to advertise and/or honor the semantics
of the new flag.

[if time:] Note that the file plugin was not touched in the initial
patches. This is because accurate support is harder than it looks:
both fallocate(FALLOC_FL_ZERO_RANGE) and ioctl(BLKZEROOUT) can trigger
fallbacks to slow writes, so we would need kernel support for new
interfaces that guarantee fast failure.

* segue: XXX
slide 4800-? Or just sentence leading into Rich's demos?

I just showed a case study of how nbdkit helped address a real-life
optimization issue.  Now let's see some of the more esoteric things
that the NBD protocol makes possible.