2019-kvm-forum/notes-04-fast-zero

   1 Case study: Adding fast zero to NBD
   2 [About 5-6 mins]
   3
   4 - based heavily on https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html
   5
   6 * Heading: Case study baseline
   7 - 4000- shell to pre-create source file
   8 - baseline is about 8.5s
   9
  10 As Rich mentioned, qemu-img convert is a great tool for copying guest
  11 images, with support for NBD on both source and destination.  However,
  12 most guest images are sparse, and we want to avoid naively reading
  13 lots of zeroes on the source then writing lots of zeroes on the
  14 destination.  Here's a case study of optimizing that, starting with a
  15 baseline of straight-line copying, which matches qemu 3.0 behavior.
  16 XXX Is that correct version? what date?
  17 Let's convert a 100M image, which alternates between data and holes at
  18 each megabyte.
  19
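A minimal sketch of how such a striped source file could be pre-created.
The demo itself used a shell snippet that is not reproduced in these
notes; the function name, the fill byte, and the choice of which
megabytes hold data (even-numbered ones here) are assumptions made only
for illustration.

```python
import os
import tempfile

MiB = 1024 * 1024

def make_striped_image(path, size_mib=100):
    """Create a sparse file where even-numbered MiB hold data and
    odd-numbered MiB are left as holes (assumed layout)."""
    with open(path, "wb") as f:
        for i in range(0, size_mib, 2):
            f.seek(i * MiB)             # skipping ahead leaves a hole behind
            f.write(b"\x01" * MiB)      # one data stripe
        f.truncate(size_mib * MiB)      # extend so the trailing hole is included
```

Calling make_striped_image("src.img") should give a 100M file whose
holes read back as zeroes; whether the holes occupy no disk space
depends on the filesystem.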
  20 * Heading: Dissecting that command
  21 - 41x0 - series of html pages to highlight aspects of the ./convert command
  22   - nbdkit, plugin server, --run command
  23   - delay filter
  24   - blocksize filter
  25   - stats, log filters
  26   - noextents filter
  27   - nozero filter
  28
  29 Okay, I went a bit fast on that ./convert command.  Looking closer, it
  30 is using nbdkit as a server with the memory plugin as the data sink,
  31 tied to a single invocation of qemu-img convert as the source over a
  32 Unix socket.  There are lots of nbdkit filters: we want to slow down
  33 the operation so that a small disk is still useful for performance
  34 testing, with the server defaulting to a zero operation that is
  35 faster than writes.  We want to demonstrate the fact that a single
  36 write zero operation (with no explicit network payload) can often
  37 cover a larger swath of the file in one operation than writes (which
  38 have a maximum payload per operation), by forcing writes to split
  39 smaller than the 1M striping of the source image.  We want to collect
  40 statistics, both of the overall time spent, and which operations the
  41 client attempted, and include a sleep to avoid an output race.  For
  42 this case study, we disable BLOCK_STATUS with noextents (more on why
  43 later).  And finally we use the nozero filter as our knob for what the
  44 server advertises to the client, as well as how it responds to various
  45 zero-related requests.
  46
  47 * Heading: Writing zeroes: much ado about nothing
  48 - 4200- .term
  49   - ./convert zeromode=plugin fastzeromode=none for server A
  50   - ./convert zeromode=emulate fastzeromode=none for server B
  51
  52 XXX - verify versions/dates
  53 In qemu 2.8 (Dec 2016), we implemented the NBD extension of
  54 WRITE_ZEROES, with the initial goal of reducing network traffic (no
  55 need to send an explicit payload of all zero bytes over the network).
  56 However, the spec was intentionally loose on implementation, with two
  57 common scenarios.  With server A, the act of writing zeroes is heavily
  58 optimized - a simple constant-time metadata notation and the operation
  59 is done regardless of the size of the request, and we see an immediate
  60 benefit in execution time, even though the number of I/O transactions
  61 did not drop.  With server B, writing zeroes populates a buffer in the
  62 server, then goes through the same path as a normal WRITE command, for
  63 no real difference from the baseline.  But can we do better?
  64
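The difference between the two servers can be sketched with a toy
model.  This is not nbdkit or qemu code: ToyServer and copy_striped are
hypothetical names, the model only counts bytes that reach the write
path, and it issues one request per megabyte rather than the smaller
writes the blocksize filter forces in the demo.

```python
MiB = 1024 * 1024

class ToyServer:
    """Hypothetical NBD server model that counts requests and written bytes."""

    def __init__(self, zero_is_fast):
        self.zero_is_fast = zero_is_fast
        self.requests = 0        # NBD requests handled
        self.bytes_written = 0   # bytes that went through the write path

    def write(self, offset, data):
        self.requests += 1
        self.bytes_written += len(data)

    def write_zeroes(self, offset, length):
        self.requests += 1
        if not self.zero_is_fast:
            # Server B: fill a buffer with zeroes and reuse the write path.
            self.bytes_written += length
        # Server A: a metadata-only update, nothing reaches the write path.

def copy_striped(server, size_mib=100):
    """Copy an image that alternates 1M of data with 1M of hole."""
    for i in range(size_mib):
        if i % 2 == 0:
            server.write(i * MiB, b"\x01" * MiB)
        else:
            server.write_zeroes(i * MiB, MiB)
```

In this model both servers handle 100 requests, but server A pushes
50M through its write path while server B pushes the full 100M, the
same as the baseline.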
  65 * Heading: What's the status?
  66 - 4300- slide: why qemu-img convert wants at most one block status
  67
  68 Do we even have to worry whether WRITE_ZEROES will be fast or slow?
  69 If we know that the destination already contains all zeroes, we could
  70 entirely skip destination I/O for each hole in the source.  qemu 2.12
  71 added support for NBD_CMD_BLOCK_STATUS to quickly learn whether a
  72 portion of a disk is a hole.  But experiments with qemu-img convert
  73 showed that using BLOCK_STATUS as a way to avoid WRITE_ZEROES didn't
  74 really help, for a couple of reasons.  If writing zeroes is fast,
  75 checking the destination first is either a mere tradeoff in commands
  76 (BLOCK_STATUS replacing WRITE_ZEROES when the destination is already
  77 zero) or a pessimization (BLOCK_STATUS still has to be followed by
  78 WRITE_ZEROES).  And if writing zeroes is slow, we get a speedup on holes
  79 when BLOCK_STATUS itself is fast on pre-existing destination holes,
  80 but we encountered situations such as tmpfs, which has a linear rather
  81 than constant-time lseek(SEEK_HOLE) implementation, where we ended up
  82 with quadratic behavior all due to BLOCK_STATUS calls.  Thus, for now,
  83 qemu-img convert does not use BLOCK_STATUS, and as mentioned earlier,
  84 I used the noextents filter in my test case to ensure BLOCK_STATUS is
  85 not interfering with timing results.
  86
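A small counting sketch of where the quadratic behavior comes from.
The cost model is an assumption for illustration only: a constant-time
lookup is charged one step, while a linear lookup is charged one step
per extent it has to scan before reaching the queried one.  It does not
measure any real filesystem.

```python
def status_walk_cost(n_extents, linear_lookup):
    """Count abstract steps to query every extent once, in order."""
    steps = 0
    for index in range(n_extents):
        # A linear lookup scans all extents up to and including this one;
        # a constant-time lookup costs a single step.
        steps += (index + 1) if linear_lookup else 1
    return steps
```

For 100 extents this gives 100 steps with constant-time lookups but
5050 with linear ones, and the gap grows with the square of the extent
count.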
  87 * Heading: Pre-zeroing: a tale of two servers
  88 - 4400- .term
  89   - ./convert zeromode=plugin fastzeromode=ignore for server A
  90   - ./convert zeromode=emulate fastzeromode=ignore for server B
  91
  92 But do we really even need to use BLOCK_STATUS?  What if, instead, we
  93 just guarantee that the destination image starts life with all zeroes?
  94 After all, since WRITE_ZEROES has no network payload, we can just bulk
  95 pre-zero the image, and then skip I/O for source holes without having
  96 to do any further checks of destination status.  qemu 3.1 took this
  97 approach, but quickly ran into a surprise.  For server A, we have a
  98 speedup: fewer overall I/O transactions make us slightly faster than
  99 one WRITE_ZEROES per hole.  But for server B, we actually have a
 100 dramatic pessimization!  It turns out that when writing zeroes falls
 101 back to a normal write path, pre-zeroing the image now forces twice
 102 the I/O for any data portion of the image.
 103
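The doubled I/O can be checked with some back-of-the-envelope
arithmetic.  This is a toy model under stated assumptions: storage_bytes
is a hypothetical helper, a fast zero costs nothing, and a slow zero
costs as much as writing the same number of bytes.

```python
MiB = 1024 * 1024

def storage_bytes(image_mib, data_mib, zero_is_fast, prezero):
    """Bytes pushed through the server's write path for one copy."""
    hole_mib = image_mib - data_mib
    # Pre-zeroing covers the whole image; otherwise only holes are zeroed.
    zeroed_mib = image_mib if prezero else hole_mib
    zero_cost_mib = 0 if zero_is_fast else zeroed_mib
    return (data_mib + zero_cost_mib) * MiB
```

For the 100M test image with 50M of data, server A stays at 50M either
way, while server B goes from 100M with one zero request per hole to
150M with pre-zeroing: every data megabyte is written twice.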
 104 * Heading: qemu's solution
 105 - 4500- slide: graph of the scenarios
 106
 107 With the root cause of the pessimization understood, qemu folks
 108 addressed the situation by adding a flag BDRV_REQ_NO_FALLBACK in qemu
 109 4.0 (Apr 2019): when performing a pre-zeroing pass, we want zeroing to
 110 be fast: if it cannot succeed quickly, then it must fail rather than
 111 fall back to writes.  For server A, the pre-zero request succeeds, and
 112 we've avoided all further hole I/O; while for server B, the pre-zero
 113 request fails but we didn't lose any time to doubled-up I/O to data
 114 segments.  This sounds fine on paper, but has one problem: it requires
 115 server cooperation, and without that, the only sane default is to
 116 assume that zeroes are not fast, so while we avoided hurting server B,
 117 we ended up pessimizing server A back to one zero request per hole.
 118
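The fail-fast semantics can be sketched as follows.  This models the
idea only and is not qemu's block layer: ToyDest, its no_fallback
parameter, and prezero_pass are hypothetical names, and the use of
ENOTSUP as the "cannot do this quickly" error is my understanding of
how the flag reports failure.

```python
import errno

class ToyDest:
    """Hypothetical destination that may or may not zero quickly."""

    def __init__(self, zero_is_fast):
        self.zero_is_fast = zero_is_fast
        self.slow_bytes = 0   # bytes zeroed via the slow write path

    def write_zeroes(self, length, no_fallback=False):
        if self.zero_is_fast:
            return
        if no_fallback:
            # Fail quickly instead of falling back to plain writes.
            raise OSError(errno.ENOTSUP, "zeroing would fall back to writes")
        self.slow_bytes += length

def prezero_pass(dest, image_size):
    """Return True if the destination is now known to be all zeroes."""
    try:
        dest.write_zeroes(image_size, no_fallback=True)
        return True
    except OSError as e:
        if e.errno != errno.ENOTSUP:
            raise
        return False   # caller must still handle each hole individually
```

With a fast destination the pass succeeds and later holes can be
skipped; with a slow one it fails immediately and no bytes have been
written twice.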
 119 * Heading: Protocol extension: time to pull a fast one
 120 - 4600- .term with
 121   - ./convert zeromode=plugin fastzeromode=default for server A
 122   - ./convert zeromode=emulate fastzeromode=default for server B
 123
 124 So the solution is obvious: let's have nbdkit, as server, perform the
 125 necessary cooperation for qemu to request a fast zero.  The NBD
 126 protocol added the extension NBD_CMD_FLAG_FAST_ZERO, taking care that
 127 both server and client must support the extension before it can be
 128 used, but that if it is not supported, we merely lose performance but
 129 do not corrupt image contents.  qemu 4.2 will be the first release
 130 that supports ideal performance for both server A and server B out
 131 of the box.
 132
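A sketch of what the negotiation rule looks like on the wire, assuming
the simple (non-extended) request header.  The numeric constants are as
I read them in the NBD protocol document and should be checked against
it before being relied on; zero_request is a hypothetical helper, not
part of libnbd or qemu.

```python
import struct

# Assumed values, taken from my reading of the NBD protocol document.
NBD_REQUEST_MAGIC = 0x25609513
NBD_CMD_WRITE_ZEROES = 6
NBD_CMD_FLAG_FAST_ZERO = 1 << 4
NBD_FLAG_SEND_WRITE_ZEROES = 1 << 6
NBD_FLAG_SEND_FAST_ZERO = 1 << 11

def zero_request(handle, offset, length, fast, transmission_flags):
    """Pack an NBD_CMD_WRITE_ZEROES request header (28 bytes, no payload)."""
    if not transmission_flags & NBD_FLAG_SEND_WRITE_ZEROES:
        raise ValueError("server did not advertise WRITE_ZEROES")
    command_flags = 0
    if fast:
        # The client may only set the flag if the server advertised it.
        if not transmission_flags & NBD_FLAG_SEND_FAST_ZERO:
            raise ValueError("server did not advertise FAST_ZERO")
        command_flags |= NBD_CMD_FLAG_FAST_ZERO
    # Big-endian: magic, command flags, type, handle, offset, length.
    return struct.pack(">IHHQQI", NBD_REQUEST_MAGIC, command_flags,
                       NBD_CMD_WRITE_ZEROES, handle, offset, length)
```

A client talking to an older server simply never takes the fast path,
which costs performance but cannot change what ends up in the image.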
 133 * Heading: Reference implementation
 134 - 4700- nbdkit filters/plugins that were adjusted
 135
 136 The qemu implementation was quite trivial (map the new NBD flag to the
 137 existing BDRV_REQ_NO_FALLBACK flag, in both client and server).  But
 138 to actually get the NBD extension into the protocol, it's better to
 139 prove that the extension will be interoperable with other NBD
 140 implementations.  So, the obvious second implementation is libnbd for
 141 a client (adding a new flag to nbd_zero, and support for mapping a new
 142 errno value, due out in 1.2), and nbdkit for a server (adding a new
 143 .can_fast_zero callback for plugins and filters, then methodically
 144 patching all in-tree files where it can be reasonably implemented,
 145 released in 1.14).  Among other things, the nbdkit changes to the
 146 nozero filter added the parameters I used in my demo for controlling
 147 whether to advertise and/or honor the semantics of the new flag.
 148
 149 [if time:] Note that the file plugin was not touched in the initial
 150 patches. This is because accurate support is harder than it looks:
 151 both fallocate(FALLOC_FL_ZERO_RANGE) and ioctl(BLKZEROOUT) can trigger
 152 fallbacks to slow writes, so we would need kernel support for new
 153 interfaces that guarantee fast failure.
 154
 155 * segue: XXX
 156 slide 4800-? Or just sentence leading into Rich's demos?
 157
 158 I just showed a case study of how nbdkit helped address a real-life
 159 optimization issue.  Now let's see some of the more esoteric things
 160 that the NBD protocol makes possible.