2019-kvm-forum/notes-04-fast-zero

Case study: Adding fast zero to NBD
[About 5 mins]

* Heading: Copying disk images

As Rich mentioned, qemu-img convert is a great tool for copying guest
images.  However, most guest images are sparse, and we want to avoid
naively reading lots of zeroes on the source and then writing lots of
zeroes on the destination, although that naive approach does make a
great baseline.

* Heading: Nothing to see here

The NBD extension of WRITE_ZEROES made it faster to write large blocks
of zeroes to the destination (less network traffic).  And to some
extent, BLOCK_STATUS on the source made it easy to learn where the
blocks of zeroes are in the source, and therefore when to use
WRITE_ZEROES.  But for an image that is rather fragmented (frequent
alternation of holes and data), that's still a lot of individual
commands sent over the wire, which can slow performance, especially in
scenarios where each command is serialized.  Can we do better?
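To make the fragmentation point concrete, here is a minimal sketch
that counts wire commands for two layouts of the same image.  The
helper name count_commands and the extent representation are made up
for illustration; it assumes one WRITE_ZEROES per hole and data writes
capped at a fixed size.

```python
# Hypothetical helper, not part of qemu or any NBD library: count the
# commands a client would send if every hole becomes one WRITE_ZEROES
# and every data extent is split into writes of at most max_write bytes.
MiB = 1024 * 1024

def count_commands(extents, max_write):
    """extents: list of (length_in_bytes, is_hole) in image order."""
    zero_cmds = 0
    write_cmds = 0
    for length, is_hole in extents:
        if is_hole:
            zero_cmds += 1                         # no payload on the wire
        else:
            write_cmds += -(-length // max_write)  # ceiling division
    return zero_cmds, write_cmds

# 100 MiB alternating 1 MiB of data with 1 MiB of hole.
fragmented = [(MiB, i % 2 == 1) for i in range(100)]
# Same amount of data and hole, but as two contiguous extents.
contiguous = [(50 * MiB, False), (50 * MiB, True)]

print(count_commands(fragmented, MiB))   # (50, 50)
print(count_commands(contiguous, MiB))   # (1, 50)
```

The data traffic is identical in both layouts; only the number of zero
commands grows with fragmentation.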

* Heading: What's the status?

We could check before using WRITE_ZEROES whether the destination is
already zero.  If we get lucky, we can even learn from a single
BLOCK_STATUS that the entire image started life as all zeroes, so that
no further work is needed for any of the source holes.  But luck isn't
always on our side: BLOCK_STATUS is itself an extension and not
present on all servers.  Worse, at least tmpfs has an issue where
lseek(SEEK_HOLE) is O(n) rather than O(1), so querying status for
every hole turns our linear walk into an O(n^2) ordeal; we don't want
to use it more than once.  So for the rest of my case study, I
investigated what happens when BLOCK_STATUS is unavailable (which is
in fact the case with qemu 3.1).
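The O(n^2) claim can be checked with a toy cost model.  This is only
arithmetic, not a measurement: it assumes n is the number of holes,
that one status query is issued per hole, and that an O(n) query costs
n abstract units.  All names are invented for the sketch.

```python
def total_status_cost(num_holes, cost_of_one_query):
    """Sum the cost of issuing one status query per hole."""
    return sum(cost_of_one_query(num_holes) for _ in range(num_holes))

def constant_cost(n):
    return 1    # models an O(1) lseek(SEEK_HOLE)

def linear_cost(n):
    return n    # models an O(n) lseek(SEEK_HOLE), the tmpfs-like case

for holes in (10, 100, 1000):
    print(holes,
          total_status_cost(holes, constant_cost),
          total_status_cost(holes, linear_cost))
```

Ten times as many holes costs ten times as much with the constant
model, but a hundred times as much with the linear one.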

* Heading: Tale of two servers

What happens if we start by pre-zeroing the entire destination (either
because BLOCK_STATUS proved the image did not start as zero, or
because it was unavailable)?  Then the remainder of the copy only has
to worry about the data portions of the source, and need not revisit
the holes; fewer commands over the wire should result in better
performance, right?  But in practice, we discovered an odd effect:
some servers were indeed faster this way, but others were actually
slower than the baseline of just writing the entire image in a single
pass.

* Heading: The problem

Even though WRITE_ZEROES results in less network traffic, the
implementation on the server varies widely: in some servers, it really
is an O(1) request to bulk-zero a large portion of the disk, but in
others, it is merely sugar for an O(n) write of actual zeroes.  When
pre-zeroing an image, if you have an O(1) server, you save time, but
if you have an O(n) server, then you are actually taking the time to
do two full writes to every portion of the disk containing data.
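A rough model of the two strategies shows where the double write comes
from.  It is a sketch under simplifying assumptions: an O(1) zero is
counted as writing nothing, a single WRITE_ZEROES is allowed to cover
the whole image, and the function names are invented here.

```python
def baseline(data_mib, hole_count, hole_mib, zero_is_fast):
    """Single pass: write the data, send one WRITE_ZEROES per hole."""
    written = data_mib + (0 if zero_is_fast else hole_mib)
    return {"zero_commands": hole_count, "mib_written": written}

def prezero(data_mib, hole_count, hole_mib, zero_is_fast):
    """Zero the whole image once, then write only the data."""
    image_mib = data_mib + hole_mib
    written = (0 if zero_is_fast else image_mib) + data_mib
    return {"zero_commands": 1, "mib_written": written}

# 100 MiB image: 50 MiB of data, 50 holes of 1 MiB each.
for fast in (True, False):
    print("O(1) zero" if fast else "O(n) zero",
          baseline(50, 50, 50, fast),
          prezero(50, 50, 50, fast))
```

With an O(1) server, pre-zeroing sends one zero command instead of 50
for the same amount of real writing.  With an O(n) server, it writes
150 MiB instead of 100 MiB, because every data region is written once
as zeroes and once as data.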

* Heading: qemu's solution

The pre-zeroing pass is supposed to be an optimization: what if we can
guarantee that the server will only attempt pre-zeroing with an O(1)
implementation, and otherwise return an error so that the client can
fall back to piecewise linear writes?  qemu added this with
BDRV_REQ_NO_FALLBACK in qemu 4.0, but it has to guess pessimistically:
any server not known to be O(1) is assumed to be O(n), even if that
assumption is wrong.  And since NBD did not have a way to tell the
client which implementation is in use, we lost out on the speedup for
server A, but at least no longer have the pessimisation for server B.
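The try-then-fall-back idea can be sketched as follows.  This is not
qemu code: FakeServer, its no_fallback argument, and copy are
hypothetical stand-ins that only mimic the control flow, with ENOTSUP
chosen here as the refusal error.

```python
import errno

class FakeServer:
    """Hypothetical destination; not a real qemu or NBD API."""
    def __init__(self, zero_is_fast):
        self.zero_is_fast = zero_is_fast
        self.log = []

    def zero(self, offset, length, no_fallback=False):
        if no_fallback and not self.zero_is_fast:
            # Refuse rather than silently writing real zeroes.
            raise OSError(errno.ENOTSUP, "fast zero not possible")
        self.log.append(("zero", offset, length))

    def write(self, offset, length):
        self.log.append(("write", offset, length))

def copy(server, extents):
    """extents: (offset, length, is_hole) tuples tiling the image from 0."""
    size = sum(length for _, length, _ in extents)
    try:
        server.zero(0, size, no_fallback=True)   # optimistic pre-zero
        prezeroed = True
    except OSError as e:
        if e.errno != errno.ENOTSUP:
            raise
        prezeroed = False                        # fall back to one pass
    for offset, length, is_hole in extents:
        if not is_hole:
            server.write(offset, length)
        elif not prezeroed:
            server.zero(offset, length)
    return prezeroed

extents = [(0, 1, False), (1, 1, True), (2, 1, False), (3, 1, True)]
fast, slow = FakeServer(True), FakeServer(False)
print(copy(fast, extents), fast.log)
print(copy(slow, extents), slow.log)
```

The slow server ends up with exactly the baseline sequence of
commands, so the failed attempt costs one round trip and nothing more.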

* Heading: Time for a protocol extension

Since qemu-img convert can benefit from knowing whether a server's
zero operation is fast, it follows that an NBD server should offer
that information.  The next step was to propose an extension to the
protocol that preserves backwards compatibility: both client and
server must understand the extension to utilize it, and either side
lacking the extension should result in at most a loss of performance,
never in compromised contents.  The proposal was
NBD_CMD_FLAG_FAST_ZERO.
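The backwards-compatibility rule boils down to a small decision on the
client side, sketched below.  Only NBD_CMD_FLAG_FAST_ZERO is named in
these notes; the two advertisement flags and all three bit positions
are my reading of the NBD protocol document and should be checked
against it, and zero_command_flags is a made-up helper.

```python
# Bit values as I understand the NBD protocol document; verify before use.
NBD_FLAG_SEND_WRITE_ZEROES = 1 << 6    # server supports WRITE_ZEROES
NBD_FLAG_SEND_FAST_ZERO = 1 << 11      # server understands fast zero
NBD_CMD_FLAG_FAST_ZERO = 1 << 4        # per-command request flag

def zero_command_flags(server_flags, client_knows_fast_zero):
    """Flags for a pre-zeroing WRITE_ZEROES, or None to skip pre-zeroing."""
    if not server_flags & NBD_FLAG_SEND_WRITE_ZEROES:
        return None
    if client_knows_fast_zero and server_flags & NBD_FLAG_SEND_FAST_ZERO:
        return NBD_CMD_FLAG_FAST_ZERO
    # Either side lacks the extension: zero speed is unknown, so give up
    # the optimization rather than risk a slow double write.
    return None

new_server = NBD_FLAG_SEND_WRITE_ZEROES | NBD_FLAG_SEND_FAST_ZERO
old_server = NBD_FLAG_SEND_WRITE_ZEROES
print(zero_command_flags(new_server, True))    # 16
print(zero_command_flags(old_server, True))    # None
print(zero_command_flags(new_server, False))   # None
```

In every mismatched pairing the client simply skips the pre-zeroing
pass, so the image contents are the same and only speed is lost.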

* Heading: Reference implementation

No good protocol will add extensions without a reference
implementation.  For qemu, the implementation is quite simple for both
server and client: map a new flag in NBD to the existing qemu flag
BDRV_REQ_NO_FALLBACK; this will be landing in qemu 4.2.  But at the
same time, a single implementation, and an unreleased one at that, is
hardly proof of portability, so the NBD specification is reluctant to
codify things without some interoperability testing.

* Heading: Second implementation

So, the obvious second implementation is libnbd for a client (adding a
new flag to nbd_zero, and support for mapping a new errno value), and
nbdkit for a server (adding a new .can_fast_zero callback for plugins
and filters, then methodically patching all in-tree files where it can
be reasonably implemented).  Here, the power of filters stands out: by
adding a second parameter to the existing 'nozero' filter, I could
quickly change whether any plugin advertises and/or honors the
semantics of the new flag.

* Heading: Show me the numbers

When submitting the patches, a concrete example was the easiest way to
prove that they matter.  So I set up a test-bed on a 100M image with
every other megabyte being a hole:

.sh file with setup, core function

with filters in place to artificially slow the image down (data writes
slower than zeroes, no data write larger than 256k, block status
disabled) and to observe behavior (log to see what the client requests
based on handshake results, stats to get timing numbers for overall
performance).  Then by tweaking the 'nozero' filter parameters, I was
able to recreate qemu 2.11 behavior (baseline straight copy), qemu 3.1
behavior (blind pre-zeroing pass, with a speedup or slowdown depending
on the zeroing implementation), qemu 4.0 behavior (no speedups without
detection of fast zero support, but at least nothing worse than
baseline), and qemu 4.2 behavior (take advantage of pre-zeroing when
it helps, while still nothing worse than baseline).
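The .sh setup itself is not reproduced in these notes.  As an
independent sketch, a test image of the shape described (100M, every
other megabyte a hole) could be built like this.  The function name,
the choice of even megabytes for data, and the fill byte are
assumptions, and whether the gaps are stored as real holes depends on
the filesystem.

```python
import os
import tempfile

MiB = 1024 * 1024

def make_test_image(path, size_mib=100):
    """Create an image where even megabytes hold data and odd ones are holes."""
    with open(path, "wb") as f:
        f.truncate(size_mib * MiB)          # start as one big hole
        for mib in range(0, size_mib, 2):   # assumption: data goes first
            f.seek(mib * MiB)
            f.write(b"\x01" * MiB)          # any non-zero payload will do
    return path

with tempfile.TemporaryDirectory() as tmp:
    image = make_test_image(os.path.join(tmp, "fragmented.img"))
    print(os.path.getsize(image) // MiB, "MiB")
```

With data writes capped at 256k, the 50 data megabytes of such an
image take 200 write commands, regardless of how the holes are
handled.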