Case study: Adding fast zero to NBD
-[About 5 mins]
+[About 5-6 mins]
+
+- based heavily on https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html
* Heading: Case study baseline
-- link to https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html
-- shell to pre-create source file
+- 4000- shell to pre-create source file
+- baseline is about 8.5s
As Rich mentioned, qemu-img convert is a great tool for copying guest
images, with support for NBD on both source and destination. However,
most guest images are sparse, and we want to avoid naively reading
lots of zeroes on the source then writing lots of zeroes on the
destination. Here's a case study of optimizing that, starting with a
-baseline of straightline copying.
-
-* Heading: Nothing to see here
-
-The NBD extension of WRITE_ZEROES made it faster to write large blocks
-of zeroes to the destination (less network traffic). And to some
-extent, BLOCK_STATUS on the source made it easy to learn where blocks
-of zeroes are in the source, for then knowing when to use
-WRITE_ZEROES. But for an image that is rather fragmented (frequent
-alternation of holes and data), that's still a lot of individual
-commands sent over the wire, which can slow performance especially in
-scenarios when each command is serialized. Can we do better?
+baseline of straight-line copying, which matches qemu 3.0 (Aug 2018)
+behavior.
+Let's convert a 100M image, which alternates between data and holes at
+each megabyte.
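+
+For reference, one way to pre-create such a source image (a sketch
+only; the talk's actual setup script may differ, and the file name
+source.img is just a placeholder):
+
+  # 100M sparse file: data in every even megabyte, a hole in every
+  # odd one
+  rm -f source.img
+  truncate --size=100M source.img
+  for i in $(seq 0 2 99); do
+      dd if=/dev/urandom of=source.img bs=1M seek=$i count=1 \
+         conv=notrunc status=none
+  done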
+
+* Heading: Dissecting that command
+- 41x0 - series of html pages to highlight aspects of the ./convert command
+ - nbdkit, plugin server, --run command
+ - delay filter
+ - blocksize filter
+ - stats, log filters
+ - noextents filter
+ - nozero filter
+
+Okay, I went a bit fast on that ./convert command. Looking closer, it
+uses nbdkit as the server, with the memory plugin as the data sink,
+tied to a single invocation of qemu-img convert as the source over a
+Unix socket. There are lots of nbdkit filters in play. The delay
+filter slows down writes so that a small disk is still useful for
+performance testing, while leaving the server's default zero
+operation faster than writes. The blocksize filter forces writes to
+split smaller than the 1M striping of the source image, to
+demonstrate that a single write zero operation (which needs no
+explicit network payload) can often cover a larger swath of the file
+in one operation than writes (which have a maximum payload per
+operation). The stats and log filters collect the overall time spent
+and which operations the client attempted, with a sleep added to
+avoid an output race. For this case study, the noextents filter
+disables BLOCK_STATUS (more on why later). And finally, the nozero
+filter is our knob for what the server advertises to the client, as
+well as how it responds to various zero-related requests.
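+
+For reference, the nbdkit command hiding behind ./convert is shaped
+roughly like this (a sketch only: the real script parameterizes
+zeromode/fastzeromode, and the delay, blocksize and file names shown
+here are illustrative, not the talk's exact values):
+
+  nbdkit -U - memory size=100M \
+         --filter=stats statsfile=/dev/stderr \
+         --filter=log logfile=convert.log \
+         --filter=nozero zeromode=emulate fastzeromode=none \
+         --filter=noextents \
+         --filter=blocksize maxdata=256k \
+         --filter=delay wdelay=100ms \
+         --run 'qemu-img convert -n -f raw source.img -O raw \
+                "nbd+unix:///?socket=$unixsocket"; sleep 1'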
+
+* Heading: Writing zeroes: much ado about nothing
+- 4200- .term
+ - ./convert zeromode=plugin fastzeromode=none for server A
+ - ./convert zeromode=emulate fastzeromode=none for server B
+
+In qemu 2.8 (Dec 2016), we implemented the NBD extension of
+WRITE_ZEROES, with the initial goal of reducing network traffic (no
+need to send an explicit payload of all zero bytes over the network).
+However, the spec was intentionally loose about implementation,
+allowing two common scenarios. With server A, writing zeroes is
+heavily optimized: a simple constant-time metadata notation completes
+the operation no matter how much is being zeroed, so we see an
+immediate benefit in execution time even though the number of I/O
+transactions did not drop. With server B, writing zeroes populates a
+buffer in the server and then goes through the same path as a normal
+WRITE command, for no real difference from the baseline. But can we
+do better?
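+
+Before answering that, a quick illustration of the difference on the
+wire (an aside, not one of the talk's demos): the first nbdsh command
+below pushes a full 1M payload of literal zero bytes, while the
+second sends a single WRITE_ZEROES request with no payload at all.
+
+  nbdkit -U - memory size=100M --run '
+    nbdsh -u "nbd+unix:///?socket=$unixsocket" \
+          -c "h.pwrite(bytes(1 << 20), 0)" \
+          -c "h.zero(1 << 20, 1 << 20)"
+  '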
* Heading: What's the status?
-
-We could check before using WRITE_ZEROES whether the destination is
-already zero. If we get lucky, we can even learn from a single
-BLOCK_STATUS that the entire image already started life as all zeroes,
-so that there is no further work needed for any of the source holes.
-But luck isn't always on our side: BLOCK_STATUS itself is an extension
-and not present on all servers; and worse, at least tmpfs has an issue
-where lseek(SEEK_HOLE) is O(n) rather than O(1), so querying status
-for every hole turns our linear walk into an O(n^2) ordeal, so we
-don't want to use it more than once. So for the rest of my case
-study, I investigated what happens when BLOCK_STATUS is unavailable
-(which is in fact the case with qemu 3.0).
-
-* Heading: Tale of two servers
-
-What happens if we start by pre-zeroing the entire destination (either
-because BLOCK_STATUS proved the image did not start as zero, or was
-unavailable)? Then the remainder of the copy only has to worry about
-source data portions, and not revisit the holes; fewer commands over
-the wire should result in better performance, right? But in practice,
-we discovered an odd effect - some servers were indeed faster this
-way, but others were actually slower than the baseline of just writing
-the entire image in a single pass. This pessimization appeared in
-qemu 3.1.
-
-* Heading: The problem
-
-Even though WRITE_ZEROES results in less network traffic, the
-implementation on the server varies widely: in some servers, it really
-is an O(1) request to bulk-zero a large portion of the disk, but in
-others, it is merely sugar for an O(n) write of actual zeroes. When
-pre-zeroing an image, if you have an O(1) server, you save time, but
-if you have an O(n) server, then you are actually taking the time to
-do two full writes to every portion of the disk containing data.
+- 4300- slide: why qemu-img convert wants at most one block status
+
+Do we even have to worry whether WRITE_ZEROES will be fast or slow?
+If we know that the destination already contains all zeroes, we could
+entirely skip destination I/O for each hole in the source. qemu 2.12
+added support for NBD_CMD_BLOCK_STATUS to quickly learn whether a
+portion of a disk is a hole. But experiments with qemu-img convert
+showed that using BLOCK_STATUS as a way to avoid WRITE_ZEROES didn't
+really help, for a couple of reasons. If writing zeroes is fast,
+checking the destination first is either a mere tradeoff in commands
+(BLOCK_STATUS replacing WRITE_ZEROES when the destination is already
+zero) or a pessimization (BLOCK_STATUS still has to be followed by
+WRITE_ZEROES). And if writing zeroes is slow, we only get a speedup
+when BLOCK_STATUS itself is fast on pre-existing destination holes;
+but we encountered situations such as a destination on tmpfs, where
+lseek(SEEK_HOLE) is linear rather than constant-time, so repeated
+BLOCK_STATUS calls turned the whole copy into quadratic behavior.
+Thus, for now, qemu-img convert does not use BLOCK_STATUS, and as
+mentioned earlier, I used the noextents filter in my test case to
+ensure BLOCK_STATUS is not interfering with the timing results.
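+
+As a related aside (not one of the talk's demos), a quick way to see
+NBD_CMD_BLOCK_STATUS in action is to point qemu-img map at an export:
+a fresh memory plugin reports one big hole, while re-running the same
+command with --filter=noextents added reports the image as all data.
+
+  nbdkit -U - memory size=100M \
+         --run 'qemu-img map --output=json -f raw \
+                "nbd+unix:///?socket=$unixsocket"'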
+
+* Heading: Pre-zeroing: a tale of two servers
+- 4400- .term
+ - ./convert zeromode=plugin fastzeromode=ignore for server A
+ - ./convert zeromode=emulate fastzeromode=ignore for server B
+
+But do we really even need to use BLOCK_STATUS? What if, instead, we
+just guarantee that the destination image starts life with all zeroes?
+After all, since WRITE_ZEROES has no network payload, we can just bulk
+pre-zero the image, and then skip I/O for source holes without having
+to do any further checks of destination status. qemu 3.1 took this
+approach, but quickly ran into a surprise. For server A, we have a
+speedup: fewer overall I/O transactions make us slightly faster than
+one WRITE_ZEROES per hole. But for server B, we actually have a
+dramatic pessimization! It turns out that when writing zeroes falls
+back to a normal write path, pre-zeroing the image now forces twice
+the I/O for any data portion of the image.
* Heading: qemu's solution
-
-The pre-zeroing pass is supposed to be an optimization: what if we can
-guarantee that the server will only attempt pre-zeroing with an O(1)
-implementation, and return an error to allow a fallback to piecewise
-linear writes otherwise? Qemu added this with BDRV_REQ_NO_FALLBACK
-for qemu 4.0, but has to guess pessimistically: any server not known
-to be O(1) is assumed to be O(n), even if that assumption is wrong.
-And since NBD did not have a way to tell the client which
-implementation is in use, we lost out on the speedup for server A, but
-at least no longer have the pessimisation for server B.
-
-* Heading: Time for a protocol extension
-
-Since qemu-img convert can benefit from knowing if a server's zero
-operation is fast, it follows that NBD should offer that information
-as a server. The next step was to propose an extension to the protocol
-that preserves backwards compatibility (both client and server must
-understand the extension to utilize it, and either side lacking the
-extension should result in at most a lack of performance, but not
-compromise contents). The proposal was NBD_CMD_FLAG_FAST_ZERO.
+- 4500- slide: graph of the scenarios
+
+With the root cause of the pessimization understood, the qemu folks
+addressed the situation by adding the flag BDRV_REQ_NO_FALLBACK in
+qemu 4.0 (Apr 2019): when performing a pre-zeroing pass, we want
+zeroing to be fast, and if it cannot succeed quickly, it must fail
+rather than fall back to writes. For server A, the pre-zero request
+succeeds and we've avoided all further hole I/O; for server B, the
+pre-zero request fails, but we no longer lose time to doubled-up I/O
+on data segments. This sounds fine on paper, but it has one problem:
+it requires server cooperation. Without that, the only sane default
+is to assume that zeroes are not fast, so while we avoided hurting
+server B, we pessimized server A back to one zero request per hole.
+
+* Heading: Protocol extension: time to pull a fast one
+- 4600- .term with
+ - ./convert zeromode=plugin fastzeromode=default for server A
+ - ./convert zeromode=emulate fastzeromode=default for server B
+
+So the solution is obvious: let's make the server (here, nbdkit)
+cooperate, so that qemu can request a fast zero. The NBD protocol
+gained the extension NBD_CMD_FLAG_FAST_ZERO, taking care that both
+server and client must support the extension before it can be used,
+and that if either side lacks it, we merely lose performance rather
+than corrupt image contents. qemu 4.2 will be the first release to
+support ideal performance for both server A and server B out of the
+box.
* Heading: Reference implementation
-
-No good protocol will add extensions without a reference
-implementation. And for qemu, the implementation for both server and
-client is quite simple, map a new flag in NBD to the existing qemu
-flag of BDRV_REQ_NO_FALLBACK, for both server and client; this will be
-landing in qemu 4.2. But at the same time, a single implementation,
-and unreleased at that, is hardly portable, so the NBD specification
-is reluctant to codify things without some interoperability testing.
-
-* Heading: Second implementation
-
-So, the obvious second implementation is libnbd for a client (adding a
-new flag to nbd_zero, and support for mapping a new errno value), and
-nbdkit for a server (adding a new .can_fast_zero callback for plugins
-and filters, then methodically patching all in-tree files where it can
-be reasonably implemented). Here, the power of filters stands out: by
-adding a second parameter to the existing 'nozero' filter, I could
-quickly change the behavior of any plugin on whether to advertise
-and/or honor the semantics of the new flag.
-
-* Heading: Show me the numbers
-
-When submitting the patches, a concrete example was easiest to prove
-the patch matters. So I set up a test-bed on a 100M image with every
-other megabyte being a hole:
-
-.sh file with setup, core function
-
-and with filters in place to artificially slow the image down (data
-writes slower than zeroes, and no data write larger than 256k, block
-status disabled) and observe behavior (log to see what the client
-requests based on handshake results, stats to get timing numbers for
-overall performance). Then by tweaking the 'nozero' filter
-parameters, I was able to recreate qemu 3.0 behavior (baseline
-straight copy), qemu 3.1 behavior (blind pre-zeroing pass with speedup
-or slowdown based on zeroing implementation), qemu 4.0 behavior (no
-speedups without detection of fast zero support, but at least nothing
-worse than baseline), and qemu 4.2 behavior (take advantage of
-pre-zeroing when it helps, while still nothing worse than baseline).
+- 4700- nbdkit filters/plugins that were adjusted
+
+The qemu implementation was quite trivial (map the new NBD flag to the
+existing BDRV_REQ_NO_FALLBACK flag, in both client and server). But
+to actually get the extension into the NBD protocol, it's best to
+prove that it is interoperable across independent NBD
+implementations. So, the obvious second implementation is libnbd for
+a client (adding a new flag to nbd_zero, and support for mapping a new
+errno value, due out in 1.2), and nbdkit for a server (adding a new
+.can_fast_zero callback for plugins and filters, then methodically
+patching all in-tree files where it can be reasonably implemented,
+released in 1.14). Among other things, the nbdkit changes to the
+nozero filter added the parameters I used in my demo for controlling
+whether to advertise and/or honor the semantics of the new flag.
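+
+As a sketch of how those pieces fit together (not one of the talk's
+demos; assumes libnbd >= 1.2 and nbdkit >= 1.14), a libnbd client can
+probe what the nozero filter advertises before deciding whether a
+fast zero request is even worth sending:
+
+  nbdkit -U - --filter=nozero memory size=100M \
+         zeromode=emulate fastzeromode=none \
+         --run 'nbdsh -u "nbd+unix:///?socket=$unixsocket" \
+                      -c "print(h.can_fast_zero())"'
+
+Swapping the fastzeromode parameter changes both what this probe
+reports and how the server reacts to a zero request that carries the
+fast zero flag (nbd.CMD_FLAG_FAST_ZERO in nbdsh terms).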
+
+* segue: XXX
+slide 4800-? Or just sentence leading into Rich's demos?
+
+I just showed a case study of how nbdkit helped address a real-life
+optimization issue. Now let's see some of the more esoteric things
+that the NBD protocol makes possible.