--- /dev/null
+Case study: Adding fast zero to NBD
+[About 5 mins]
+
+* Heading: Copying disk images
+
+As Rich mentioned, qemu-img convert is a great tool for copying guest
+images. However, most guest images are sparse, and we want to avoid
+naively reading lots of zeroes on the source and then writing lots of
+zeroes on the destination; still, that naive copy makes a useful
+baseline to compare against.
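+
+For concreteness, here is a minimal sketch of that baseline copy to an
+NBD destination (the socket path is my own placeholder; any NBD server
+exporting a raw image at least as large as the source would do):
+
+# Baseline: read everything from the source, write everything to the
+# destination.  -n skips target creation, since an NBD export already
+# has a fixed size.
+qemu-img convert -n -f raw -O raw source.img \
+    'nbd+unix:///?socket=/tmp/nbd.sock'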
+
+* Heading: Nothing to see here
+
+The NBD extension WRITE_ZEROES made it faster to write large blocks
+of zeroes to the destination (less network traffic). And to some
+extent, BLOCK_STATUS on the source made it easy to learn where the
+blocks of zeroes are, and therefore where WRITE_ZEROES can be used on
+the destination. But for an image that is rather fragmented (frequent
+alternation of holes and data), that is still a lot of individual
+commands sent over the wire, which can hurt performance, especially
+when each command is serialized. Can we do better?
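+
+One way to see that traffic for yourself (a sketch, not the demo from
+the talk) is to put nbdkit's log filter on the destination and count
+the zero requests the copy produces; the exact log format, and whether
+you see one big pre-zeroing request or one request per hole, depends
+on the nbdkit and qemu versions involved:
+
+# Captive nbdkit: serve a scratch 100M image, log every request, and
+# run the copy against it.  fragmented.img stands in for any sparse
+# raw image no larger than 100M.
+nbdkit -U - --filter=log memory size=100M logfile=convert.log \
+  --run 'qemu-img convert -n -f raw -O raw fragmented.img "nbd+unix:///?socket=$unixsocket"'
+grep -c Zero convert.log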
+
+* Heading: What's the status?
+
+We could check whether the destination is already zero before using
+WRITE_ZEROES. If we get lucky, a single BLOCK_STATUS reply may even
+tell us that the entire image already started life as all zeroes, so
+that no further work is needed for any of the source holes. But luck
+isn't always on our side: BLOCK_STATUS itself is an extension and not
+present on all servers. Worse, at least tmpfs has an issue where
+lseek(SEEK_HOLE) is O(n) rather than O(1), so querying status for
+every hole turns our linear walk into an O(n^2) ordeal; we therefore
+don't want to issue it more than once. So for the rest of my case
+study, I investigated what happens when BLOCK_STATUS is unavailable
+(which is in fact the case with qemu 3.1).
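+
+A quick way to perform that one-shot check from the command line (a
+sketch, reusing the placeholder socket from earlier): qemu-img map
+issues BLOCK_STATUS when the server offers it.
+
+# A freshly created sparse destination reports one big "zero": true
+# extent; a server without BLOCK_STATUS simply reports everything as
+# data, which tells us nothing useful.
+qemu-img map --output=json 'nbd+unix:///?socket=/tmp/nbd.sock'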
+
+* Heading: Tale of two servers
+
+What happens if we start by pre-zeroing the entire destination (either
+because BLOCK_STATUS proved the image did not start as zero, or
+because BLOCK_STATUS was unavailable)? Then the remainder of the copy
+only has to worry about the data portions of the source, and never
+revisits the holes; fewer commands over the wire should result in
+better performance, right? But in practice, we discovered an odd
+effect - some servers were indeed faster this way, but others were
+actually slower than the baseline of just writing the entire image in
+a single pass.
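+
+The pre-zeroing pass itself is just one bulk request; a sketch with
+nbdsh against the placeholder socket (a single request is fine for a
+small test image, though a real client would split larger images into
+chunks, and the server must support WRITE_ZEROES at all):
+
+# Ask the destination server to zero the whole export in one go.
+nbdsh -u 'nbd+unix:///?socket=/tmp/nbd.sock' \
+  -c 'h.zero(h.get_size(), 0)'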
+
+* Heading: The problem
+
+Even though WRITE_ZEROES results in less network traffic, the
+implementation on the server varies widely: in some servers, it really
+is an O(1) request to bulk-zero a large portion of the disk, but in
+others, it is merely sugar for an O(n) write of actual zeroes. When
+pre-zeroing an image, if you have an O(1) server, you save time; but
+if you have an O(n) server, you end up writing every data portion of
+the disk twice (once as zeroes, once with the real contents). On a
+100M image that is half data, that is 150M of writes instead of the
+baseline's 100M.
+
+* Heading: qemu's solution
+
+The pre-zeroing pass is supposed to be an optimization: what if the
+client could guarantee that the server will only attempt pre-zeroing
+when it has an O(1) implementation, and will return an error otherwise
+so the client can fall back to piecewise linear writes? Qemu added
+this internally as BDRV_REQ_NO_FALLBACK for qemu 4.0, but has to guess
+pessimistically: any server not known to be O(1) is assumed to be
+O(n), even if that assumption is wrong. And since NBD did not have a
+way to tell the client which implementation is in use, we lost out on
+the speedup for the fast server, but at least no longer have the
+pessimisation for the slow one.
+
+* Heading: Time for a protocol extension
+
+Since qemu-img convert can benefit from knowing whether a server's
+zero operation is fast, it follows that NBD servers should be able to
+offer that information. The next step was to propose a protocol
+extension that preserves backwards compatibility (both client and
+server must understand the extension to utilize it, and either side
+lacking the extension should result in at most a loss of performance,
+never corrupted contents). The proposal was NBD_CMD_FLAG_FAST_ZERO: a
+new flag the client can set on a write-zeroes request, telling the
+server to fail immediately if it cannot zero the range faster than it
+could write the same bytes as data, rather than falling back to a slow
+write.
+
+* Heading: Reference implementation
+
+No good protocol will add extensions without a reference
+implementation. For qemu, the implementation is quite simple: map the
+new NBD flag to the existing qemu flag BDRV_REQ_NO_FALLBACK, in both
+server and client; this will be landing in qemu 4.2. But a single
+implementation, and an unreleased one at that, is hardly proof of
+portability, so the NBD specification is reluctant to codify things
+without some interoperability testing.
+
+* Heading: Second implementation
+
+So the obvious second implementation is libnbd as a client (adding a
+new flag to nbd_zero, plus support for mapping a new errno value), and
+nbdkit as a server (adding a new .can_fast_zero callback for plugins
+and filters, then methodically patching all in-tree files where it can
+be reasonably implemented). Here, the power of filters stands out: by
+adding a second parameter to the existing 'nozero' filter, I could
+quickly control whether any plugin advertises and/or honors the
+semantics of the new flag.
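+
+From the client side, the intended usage pattern is easy to sketch
+with nbdsh (where $uri stands for the destination server's NBD URI,
+and assuming a libnbd new enough to expose the flag; a single request
+is fine for a small image):
+
+nbdsh -u "$uri" -c '
+if h.can_fast_zero():
+    try:
+        h.zero(h.get_size(), 0, nbd.CMD_FLAG_FAST_ZERO)
+        print("pre-zeroing was fast; now copy only the data portions")
+    except nbd.Error:
+        print("zeroing would be slow; skip the pre-zero pass")
+else:
+    print("server does not know about fast zero; assume the worst")
+'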
+
+* Heading: Show me the numbers
+
+When submitting the patches, a concrete example was the easiest way to
+prove that they matter. So I set up a test bed: a 100M image with
+every other megabyte being a hole:
+
+.sh file with setup, core function
+
+and with filters in place to artificially slow the image down (data
+writes slower than zeroes, no data write larger than 256k, block
+status disabled) and to observe behavior (the log filter to see what
+the client requests based on handshake results, the stats filter to
+get timing numbers for overall performance). Then, by tweaking the
+'nozero' filter parameters, I was able to recreate qemu 2.11 behavior
+(baseline straight copy), qemu 3.1 behavior (blind pre-zeroing pass
+with speedup or slowdown based on the zeroing implementation), qemu
+4.0 behavior (no speedups without detection of fast zero support, but
+at least nothing worse than baseline), and qemu 4.2 behavior (take
+advantage of pre-zeroing when it helps, while still never doing worse
+than baseline).
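+
+What follows is only a rough sketch of such a test bed, not the actual
+demo script: the filter parameters are my own choices from memory, and
+the nozero filter (whose parameters drive the four behaviors above) is
+omitted rather than guess at its exact spellings.
+
+# Create a 100M source with every other megabyte a hole.
+rm -f source.img
+truncate -s 100M source.img
+for i in $(seq 0 2 99); do
+    dd if=/dev/urandom of=source.img bs=1M count=1 seek=$i \
+       conv=notrunc status=none
+done
+
+# Serve a 100M scratch destination, slowed down and instrumented:
+#   blocksize - split data writes larger than 256k
+#   delay     - charge each data write chunk, so data is slower than zeroes
+#   noextents - hide block status from the client
+#   log/stats - record what the client sent and how long it all took
+nbdkit -U - \
+  --filter=blocksize --filter=delay --filter=noextents \
+  --filter=log --filter=stats \
+  memory size=100M \
+  maxdata=256k delay-write=100ms \
+  logfile=convert.log statsfile=convert.stats \
+  --run 'qemu-img convert -n -f raw -O raw source.img "nbd+unix:///?socket=$unixsocket"'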