Case study: Adding fast zero to NBD [About 5 mins]

* Heading: Case study baseline
- link to https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html
- shell to pre-create source file (sketched below)

As Rich mentioned, qemu-img convert is a great tool for copying guest images, with support for NBD on both source and destination. However, most guest images are sparse, and we want to avoid naively reading lots of zeroes on the source and then writing lots of zeroes on the destination. Here's a case study of optimizing that, starting from a baseline of straight-line copying.
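The notes reference that pre-creation script without including it; below is a minimal sketch of one way to write it, assuming a hypothetical file name (source.img) and reusing the 100M layout with every other megabyte a hole that the numbers section describes later.

  #!/bin/sh
  # Sketch only: pre-create a sparse, fragmented source image.
  # The name source.img and the exact layout are assumptions, not the
  # script actually used for the talk.
  rm -f source.img
  truncate -s 100M source.img            # starts life as one big hole
  for mb in $(seq 0 2 99); do            # put data in every even megabyte
      dd if=/dev/urandom of=source.img bs=1M count=1 seek=$mb \
         conv=notrunc status=none
  done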
* Heading: Nothing to see here

The NBD extension WRITE_ZEROES made it faster to write large blocks of zeroes to the destination (less network traffic). And to some extent, BLOCK_STATUS on the source made it easy to learn where the source's blocks of zeroes are, so the client knows when to use WRITE_ZEROES. But for an image that is rather fragmented (frequent alternation of holes and data), that is still a lot of individual commands sent over the wire, which can hurt performance, especially in scenarios where each command is serialized. Can we do better?

* Heading: What's the status?

We could check whether the destination is already zero before using WRITE_ZEROES. If we get lucky, we can even learn from a single BLOCK_STATUS that the entire image already started life as all zeroes, so that no further work is needed for any of the source holes. But luck isn't always on our side: BLOCK_STATUS is itself an extension and not present on all servers; and worse, at least tmpfs has an issue where lseek(SEEK_HOLE) is O(n) rather than O(1), so querying status for every hole turns our linear walk into an O(n^2) ordeal, meaning we don't want to issue it more than once. So for the rest of my case study, I investigated what happens when BLOCK_STATUS is unavailable (which is in fact the case with qemu 3.0).

* Heading: Tale of two servers

What happens if we start by pre-zeroing the entire destination (either because BLOCK_STATUS proved the image did not start as zero, or because BLOCK_STATUS was unavailable)? Then the remainder of the copy only has to worry about the source's data portions and never revisits the holes; fewer commands over the wire should result in better performance, right? But in practice we discovered an odd effect: some servers were indeed faster this way, but others were actually slower than the baseline of just writing the entire image in a single pass. This pessimization appeared in qemu 3.1.

* Heading: The problem

Even though WRITE_ZEROES results in less network traffic, the server-side implementation varies widely: in some servers it really is an O(1) request to bulk-zero a large portion of the disk, but in others it is merely sugar for an O(n) write of actual zeroes. When pre-zeroing an image, an O(1) server saves you time, but an O(n) server makes you spend the time to write twice to every portion of the disk that contains data.

* Heading: qemu's solution

The pre-zeroing pass is supposed to be an optimization: what if we could guarantee that the server only attempts the pre-zeroing pass when it has an O(1) implementation, and returns an error otherwise so the client can fall back to piecewise linear writes? Qemu added this with BDRV_REQ_NO_FALLBACK for qemu 4.0, but it has to guess pessimistically: any server not known to be O(1) is assumed to be O(n), even when that assumption is wrong. And since NBD had no way to tell the client which implementation is in use, we lost the speedup for server A, but at least we no longer have the pessimization for server B.

* Heading: Time for a protocol extension

Since qemu-img convert can benefit from knowing whether a server's zero operation is fast, it follows that the NBD server should offer that information. The next step was to propose an extension to the protocol that preserves backwards compatibility (both client and server must understand the extension to utilize it, and either side lacking the extension should cost at most performance, never the correctness of the copied contents). The proposal was NBD_CMD_FLAG_FAST_ZERO.

* Heading: Reference implementation

No good protocol will add extensions without a reference implementation. For qemu, the implementation is quite simple: map the new NBD flag to the existing BDRV_REQ_NO_FALLBACK flag, in both server and client; this will be landing in qemu 4.2. But a single implementation, and an unreleased one at that, hardly demonstrates portability, so the NBD specification is reluctant to codify things without some interoperability testing.

* Heading: Second implementation

So the obvious second implementation is libnbd as a client (adding a new flag to nbd_zero, plus support for mapping a new errno value) and nbdkit as a server (adding a new .can_fast_zero callback for plugins and filters, then methodically patching all in-tree plugins and filters where it can reasonably be implemented). Here the power of filters stands out: by adding a second parameter to the existing 'nozero' filter, I could quickly change whether any plugin advertises and/or honors the semantics of the new flag.

* Heading: Show me the numbers
- .sh file with setup, core function, and filters (sketched below)

When submitting the patches, a concrete example was the easiest way to prove that the patches matter. So I set up a test-bed on a 100M image with every other megabyte being a hole, with filters in place to artificially slow the image down (data writes slower than zeroes, no data write larger than 256k, block status disabled) and to observe behavior (a log to see what the client requests based on handshake results, stats to get timing numbers for overall performance). Then, by tweaking the 'nozero' filter parameters, I was able to recreate qemu 3.0 behavior (baseline straight copy), qemu 3.1 behavior (blind pre-zeroing pass, with a speedup or slowdown depending on the server's zeroing implementation), qemu 4.0 behavior (no speedup without detection of fast zero support, but at least nothing worse than baseline), and qemu 4.2 behavior (take advantage of pre-zeroing when it helps, while still doing nothing worse than baseline).
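For reference, here is a rough sketch of what such a test-bed script could look like. This is not the script that accompanied the patches; the nbdkit filter parameters shown (zeromode, fastzeromode, maxdata, delay-write, logfile, statsfile) are my best approximation and may differ between nbdkit versions.

  #!/bin/sh
  # Sketch only: serve a destination image through nbdkit with filters
  # that mimic the test-bed described above.

  rm -f dest.img
  truncate -s 100M dest.img              # placeholder destination

  # nozero:     control whether zero requests are passed through or
  #             emulated as O(n) writes, and whether fast zero support
  #             is advertised
  # blocksize:  cap data writes at 256k
  # delay:      make data writes slower than zero requests
  # noextents:  disable block status
  # log/stats:  record client requests and overall timing
  nbdkit --filter=nozero --filter=blocksize --filter=delay \
         --filter=noextents --filter=log --filter=stats \
         file dest.img \
         zeromode=emulate fastzeromode=none \
         maxdata=256k delay-write=100ms \
         logfile=requests.log statsfile=run.stats

  # Then drive the copy from the sparse source created earlier, e.g.:
  #   qemu-img convert -n -f raw -O raw source.img nbd://localhost
  # and tweak zeromode/fastzeromode to recreate the qemu 3.0/3.1/4.0/4.2
  # behaviors described above.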