Case study: Adding fast zero to NBD [About 5-6 mins]
- based heavily on https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html

* Heading: Case study baseline
- 4000- shell to pre-create source file
- baseline is about 8.5s

As Rich mentioned, qemu-img convert is a great tool for copying guest images, with support for NBD on both source and destination. However, most guest images are sparse, and we want to avoid naively reading lots of zeroes on the source and then writing lots of zeroes on the destination. Here's a case study of our last three years of optimizing that, starting with a baseline of straight-line copying, which matches qemu 2.7 behavior (Sep 2016). Let's convert a 100M image which alternates between data and holes at each megabyte. The ./convert command shown here is rather long; if you're interested in its origins, my patch submission in Aug 2019 goes into more detail. For now, just think of it as a fancy way to run 'qemu-img convert' against a server where I can tweak server behavior to control which zeroing-related features are advertised or implemented.

* Heading: Writing zeroes: much ado about nothing
- 4200- .term
- ./convert zeromode=plugin fastzeromode=none for server A
- ./convert zeromode=emulate fastzeromode=none for server B

In qemu 2.8 (Dec 2016), we implemented the NBD WRITE_ZEROES extension, with the initial goal of reducing network traffic (no need to send an explicit payload of all zero bytes over the network). However, the spec was intentionally loose on implementation, with two common scenarios. With server A, the act of writing zeroes is heavily optimized - a simple constant-time metadata notation completes the operation regardless of how much is being zeroed - and we see an immediate benefit in execution time, even though the number of I/O transactions did not drop. With server B, writing zeroes populates a buffer in the server and then goes through the same path as a normal WRITE command, for no real difference from the baseline. But can we do better?

* Heading: What's the status?
- 4300- slide: why qemu-img convert wants at most one block status

Do we even have to worry about whether WRITE_ZEROES will be fast or slow? If we know that the destination already contains all zeroes, we could entirely skip destination I/O for each hole in the source. qemu 2.12 (Apr 2018) added support for NBD_CMD_BLOCK_STATUS to quickly learn whether a portion of a disk is a hole. But experiments with qemu-img convert showed that using BLOCK_STATUS as a way to avoid WRITE_ZEROES didn't really help, for a couple of reasons. If writing zeroes is fast, checking the destination first is either a mere tradeoff in commands (BLOCK_STATUS replacing WRITE_ZEROES when the destination is already zero) or a pessimization (BLOCK_STATUS still has to be followed by WRITE_ZEROES). Conversely, if writing zeroes is slow, the penalty of the extra check when the destination is not a hole could be in the noise, but whether we see a definite speedup from avoiding WRITE_ZEROES depends on whether BLOCK_STATUS itself is fast - yet we encountered situations such as tmpfs, whose lseek(SEEK_HOLE) implementation is linear rather than constant-time, where we ended up with quadratic behavior due entirely to BLOCK_STATUS calls. Thus, for now, qemu-img convert does not use BLOCK_STATUS.
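As an aside on why BLOCK_STATUS itself can be slow: when the export is backed by a plain file, a server typically answers "hole or data?" with lseek(SEEK_DATA)/lseek(SEEK_HOLE). The standalone sketch below (illustrative only, not qemu or nbdkit source) walks a file's extents that way; every extent boundary costs another lseek, so a filesystem whose lookup is linear in the file size, as tmpfs was, turns a chunk-by-chunk scan of a large image into quadratic work.

  /* Illustrative only: how "hole or data?" queries map onto
   * lseek(SEEK_DATA)/lseek(SEEK_HOLE) for a local file.  Each boundary
   * costs one lseek syscall; if the filesystem's lookup is linear,
   * repeating this per chunk goes quadratic. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int
  main (int argc, char *argv[])
  {
    if (argc != 2) {
      fprintf (stderr, "usage: %s FILE\n", argv[0]);
      return EXIT_FAILURE;
    }
    int fd = open (argv[1], O_RDONLY);
    if (fd == -1) { perror ("open"); return EXIT_FAILURE; }

    off_t end = lseek (fd, 0, SEEK_END);
    off_t pos = 0;
    while (pos < end) {
      /* Find the next data at or after pos; failure (ENXIO) means the
       * rest of the file is a hole. */
      off_t data = lseek (fd, pos, SEEK_DATA);
      if (data == -1)
        data = end;
      if (data > pos)
        printf ("hole: %jd-%jd\n", (intmax_t) pos, (intmax_t) data);
      if (data >= end)
        break;
      /* Find where that data run ends (the next hole). */
      off_t hole = lseek (fd, data, SEEK_HOLE);
      printf ("data: %jd-%jd\n", (intmax_t) data, (intmax_t) hole);
      pos = hole;
    }
    close (fd);
    return EXIT_SUCCESS;
  }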
* Heading: Pre-zeroing: a tale of two servers
- 4400- .term
- ./convert zeromode=plugin fastzeromode=ignore for server A
- ./convert zeromode=emulate fastzeromode=ignore for server B

But do we really even need to use BLOCK_STATUS? What if, instead, we just guarantee that the destination image starts life with all zeroes? After all, since WRITE_ZEROES has no network payload, we can bulk pre-zero the image and then skip I/O for source holes without having to do any further checks of destination status. qemu 3.1 took this approach, but quickly ran into a surprise. For server A, we have a speedup: fewer overall I/O transactions makes us slightly faster than one WRITE_ZEROES per hole. But for server B, we actually have a dramatic pessimization! It turns out that when writing zeroes falls back to a normal write path, pre-zeroing the image forces twice the I/O for any data portion of the image.

* Heading: qemu's solution
- 4500- slide: graph of the scenarios

With the root cause of the pessimization understood, qemu folks addressed the situation by adding a flag, BDRV_REQ_NO_FALLBACK, in qemu 4.0 (Apr 2019): when performing a pre-zeroing pass, we want zeroing to be fast; if it cannot succeed quickly, it must fail rather than fall back to writes. For server A, the pre-zero request succeeds, and we've avoided all further hole I/O; for server B, the pre-zero request fails, but we didn't lose any time to doubled-up I/O on data segments. This sounds fine on paper, but has one problem: it requires server cooperation, and without that, the only sane default is to assume that zeroes are not fast. So while we avoided hurting server B, we ended up pessimizing server A back to one zero request per hole.

* Heading: Protocol extension: time to pull a fast one
- 4600- .term with
- ./convert zeromode=plugin fastzeromode=default for server A
- ./convert zeromode=emulate fastzeromode=default for server B

So the solution is obvious: let's make nbdkit, as the server, perform the necessary cooperation for qemu to request a fast zero. The NBD protocol added the extension NBD_CMD_FLAG_FAST_ZERO, taking care that both server and client must support the extension before it can be used, and that if it is not supported we merely lose performance but do not corrupt image contents. qemu 4.2 will be the first release that supports ideal performance for both server A and server B out of the box.

* Heading: Reference implementation
- 4700- nbdkit filters/plugins that were adjusted

The qemu implementation was quite trivial (map the new NBD flag to the existing BDRV_REQ_NO_FALLBACK flag, in both client and server, due out in qemu 4.2). But to actually get the NBD extension into the protocol, it's best to prove that the extension is interoperable with other NBD implementations. So the obvious second implementations are libnbd as a client (adding a new flag to nbd_zero, plus support for mapping a new errno value, due out in 1.2) and nbdkit as a server (adding a new .can_fast_zero callback for plugins and filters, then methodically patching all in-tree filters and plugins where it can reasonably be implemented, due out in 1.16). Among other things, the nbdkit changes to the nozero filter added the parameters I used in my demo for controlling whether to advertise and/or honor the semantics of the new flag.
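For the client half, here is a minimal sketch (not the qemu code) of what the libnbd 1.2 API looks like in use: probe nbd_can_fast_zero, then request a zero with LIBNBD_CMD_FLAG_FAST_ZERO, which the server must either complete quickly or reject (surfacing to the caller as ENOTSUP). The nbd://localhost URI and the one-megabyte range are placeholders.

  /* Minimal libnbd (>= 1.2) client sketch; not taken from qemu. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <libnbd.h>

  int
  main (void)
  {
    struct nbd_handle *nbd = nbd_create ();
    if (nbd == NULL) {
      fprintf (stderr, "%s\n", nbd_get_error ());
      exit (EXIT_FAILURE);
    }
    /* Placeholder URI: point this at any NBD server. */
    if (nbd_connect_uri (nbd, "nbd://localhost") == -1) {
      fprintf (stderr, "%s\n", nbd_get_error ());
      exit (EXIT_FAILURE);
    }

    if (nbd_can_fast_zero (nbd) == 1) {
      /* Zero the first megabyte only if the server can do it quickly;
       * a server that would fall back to slow writes fails the request
       * instead (libnbd maps that reply to ENOTSUP). */
      if (nbd_zero (nbd, 1024 * 1024, 0, LIBNBD_CMD_FLAG_FAST_ZERO) == -1)
        fprintf (stderr, "fast zero refused: %s\n", nbd_get_error ());
    }
    else {
      /* Flag not negotiated: issue an ordinary write-zeroes request and
       * accept that it may be slow. */
      if (nbd_zero (nbd, 1024 * 1024, 0, 0) == -1)
        fprintf (stderr, "zero failed: %s\n", nbd_get_error ());
    }

    nbd_close (nbd);
    return EXIT_SUCCESS;
  }

Something like 'cc fast-zero-client.c $(pkg-config --cflags --libs libnbd)' should build it against an installed libnbd.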
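And for the server half, here is a toy nbdkit C plugin sketch (again, not one of the real in-tree plugins) against the v2 plugin API of nbdkit >= 1.16, showing where .can_fast_zero and NBDKIT_FLAG_FAST_ZERO fit in. The RAM-backed store means zeroing really is always fast here; the comment in .zero shows how a plugin whose zeroing might be slow is expected to refuse a fast-zero request.

  /* Toy RAM-backed nbdkit plugin sketch (v2 API, nbdkit >= 1.16); the
   * real file/memory plugins are more involved. */
  #define NBDKIT_API_VERSION 2
  #include <nbdkit-plugin.h>

  #include <assert.h>
  #include <errno.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  #define THREAD_MODEL NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS

  #define DISK_SIZE (100 * 1024 * 1024)
  static char *disk;

  static void
  fastzero_load (void)
  {
    disk = calloc (1, DISK_SIZE);
    assert (disk != NULL);        /* good enough for a demo */
  }

  static void *
  fastzero_open (int readonly)
  {
    return disk;                  /* no per-connection state needed */
  }

  static int64_t
  fastzero_get_size (void *handle)
  {
    return DISK_SIZE;
  }

  static int
  fastzero_pread (void *handle, void *buf, uint32_t count, uint64_t offset,
                  uint32_t flags)
  {
    memcpy (buf, disk + offset, count);   /* nbdkit bounds-checks requests */
    return 0;
  }

  static int
  fastzero_pwrite (void *handle, const void *buf, uint32_t count,
                   uint64_t offset, uint32_t flags)
  {
    memcpy (disk + offset, buf, count);
    return 0;
  }

  /* Advertise that zero requests here never degrade into slow writes. */
  static int
  fastzero_can_fast_zero (void *handle)
  {
    return 1;
  }

  static int
  fastzero_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags)
  {
    /* A plugin whose zeroing might be slow must instead do:
     *   if (flags & NBDKIT_FLAG_FAST_ZERO) {
     *     nbdkit_set_error (ENOTSUP);
     *     return -1;
     *   }
     * memset on RAM is cheap, so this toy always succeeds. */
    memset (disk + offset, 0, count);
    return 0;
  }

  static struct nbdkit_plugin plugin = {
    .name          = "fastzero-demo",
    .version       = "1.0",
    .load          = fastzero_load,
    .open          = fastzero_open,
    .get_size      = fastzero_get_size,
    .pread         = fastzero_pread,
    .pwrite        = fastzero_pwrite,
    .can_fast_zero = fastzero_can_fast_zero,
    .zero          = fastzero_zero,
  };

  NBDKIT_REGISTER_PLUGIN(plugin)

Built as a shared object (cc -shared -fPIC ...), it should be loadable with 'nbdkit ./fastzero-demo.so' and can be probed with the client sketch above.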
[if time:] Note that the file plugin was not touched in the initial patches. This is because accurate support is harder than it looks: both fallocate(FALLOC_FL_ZERO_RANGE) and ioctl(BLKZEROOUT) can trigger fallbacks to slow writes, so we would need kernel support for new interfaces that guarantee fast failure.

* segue: XXX slide 4800-? Or just sentence leading into Rich's demos?

I just showed a case study of how nbdkit helped address a real-life optimization issue. Now let's see some of the more esoteric things that the NBD protocol makes possible.