From 7264ba1591103d2a062e55a7f59dc8c0fd6b309f Mon Sep 17 00:00:00 2001
From: Eric Blake
Date: Wed, 23 Oct 2019 10:31:07 -0500
Subject: [PATCH] Initial fast zero case study notes.

Slides will be in the 4000- range
---
 2019-kvm-forum/notes-04-fast-zero | 118 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 118 insertions(+)
 create mode 100644 2019-kvm-forum/notes-04-fast-zero

diff --git a/2019-kvm-forum/notes-04-fast-zero b/2019-kvm-forum/notes-04-fast-zero
new file mode 100644
index 0000000..ee1271a
--- /dev/null
+++ b/2019-kvm-forum/notes-04-fast-zero
@@ -0,0 +1,118 @@
+Case study: Adding fast zero to NBD
+[About 5 mins]
+
+* Heading: Copying disk images
+
+As Rich mentioned, qemu-img convert is a great tool for copying guest
+images. However, most guest images are sparse, and we want to avoid
+naively reading lots of zeroes on the source then writing lots of
+zeroes on the destination, although this naive copy makes a great
+baseline.
+
+* Heading: Nothing to see here
+
+The NBD extension of WRITE_ZEROES made it faster to write large blocks
+of zeroes to the destination (less network traffic). And to some
+extent, BLOCK_STATUS on the source made it easy to learn where blocks
+of zeroes are in the source, and therefore when to use WRITE_ZEROES.
+But for an image that is rather fragmented (frequent alternation of
+holes and data), that's still a lot of individual commands sent over
+the wire, which can slow performance, especially in scenarios where
+each command is serialized. Can we do better?
+
+* Heading: What's the status?
+
+We could check before using WRITE_ZEROES whether the destination is
+already zero. If we get lucky, we can even learn from a single
+BLOCK_STATUS that the entire image already started life as all zeroes,
+so that there is no further work needed for any of the source holes.
+But luck isn't always on our side: BLOCK_STATUS itself is an extension
+and not present on all servers; and worse, at least tmpfs has an issue
+where lseek(SEEK_HOLE) is O(n) rather than O(1), so querying status
+for every hole turns our linear walk into an O(n^2) ordeal, meaning we
+don't want to use it more than once. So for the rest of my case
+study, I investigated what happens when BLOCK_STATUS is unavailable
+(which is in fact the case with qemu 3.1).
+
+* Heading: Tale of two servers
+
+What happens if we start by pre-zeroing the entire destination (either
+because BLOCK_STATUS proved the image did not start as all zeroes, or
+because it was unavailable)? Then the remainder of the copy only has
+to worry about source data portions, and need not revisit the holes;
+fewer commands over the wire should result in better performance,
+right? But in practice, we discovered an odd effect: some servers
+were indeed faster this way, but others were actually slower than the
+baseline of just writing the entire image in a single pass.
+
+* Heading: The problem
+
+Even though WRITE_ZEROES results in less network traffic, the
+implementation on the server varies widely: in some servers, it really
+is an O(1) request to bulk-zero a large portion of the disk, but in
+others, it is merely sugar for an O(n) write of actual zeroes. When
+pre-zeroing an image, if you have an O(1) server, you save time, but
+if you have an O(n) server, then you are actually taking the time to
+do two full writes to every portion of the disk containing data.
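+
+To make the difference concrete, here is a minimal C sketch of the two
+server-side strategies (illustrative only, not actual qemu or nbdkit
+code; the helper names are invented, and fallocate() with
+FALLOC_FL_ZERO_RANGE is just one Linux-specific way to get the O(1)
+behavior):
+
+  #define _GNU_SOURCE
+  #include <fcntl.h>         /* fallocate */
+  #include <linux/falloc.h>  /* FALLOC_FL_ZERO_RANGE */
+  #include <string.h>
+  #include <unistd.h>
+
+  /* O(1)-ish: ask the filesystem to zero the range via metadata. */
+  static int zero_range_fast(int fd, off_t offset, off_t length)
+  {
+      return fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length);
+  }
+
+  /* O(n): merely sugar for writing actual zeroes, with cost
+   * proportional to the length being zeroed. */
+  static int zero_range_slow(int fd, off_t offset, off_t length)
+  {
+      static const char buf[64 * 1024];  /* zero-initialized */
+      while (length > 0) {
+          size_t chunk = length > (off_t) sizeof buf
+              ? sizeof buf : (size_t) length;
+          ssize_t r = pwrite(fd, buf, chunk, offset);
+          if (r < 0)
+              return -1;
+          offset += r;
+          length -= r;
+      }
+      return 0;
+  }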
+
+* Heading: qemu's solution
+
+The pre-zeroing pass is supposed to be an optimization: what if we can
+guarantee that the server will only attempt pre-zeroing with an O(1)
+implementation, and return an error to allow a fallback to piecewise
+linear writes otherwise? Qemu added this with BDRV_REQ_NO_FALLBACK
+for qemu 4.0, but it has to guess pessimistically: any server not
+known to be O(1) is assumed to be O(n), even if that assumption is
+wrong. And since NBD did not have a way to tell the client which
+implementation is in use, we lost out on the speedup for server A, but
+at least we no longer have the pessimisation for server B.
+
+* Heading: Time for a protocol extension
+
+Since qemu-img convert can benefit from knowing whether a server's
+zero operation is fast, it follows that the NBD server should offer
+that information. The next step was to propose an extension to the
+protocol that preserves backwards compatibility (both client and
+server must understand the extension to utilize it, and either side
+lacking the extension should result in at most a loss of performance,
+but never compromised contents). The proposal was
+NBD_CMD_FLAG_FAST_ZERO.
+
+* Heading: Reference implementation
+
+No good protocol will add extensions without a reference
+implementation. For qemu, the implementation is quite simple: map the
+new NBD flag to the existing qemu flag BDRV_REQ_NO_FALLBACK, for both
+server and client; this will be landing in qemu 4.2. But at the same
+time, a single implementation, and an unreleased one at that, is
+hardly proof of portability, so the NBD specification is reluctant to
+codify things without some interoperability testing.
+
+* Heading: Second implementation
+
+So, the obvious second implementation is libnbd for a client (adding a
+new flag to nbd_zero, and support for mapping a new errno value), and
+nbdkit for a server (adding a new .can_fast_zero callback for plugins
+and filters, then methodically patching all in-tree files where it can
+be reasonably implemented). Here, the power of filters stands out: by
+adding a second parameter to the existing 'nozero' filter, I could
+quickly change whether any plugin advertises and/or honors the
+semantics of the new flag.
+
+* Heading: Show me the numbers
+
+When submitting the patches, a concrete example was the easiest way to
+prove that the patches matter. So I set up a test-bed on a 100M image
+with every other megabyte being a hole:
+
+.sh file with setup, core function
+
+and with filters in place to artificially slow the image down (data
+writes slower than zeroes, no data write larger than 256k, block
+status disabled) and to observe behavior (log to see what the client
+requests based on handshake results, stats to get timing numbers for
+overall performance). Then by tweaking the 'nozero' filter
+parameters, I was able to recreate qemu 2.11 behavior (baseline
+straight copy), qemu 3.1 behavior (blind pre-zeroing pass with speedup
+or slowdown based on zeroing implementation), qemu 4.0 behavior (no
+speedups without detection of fast zero support, but at least nothing
+worse than baseline), and qemu 4.2 behavior (take advantage of
+pre-zeroing when it helps, while still nothing worse than baseline).
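+
+For reference, a client-side sketch of the new flag in action, using
+the libnbd C API mentioned above (illustrative only, not the test
+harness used for the numbers: it assumes the nbd_can_fast_zero call
+and LIBNBD_CMD_FLAG_FAST_ZERO constant from the libnbd work, and it
+issues a single whole-image request, which is fine for a 100M test
+image but real code would split the request to honor server limits):
+
+  /* cc fastzero.c $(pkg-config --cflags --libs libnbd) */
+  #include <errno.h>
+  #include <stdio.h>
+  #include <stdlib.h>
+
+  #include <libnbd.h>
+
+  int
+  main (int argc, char *argv[])
+  {
+    const char *uri = argc > 1 ? argv[1] : "nbd://localhost";
+    struct nbd_handle *nbd = nbd_create ();
+
+    if (nbd == NULL || nbd_connect_uri (nbd, uri) == -1) {
+      fprintf (stderr, "%s\n", nbd_get_error ());
+      exit (EXIT_FAILURE);
+    }
+
+    int64_t size = nbd_get_size (nbd);
+
+    /* Only try a pre-zeroing pass if the server advertises support
+     * for the new flag; otherwise skip it and write zeroes inline. */
+    if (nbd_can_fast_zero (nbd) == 1) {
+      if (nbd_zero (nbd, size, 0, LIBNBD_CMD_FLAG_FAST_ZERO) == -1) {
+        /* A compliant server fails with ENOTSUP rather than falling
+         * back to a slow write of zeroes. */
+        if (nbd_get_errno () == ENOTSUP)
+          printf ("server cannot zero quickly; skipping pre-zero pass\n");
+        else
+          fprintf (stderr, "%s\n", nbd_get_error ());
+      }
+      else
+        printf ("destination pre-zeroed in one fast request\n");
+    }
+
+    nbd_close (nbd);
+    exit (EXIT_SUCCESS);
+  }
-- 
1.8.3.1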