From 9f049b65d2331a9f3dbc942de5a9783b919d842d Mon Sep 17 00:00:00 2001
From: Eric Blake
Date: Thu, 24 Oct 2019 15:24:04 -0500
Subject: [PATCH] Initial notes for 8000 slides on resize.

---
 2019-kvm-forum/notes-08-resize | 115 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 2019-kvm-forum/notes-08-resize

diff --git a/2019-kvm-forum/notes-08-resize b/2019-kvm-forum/notes-08-resize
new file mode 100644
index 0000000..c30ef81
--- /dev/null
+++ b/2019-kvm-forum/notes-08-resize
@@ -0,0 +1,115 @@
+Where to go from here: Adding resize to NBD
+[About 4-5 mins]
+
+- based somewhat on https://lists.debian.org/nbd/2017/01/msg00016.html
+
+* Heading: Bigger is better?
+- 8000- slide
+  - qemu -> (raw) -> qemu-nbd -> (qcow2) -> image.qcow2
+  - qemu -> (qcow2) -> qemu-nbd -> (raw) -> image.qcow2
+
+XXX With all the things we've added to NBD, what do we want to add
+next? Our biggest goal (pardon the pun) is to allow dynamic growth of
+image sizes.
+
+There are two ways to consume qcow2 images over NBD. In the first,
+the server reads the qcow2 file and exposes only the raw guest-visible
+content to the client. If the guest writes a lot, the server may grow
+the .qcow2 file as needed, but the guest cannot change the size of the
+guest-visible address range, and cannot access any qcow2 features such
+as backing files, dirty bitmaps, or internal snapshots.
+
+In the second, the server exposes the qcow2 file as-is, and the client
+must then parse that metadata into guest content. The client now has
+access to all qcow2 features (including the QMP block_resize command
+for altering the size reported to the guest). However, it cannot
+change the size of the underlying .qcow2 container; if more guest
+writes and metadata actions occur than the original server size
+supports, the operation fails with ENOSPC. Use of preallocation can
+work around this limitation, but it is painful enough to pre-size
+things correctly that current documentation recommends always running
+in the first mode (raw over the wire) rather than this mode (qcow2
+over the wire).
+
+The next few slides will discuss design tradeoffs to be considered
+when adding a resize extension.
+
+* Heading: Automatic or explicit
+- 8100- slide
+  - automatic: NBD_CMD_WRITE past EOF -> server auto-resizes if possible
+  - explicit: NBD_CMD_WRITE past EOF fails, NBD_CMD_RESIZE to update,
+    NBD_CMD_WRITE now succeeds.
+
+POSIX files support automatic growth, insofar as the underlying file
+system still has room. However, block devices do not. Should NBD
+require an explicit NBD_CMD_RESIZE before allowing access to
+additional size, or can NBD_CMD_WRITE extending past EOF trigger an
+automatic resize? Should we guarantee zero contents, or may a server
+leave unspecified contents in not-yet-written offsets added by a
+resize? If resize can be automatic, should the server advertise this
+capability to the client? Or should automatic resize be something the
+client must opt in to using?
+
+* Heading: Simple or structured
+- 8200- slide
+  - simple: NBD_CMD_RESIZE -> simple reply
+  - structured: NBD_CMD_RESIZE -> NBD_REPLY_CHUNK_SIZE+DONE
+
+Sometimes, the client knows when it needs more space, and wants to
+inform the server about a new requested size (this includes the case
+when resize is automatic). But even when the client requests one
+size, the server may pick a different one (due to rounding to
+granularities or to quotas). In other setups, the server can't resize
+on the fly at the request of the client, but can be resized by other
+means and will thus need a way for the client to learn whether the
+size has changed. However, returning the server's notion of the
+current size requires a structured reply; servers that lack structured
+replies would be limited to a boolean success or failure result. Is
+it worth requiring structured replies to implement a resize command?
+
+* Heading: Polling or notification
+- 8300- slide
+  - NBD_CMD_RESIZE(FLAG_NOTIFY) -> NBD_REPLY_CHUNK_RESIZE+NOT_DONE
+    -> NBD_REPLY_CHUNK_RESIZE+NOT_DONE ...
+
+If resize is automatic, or if the server supports external means for
+resizing, the client will want some way to learn the server's current
+size. The NBD protocol currently requires that all traffic be
+command/response pairs initiated by the client, with no means for the
+server to initiate a message unrequested by the client. However, as
+just mentioned, getting a size back would already require a structured
+reply, and structured replies allow the server to send back more than
+one response before declaring the response complete. Is it worth
+setting up a command flag where the client can request subsequent
+notification of size changes as an open-ended request (perhaps
+good-until-canceled), where the server can then send replies to that
+command as needed on each size change, to allow the client to have a
+means to receive events rather than having to periodically poll for
+size changes? Do we need to think about a client having to protect
+against a denial of service from a malicious server that sends too
+many responses?
+
+* Heading: Complexity tradeoffs
+- 8400-
+
+Should we specify all of the previous choices, with appropriate
+handshaking for each knob? Integration testing becomes more difficult
+the more knobs there are to test against. On the other hand,
+additional flexibility allows for more servers to support as much or
+as little as easily possible, which has already been proven a
+worthwhile model with nbdkit plugins. Requiring support for
+structured replies may be necessary for some features (such as server
+notification), but is definitely overkill for an implementation where
+polling is adequate.
+
+As with fast zeroes, the way forward will be to implement something
+that works in each of qemu, nbdkit, and libnbd, and show that they are
+interoperable, so that the NBD protocol specification can then
+document how other implementations may also interoperably add the same
+support.
+
+* conclusion: XXX
+- 9000- wrapup
+
+Thanks for your time this afternoon. We hope this has been
+informative, and welcome any questions at this time.
-- 
1.8.3.1