Disk Image Pipelines
February 15th 2021
======================================================================

Introduction
----------------------------------------------------------------------

Virt-v2v is a project for "lifting and shifting" workloads from
proprietary VMware systems to open source management platforms like
RHV/oVirt, OpenStack and CNV/KubeVirt.  To do this we have to copy
vast amounts of data quickly, modifying it in flight.  Nearly
everything we have to copy is a virtual machine disk image of some
kind, and there are particular techniques you can use to copy these
very efficiently:

 - without copying zeroes or deleted data
 - without making temporary copies
 - modifying the contents in flight
 - without touching the originals

To those working in the virtualization space, all the techniques I'm
going to describe will be quite well known and obvious.  But I see
other projects making the same mistakes over and over.

Simple copying
----------------------------------------------------------------------

Let's start with something trivial: booting a virtual machine from a
local disk image.  Normally I'd say use libvirt, but for this demo I'm
going to keep things very simple and run qemu directly.

DIAGRAM:
  local file -------> qemu

COMMAND:
  qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -m 2048 -display none \
    -drive file=fedora-33.img,format=raw,if=virtio \
    -serial stdio

DIAGRAM:
  ssh file -------> qemu

A lesser-known fact about qemu is that it contains an SSH client, so
you can boot from a remote file over SSH:

COMMAND:
  qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -m 2048 -display none \
    -drive file=ssh://kool/mnt/scratch/pipes/fedora-33.img,format=raw,if=virtio \
    -serial stdio

Snapshots
----------------------------------------------------------------------

DIAGRAM:
  ssh file -------> snapshot ------> qemu

The command I just showed you opened the remote file for writes.  If
we want to prevent modifications to the remote file, we can place a
snapshot into the path.
A snapshot in this case is a qcow2 file with the backing file set to
the SSH URL.  Any modifications we make are saved into the snapshot.
The original disk is untouched.

COMMAND:
  qemu-img create -f qcow2 \
    -b ssh://kool/mnt/scratch/pipes/fedora-33.img snapshot.qcow2
  qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -m 2048 -display none \
    -drive file=snapshot.qcow2,format=qcow2,if=virtio \
    -serial stdio

Copying
----------------------------------------------------------------------

Instead of booting the disk, let's make a full local copy:

COMMAND:
  qemu-img create -f qcow2 \
    -b ssh://kool/mnt/scratch/pipes/fedora-33.img snapshot.qcow2
  qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img -p

Disk images
----------------------------------------------------------------------

DIAGRAM:
  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]

Now let's take a side-step to talk about what's inside disk images.

Firstly, disk images are often HUGE.  A Blu-ray movie is 50 gigabytes,
but that's really nothing compared to the larger disk images we move
about when migrating to KVM.  Those can be hundreds of gigabytes or
terabytes, and we move hundreds of them in a single batch.

The good news is that these disk images are often quite sparse.  They
may contain much less actual data than their virtual size.  A lot of
the disk may be filled with zeroes.

The bad news is that virtual machines which have been running for a
long time accumulate lots of deleted files and other stuff that isn't
needed by the operating system but also isn't zeroes.

Disk metadata
----------------------------------------------------------------------

DIAGRAM:
  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
    <allocated><hole><------allocated----><--hole-->

What a lot of people don't know about disk images is that there's
another part to them - the metadata.  This records which parts of the
disk image are allocated, and which parts are "holes".
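You can see allocation metadata in action with nothing more than
coreutils.  This is a minimal sketch (the filename sparse.img is just
an example): we create a file with a large virtual size, write a few
bytes of real data in the middle, and compare the virtual size against
the blocks actually allocated on disk.

```shell
# Create a file with a 1G virtual size but (almost) no allocated blocks.
truncate -s 1G sparse.img

# Write a little real data in the middle, leaving holes either side.
printf DATA | dd of=sparse.img bs=1M seek=512 conv=notrunc

# 'ls -l' reports the virtual size; 'ls -s' and 'du' report the blocks
# actually allocated on disk, which remain tiny.
ls -lsh sparse.img
du -h sparse.img
```

On any filesystem that supports sparse files, ls shows a 1G file that
du says occupies only a few kilobytes - the rest is holes recorded in
the metadata.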
Because less-experienced system administrators don't know about this,
the metadata often gets lost when files are copied around.  For
long-running virtual machines, deleted data may often still be
allocated (although this depends on how the VM is set up).

Some tools you can use to study the metadata of files:

  ls -lsh
  filefrag
  qemu-img map
  nbdinfo --map

Sparsification
----------------------------------------------------------------------

DIAGRAM:
  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
    <allocated><hole><------allocated----><--hole-->
                        |
                        v
  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
    <allocated><------hole---><allocated-><--hole-->

We can cope with both of these problems - the zeroes and the deleted
data.  The technique is called "sparsification".  Some tools you can
use to sparsify a disk are:

  fstrim
  virt-sparsify --in-place

Sparsification part 2
----------------------------------------------------------------------

DIAGRAM:
  ssh file -------> snapshot <------ virt-sparsify
                        |
                        +--------> qemu-img convert
                        ^
                        |
          zero clusters are saved in here

I'm going to take the same scenario as before, but use sparsification
before doing the copy.

(Run these commands and show the output and ls of the snapshot)

Benchmark A
----------------------------------------------------------------------

Now you might think this is all a bit obscure, but is it any good?  In
this first benchmark I've compared copying a disk in several different
ways to see which is fastest.  All of the copying happens between two
idle machines, over a slow network.  The full methodology is in the
background notes that accompany this talk, which I'll link at the end.

  scp:
    scp remote:fedora-33.img local.img

  ssh + sparsify:
    file -> qcow2 snapshot <- virt-sparsify -> qemu-img convert

  without sparsify:
    (as above but without sparsifying)

  ssh + nbdcopy:
    file -> nbdkit cow filter <- virt-sparsify -> nbdcopy

Which do you think will be fastest?
Benchmark A results
----------------------------------------------------------------------

(Same slides with timings added)

Lock contention in the cow filter is thought to be the reason for the
poor performance of nbdkit + nbdcopy.

Opening OVA files
----------------------------------------------------------------------

DIAGRAM:
  guest.ova -----> tar filter <- virt-sparsify -> qemu-img convert

  guest.ova -----------+
  | guest.ovf          |
  | disk1.raw|vmdk     |
  +--------------------+

  tar file = header - file - header - file - ...

This technique isn't just useful for remote files.  Another trick we
use in virt-v2v is using an nbdkit filter to unpack VMware's OVA files
without making any copies.

OVA files are really uncompressed tar files.  The disk inside can be
in a variety of formats, often raw or VMDK.  We can ask the 'tar'
command to give us the offset and size of the disk image within the
file, and simply read it out of the file directly.

Benchmark B
----------------------------------------------------------------------

Method 1: copy with cp and unpack with tar:

  cp test.ova test2.ova
  tar xf test.ova fedora-33.img

Method 2: nbdkit -> tar filter <- sparsify -> qemu-img convert:

  nbdkit -f --exit-with-parent --filter=tar file test.ova \
    tar-entry=fedora-33.img
  qemu-img create -f qcow2 -b nbd:localhost:10809 snapshot.qcow2
  virt-sparsify --in-place snapshot.qcow2
  qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img

Which is faster?

Benchmark B results
----------------------------------------------------------------------

(Same as above, with results)

The results are interesting, but if you remember what we said about
the disk format and sparsification then they shouldn't be surprising.
The cp and tar commands have to churn through the entire disk image -
zeroes, deleted files and all.  With nbdkit, sparsification and
qemu-img convert we only copy a fraction of the data.

Note that the two methods do NOT produce bit-for-bit equivalent
outputs.

Q: Is this a problem?
A: It's no different from the owner of the VM having run "fstrim".
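The "ask tar for the offset" trick above can be demonstrated with
plain GNU tar and dd (a minimal sketch; the file names are made up,
and --block-number is a GNU tar extension).  tar's -R option prints
the 512-byte record holding each member's header; the member's data
starts in the record immediately after it.

```shell
# Build a tiny "OVA" - really just an uncompressed tar file.
printf 'hello disk' > disk1.raw
tar -cf guest.ova disk1.raw

# -R (--block-number) prints the record number of each member's header.
# disk1.raw's header is record 0, so its data starts at record 1,
# i.e. byte offset 512.
tar -tRvf guest.ova

# Read the member straight out of the archive at that offset,
# without unpacking anything.
dd if=guest.ova bs=512 skip=1 count=1 2>/dev/null | head -c 10
# prints "hello disk"
```

This is essentially what the nbdkit tar filter automates: serve a
byte range of the archive as if it were a standalone disk image.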
Modifications
----------------------------------------------------------------------

Virt-v2v doesn't only make efficient copies, it also modifies the disk
image in flight.  Some kinds of modifications that are made:

 - installing virtio drivers
 - removing VMware tools
 - modifying the bootloader
 - rebuilding the initramfs
 - changing device names in /etc files
 - changing the Windows registry
 - (and much more)

These are significant modifications, and they happen entirely during
the transfer, without touching the source and without making large
temporary copies.

I'm not going to talk about this in great detail because it's a very
complex topic.  Instead I will show you a simple demonstration of a
similar technique.

DIAGRAM:
  (Screenshot from https://alt.fedoraproject.org/cloud/)

  HTTPS -----> nbdkit-curl-plugin --> xz filter --> qcow2 snapshot
                                                        <-- sparsify
                                                        <-- deactivate cloud-init
                                                        <-- write a file
                                                        --> qemu-img convert

DEMO:
  nbdkit curl https://download.fedoraproject.org/pub/fedora/linux/releases/33/Cloud/x86_64/images/Fedora-Cloud-Base-33-1.2.x86_64.raw.xz \
    --filter=xz
  qemu-img create -f qcow2 -b nbd://localhost -F raw snapshot.qcow2
  virt-sparsify --in-place snapshot.qcow2
  virt-customize -a snapshot.qcow2 \
    --run-command 'systemctl disable cloud-init' \
    --write /hello:HELLO
  ls -lsh snapshot.qcow2
  qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img -p
  guestfish --ro -a local.img -i ll /

Complete virt-v2v pipelines
----------------------------------------------------------------------

DIAGRAM:
           proprietary transport
  VMware  -----------------------> nbdkit ----> nbdkit ----> qcow2
  ESXi                             vddk         rate         snapshot
                                   plugin       filter

  qcow2    <---- sparsify
  snapshot <---- install drivers
           ----> qemu-img convert

                      nbd                       HTTPS
  qemu-img convert ---------> nbdkit ---------> imageio
                              python
                              plugin

Discuss:
 - separate input and output sides
 - NBD used extensively
 - very efficient, with no large temporary copies
 - virt-v2v may be on a separate machine
 - rate filter
 - many other tricks used

Conclusions
----------------------------------------------------------------------

Disk image pipelines:

 - efficient
 - flexible
 - avoid local copies
 - avoid copying zeroes/sparseness/deleted data
 - sparsification
 - modifications in flight

Future work / other topics
----------------------------------------------------------------------

 - nbdcopy vs qemu-img convert
 - copy-on-read, bounded caches
 - block size adjustment
 - reading from containers
 - stop using gzip!

References
----------------------------------------------------------------------

http://git.annexia.org/?p=libguestfs-talks.git;a=tree;f=2021-pipelines
https://gitlab.com/nbdkit
https://libguestfs.org
https://libguestfs.org/virt-v2v.1.html