Disk Image Pipelines
February 15th 2021
======================================================================

Introduction
----------------------------------------------------------------------

Virt-v2v is a project for "lifting and shifting" workloads from
proprietary VMware systems to open source management platforms like
RHV/oVirt, OpenStack and CNV/KubeVirt.  To do this we have to copy
vast amounts of data quickly, modifying it in flight.  Nearly
everything we have to copy is a virtual machine disk image of some
kind, and there are particular techniques you can use to copy these
very efficiently:

 - without copying zeroes or deleted data
 - without making temporary copies
 - modifying the contents in flight
 - without touching the originals

To those working in the virtualization space, all the techniques I'm
going to describe will be quite well known and obvious.  But I see
other projects making the same mistakes over and over.

Simple copying
----------------------------------------------------------------------

Let's start with something trivial: booting a virtual machine from a
local disk image.  Normally I'd say use libvirt, but for this demo I'm
going to keep things very simple and run qemu directly.

DIAGRAM:
  local file -------> qemu

COMMAND:
  qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -m 2048 -display none \
    -drive file=fedora-33.img,format=raw,if=virtio \
    -serial stdio

DIAGRAM:
  ssh file -------> qemu

A lesser-known fact about qemu is that it contains an SSH client, so
you can boot from a remote file over SSH:

COMMAND:
  qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -m 2048 -display none \
    -drive file=ssh://kool/mnt/scratch/pipes/fedora-33.img,format=raw,if=virtio \
    -serial stdio

Snapshots
----------------------------------------------------------------------

DIAGRAM:
  ssh file -------> snapshot ------> qemu

The command I just showed you opened the remote file for writes.  If
we want to prevent modifications to the remote file, we can place a
snapshot into the path.
A snapshot in this case is a qcow2 file with the backing file set to
the SSH URL.  Any modifications we make are saved into the snapshot.
The original disk is untouched.

COMMAND:
  qemu-img create -f qcow2 \
    -b ssh://kool/mnt/scratch/pipes/fedora-33.img snapshot.qcow2
  qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -m 2048 -display none \
    -drive file=snapshot.qcow2,format=qcow2,if=virtio \
    -serial stdio

Copying
----------------------------------------------------------------------

Instead of booting the disk, let's make a full local copy:

COMMAND:
  qemu-img create -f qcow2 \
    -b ssh://kool/mnt/scratch/pipes/fedora-33.img snapshot.qcow2
  qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img -p

Disk images
----------------------------------------------------------------------

DIAGRAM:
  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]

Now let's take a side-step to talk about what's inside disk images.

Firstly, disk images are often HUGE.  A Blu-ray movie is 50 gigabytes,
but that's really nothing compared to the larger disk images we move
about when migrating to KVM.  Those can be hundreds of gigabytes or
terabytes, and we move hundreds of them in a single batch.

The good news is that these disk images are often quite sparse.  They
may contain much less actual data than their virtual size.  A lot of
the disk may be filled with zeroes.

The bad news is that virtual machines which have been running for a
long time accumulate lots of deleted files and other stuff that isn't
needed by the operating system but also isn't zeroes.

Disk metadata
----------------------------------------------------------------------

DIAGRAM:
  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
    <allocated><hole><------allocated----><--hole-->

What a lot of people don't know about disk images is that there's
another part to them - the metadata.  This records which parts of the
disk image are allocated, and which parts are "holes".
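You can see allocation metadata in action with nothing more than
coreutils.  This is a minimal sketch (the filename sparse.img is just
an example): we create a file with a large virtual size, write a few
bytes of real data in the middle, and compare the virtual size against
the blocks actually allocated on disk.

```shell
# Create a file with a 1G virtual size but (almost) no allocated blocks.
truncate -s 1G sparse.img

# Write a little real data in the middle, leaving holes either side.
printf DATA | dd of=sparse.img bs=1M seek=512 conv=notrunc

# 'ls -l' reports the virtual size; 'ls -s' and 'du' report the blocks
# actually allocated on disk, which remain tiny.
ls -lsh sparse.img
du -h sparse.img
```

On any filesystem that supports sparse files, ls shows a 1G file that
du says occupies only a few kilobytes - the rest is holes recorded in
the metadata.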
Because less-experienced system administrators don't know about this,
the metadata often gets lost when files are copied around.  For
long-running virtual machines, deleted data may often still be
allocated (although this depends on how the VM is set up).

Some tools you can use to study the metadata of files:

  ls -lsh
  filefrag
  qemu-img map
  nbdinfo --map

Sparsification
----------------------------------------------------------------------

DIAGRAM:
  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
    <allocated><hole><------allocated----><--hole-->
                        |
                        v
  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
    <allocated><------hole---><allocated-><--hole-->

We can cope with both of these problems - the zeroes and the deleted
data.  The technique is called "sparsification".  Some tools you can
use to sparsify a disk are:

  fstrim
  virt-sparsify --in-place

Sparsification part 2
----------------------------------------------------------------------

DIAGRAM:
  ssh file -------> snapshot <------ virt-sparsify
                        |
                        +--------> qemu-img convert
                        ^
                        |
          zero clusters are saved in here

I'm going to take the same scenario as before, but use sparsification
before doing the copy.

(Run these commands and show the output and ls of the snapshot)

Benchmark A
----------------------------------------------------------------------

Now you might think this is all a bit obscure, but is it any good?  In
this first benchmark I've compared copying a disk in several different
ways to see which is fastest.  All of the copying happens between two
idle machines, over a slow network.  The full methodology is in the
background notes that accompany this talk, which I'll link at the end.

  scp:
    scp remote:fedora-33.img local.img

  ssh + sparsify:
    file -> qcow2 snapshot <- virt-sparsify -> qemu-img convert

  without sparsify:
    (as above but without sparsifying)

  ssh + nbdcopy:
    file -> nbdkit cow filter <- virt-sparsify -> nbdcopy

Which do you think will be fastest?
Benchmark A results
----------------------------------------------------------------------

(Same slides with timings added)

Lock contention in the cow filter is thought to be the reason for the
poor performance of nbdkit + nbdcopy.

Opening OVA files
----------------------------------------------------------------------

DIAGRAM:
  guest.ova -----> tar filter <- virt-sparsify -> qemu-img convert

  guest.ova -----------+
  | guest.ovf          |
  | disk1.raw|vmdk     |
  +--------------------+

  tar file = header - file - header - file - ...

This technique isn't just useful for remote files.  Another trick we
use in virt-v2v is using an nbdkit filter to unpack VMware's OVA files
without making any copies.

OVA files are really uncompressed tar files.  The disk inside can be
in a variety of formats, often raw or VMDK.  We can ask the 'tar'
command to give us the offset and size of the disk image within the
file, and simply read it out of the file directly.

Benchmark B
----------------------------------------------------------------------

Method 1: copy with cp and unpack with tar:

  cp test.ova test2.ova
  tar xf test.ova fedora-33.img

Method 2: nbdkit -> tar filter <- sparsify -> qemu-img convert:

  nbdkit -f --exit-with-parent --filter=tar file test.ova \
    tar-entry=fedora-33.img
  qemu-img create -f qcow2 -b nbd:localhost:10809 snapshot.qcow2
  virt-sparsify --in-place snapshot.qcow2
  qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img

Which is faster?

Benchmark B results
----------------------------------------------------------------------

(Same as above, with results)

The results are interesting, but if you remember what we said about
the disk format and sparsification then they shouldn't be surprising.
The cp and tar commands have to churn through the entire disk image -
zeroes, deleted files and all.  With nbdkit, sparsification and
qemu-img convert we only copy a fraction of the data.

Note that the two methods do NOT produce bit-for-bit equivalent
outputs.

Q: Is this a problem?
A: It's no different from the owner of the VM having run "fstrim".
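The "ask tar for the offset" trick above can be demonstrated with
plain GNU tar and dd (a minimal sketch; the file names are made up,
and --block-number is a GNU tar extension).  tar's -R option prints
the 512-byte record holding each member's header; the member's data
starts in the record immediately after it.

```shell
# Build a tiny "OVA" - really just an uncompressed tar file.
printf 'hello disk' > disk1.raw
tar -cf guest.ova disk1.raw

# -R (--block-number) prints the record number of each member's header.
# disk1.raw's header is record 0, so its data starts at record 1,
# i.e. byte offset 512.
tar -tRvf guest.ova

# Read the member straight out of the archive at that offset,
# without unpacking anything.
dd if=guest.ova bs=512 skip=1 count=1 2>/dev/null | head -c 10
# prints "hello disk"
```

This is essentially what the nbdkit tar filter automates: serve a
byte range of the archive as if it were a standalone disk image.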
Modifications
----------------------------------------------------------------------

Virt-v2v doesn't only make efficient copies, it also modifies the disk
image in flight.  Some kinds of modifications that are made:

 - installing virtio drivers
 - removing VMware tools
 - modifying the bootloader
 - rebuilding the initramfs
 - changing device names in /etc files
 - changing the Windows registry
 - (and much more)

These are significant modifications, and they happen entirely during
the transfer, without touching the source and without making large
temporary copies.

I'm not going to talk about this in great detail because it's a very
complex topic.  Instead I will show you a simple demonstration of a
similar technique.

DIAGRAM:
  (Screenshot from https://alt.fedoraproject.org/cloud/)

  HTTPS -----> nbdkit-curl-plugin --> xz filter --> qcow2 snapshot
                                                        <-- sparsify
                                                        <-- deactivate cloud-init
                                                        <-- write a file
                                                        --> qemu-img convert

DEMO:
  nbdkit curl https://download.fedoraproject.org/pub/fedora/linux/releases/33/Cloud/x86_64/images/Fedora-Cloud-Base-33-1.2.x86_64.raw.xz \
    --filter=xz
  qemu-img create -f qcow2 -b nbd://localhost -F raw snapshot.qcow2
  virt-sparsify --in-place snapshot.qcow2
  virt-customize -a snapshot.qcow2 \
    --run-command 'systemctl disable cloud-init' \
    --write /hello:HELLO
  ls -lsh snapshot.qcow2
  qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img -p
  guestfish --ro -a local.img -i ll /

Complete virt-v2v pipelines
----------------------------------------------------------------------

DIAGRAM:
           proprietary transport
  VMware  -----------------------> nbdkit ----> nbdkit ----> qcow2
  ESXi                             vddk         rate         snapshot
                                   plugin       filter

  qcow2    <---- sparsify
  snapshot <---- install drivers
           ----> qemu-img convert

                      nbd                       HTTPS
  qemu-img convert ---------> nbdkit ---------> imageio
                              python
                              plugin

Discuss:
 - separate input and output sides
 - NBD used extensively
 - very efficient, with no large temporary copies
 - virt-v2v may be on a separate machine
 - rate filter
 - many other tricks used

Conclusions
----------------------------------------------------------------------

Disk image pipelines:

 - efficient
 - flexible
 - avoid local copies
 - avoid copying zeroes/sparseness/deleted data
 - sparsification
 - modifications in flight

Future work / other topics
----------------------------------------------------------------------

 - nbdcopy vs qemu-img convert
 - copy-on-read, bounded caches
 - block size adjustment
 - reading from containers
 - stop using gzip!

References
----------------------------------------------------------------------

http://git.annexia.org/?p=libguestfs-talks.git;a=tree;f=2021-pipelines
https://gitlab.com/nbdkit
https://libguestfs.org
https://libguestfs.org/virt-v2v.1.html