2021-pipelines/notes.txt

   1 Disk Image Pipelines
   2 February 15th 2021
   3 ======================================================================
   4
   5 Introduction
   6 ----------------------------------------------------------------------
   7
   8 Virt-v2v is a project for "lifting and shifting" workloads from
   9 proprietary VMware systems to open source management platforms like
  10 RHV/oVirt, OpenStack and CNV/KubeVirt.  To do this we have to copy
  11 vast amounts of data quickly, modifying it in flight.
  12
  13 Nearly everything we have to copy is a virtual machine disk image of
  14 some kind, and there are particular techniques you can use to copy
  15 these very efficiently:
  16
  17  - without copying zeroes or deleted data
  18
  19  - without making temporary copies
  20
  21  - modifying the contents in flight
  22
  23  - without touching the originals
  24
  25 To those working in the virtualization space, all the techniques I'm
  26 going to describe will be quite well-known and obvious.  But I see
  27 other projects making the same mistakes over and over.
  28
  29
  30 Simple copying
  31 ----------------------------------------------------------------------
  32
  33 Let's start with something trivial: Let's boot a virtual machine from
  34 a local disk image.  Normally I'd say use libvirt, but for this demo
  35 I'm going to get very simple and run qemu directly.
  36
  37 DIAGRAM:
  38
  39   local file -> qemu
  40
  41   qemu-system-x86_64 -machine accel=kvm:tcg -cpu host \
  42                      -m 2048 -display none \
  43                      -drive file=fedora-33.img,format=raw,if=virtio \
  44                      -serial stdio
  45
  46
  47 DIAGRAM:
  48
  49          ssh
  50   file -------> qemu
  51
  52 A lesser-known fact about qemu is that it contains an SSH client so
  53 you can boot from a remote file over SSH:
  54
  55 COMMAND:
  56
  57   qemu-system-x86_64 -machine accel=kvm:tcg -cpu host \
  58                      -m 2048 -display none \
  59                      -drive file=ssh://kool/mnt/scratch/pipes/fedora-33.img,format=raw,if=virtio \
  60                      -serial stdio
  61
  62
  63 Snapshots
  64 ----------------------------------------------------------------------
  65
  66 DIAGRAM:
  67
  68          ssh
  69   file -------> snapshot ------> qemu
  70
  71 The command I just showed you opened the remote file for writes.  If
  72 we want to prevent modifications to the remote file, we can place a
  73 snapshot into the path.  A snapshot in this case is a qcow2 file with
  74 the backing file set to the SSH URL.  Any modifications we make are
  75 saved into the snapshot.  The original disk is untouched.
  76
  77 COMMAND:
  78
  79   qemu-img create -f qcow2 -b ssh://kool/mnt/scratch/pipes/fedora-33.img -F raw snapshot.qcow2
  80   qemu-system-x86_64 -machine accel=kvm:tcg -cpu host \
  81                      -m 2048 -display none \
  82                      -drive file=snapshot.qcow2,format=qcow2,if=virtio \
  83                      -serial stdio
  84
  85
  86 Copying
  87 ----------------------------------------------------------------------
  88
  89 Instead of booting the disk, let's make a full local copy:
  90
  91 COMMAND:
  92
  93   qemu-img create -f qcow2 -b ssh://kool/mnt/scratch/pipes/fedora-33.img -F raw snapshot.qcow2
  94   qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img -p
  95
  96
  97
  98 Disk images
  99 ----------------------------------------------------------------------
 100
 101 DIAGRAM:
 102
 103   [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
 104
 105 Now let's take a side-step to talk about what's inside disk images.
 106
 107 Firstly, disk images are often HUGE.  A Blu-ray movie is 50 gigabytes,
 108 but that's really nothing compared to the larger disk images that we
 109 move about when we move to KVM.  Those can be hundreds of gigabytes or
 110 terabytes, and we move hundreds of them in a single batch.
 111
 112 But the good news is that these disk images are often quite sparse.
 113 They may contain much less actual data than the virtual size.
 114 A lot of the disk may be filled with zeroes.
 115
 116 But the bad news is that virtual machines that have been running for a
 117 long time accumulate lots of deleted files and other stuff that isn't
 118 needed by the operating system but also isn't zeroes.
 119
 120
 121 Disk metadata
 122 ----------------------------------------------------------------------
 123
 124 DIAGRAM:
 125
 126   [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
 127    < allocated >    < allocated         > < hole     >
 128               < hole >
 129
 130 What a lot of people don't know about disk images is there's another
 131 part to them - the metadata.  This records which parts of the disk
 132 image are allocated, and which parts are "holes".
 133
 134 Because less-experienced system administrators don't know about this,
 135 the metadata often gets lost when files are copied around.
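
To make this concrete, here is a small sketch of how a naive copy loses
the holes.  It assumes GNU coreutils on Linux and a filesystem that
supports sparse files; the file names are made up for illustration.

```shell
# Sketch only: assumes GNU coreutils (truncate, stat -c, dd status=none).
workdir=$(mktemp -d)

# A 64 MiB "disk image" holding 1 MiB of real data; the rest is one big hole.
truncate -s 64M "$workdir/sparse.img"
dd if=/dev/urandom of="$workdir/sparse.img" bs=1M count=1 conv=notrunc status=none

# A naive copy reads the hole as zeroes and writes every one of them out.
cat "$workdir/sparse.img" > "$workdir/naive.img"
sync

# %s is the apparent size in bytes, %b the number of blocks really allocated.
stat -c '%n: %s bytes, %b blocks allocated' "$workdir/sparse.img" "$workdir/naive.img"

orig_blocks=$(stat -c %b "$workdir/sparse.img")
naive_blocks=$(stat -c %b "$workdir/naive.img")
same_content=no
cmp -s "$workdir/sparse.img" "$workdir/naive.img" && same_content=yes
echo "same content: $same_content, blocks: $orig_blocks -> $naive_blocks"
rm -rf "$workdir"
```

On most filesystems the two files have identical content, but the copy
occupies far more disk space.  Sparse-aware options such as
"cp --sparse=always" or "rsync --sparse" avoid this.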
 136
 137 For long-running virtual machines, deleted data may often still be
 138 allocated (although this depends on how the VM is set up).
 139
 140 Some tools you can use to study the metadata of files:
 141
 142   ls -lsh
 143   filefrag
 144   qemu-img map
 145   nbdinfo --map
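
As a rough illustration of what these tools report, the sketch below
builds a small sparse file and compares its apparent size with the
space actually allocated.  It assumes GNU coreutils on Linux; the file
name is hypothetical, and the qemu-img step only runs if qemu-img
happens to be installed.

```shell
# Sketch only: assumes GNU coreutils on a filesystem with sparse file support.
workdir=$(mktemp -d)
img=$workdir/test.img

truncate -s 100M "$img"     # 100 MiB apparent size, nothing allocated yet
# Write 4 MiB of data at offset 10 MiB, leaving holes on either side.
dd if=/dev/urandom of="$img" bs=1M seek=10 count=4 conv=notrunc status=none
sync

# First column of "ls -s" is the allocated size, the later one the apparent size.
ls -lsh "$img"

apparent=$(stat -c %s "$img")
# %b = allocated blocks, %B = size in bytes of each of those blocks
allocated=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))
echo "apparent=$apparent bytes, allocated=$allocated bytes"

# Optional: show the extent map if qemu-img is available.
if command -v qemu-img >/dev/null 2>&1; then
    qemu-img map -f raw --output=json "$img" || true
fi
rm -rf "$workdir"
```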
 146
 147
 148 Sparsification
 149 ----------------------------------------------------------------------
 150
 151 DIAGRAM:
 152
 153   [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
 154    < allocated >    < allocated         > < hole     >
 155               < hole >
 156
 157            |
 158            v
 159
 160   [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
 161    < allocated >             < allocated >< hole     >
 162               < hole         >
 163
 164
 165 We can cope with both zeroes and leftover deleted data.  The technique
 166 is called "sparsification".  Some tools you can use to sparsify a disk:
 167
 168   fstrim
 169   virt-sparsify --in-place
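
Why are filesystem-aware tools needed at all?  The sketch below uses
"cp --sparse=always" as a stand-in for plain zero detection (this is
NOT how fstrim or virt-sparsify work).  It assumes GNU coreutils, and
the "deleted" region is simply simulated with random bytes.  Zero
detection recovers the zeroes, but the leftover deleted data stays
allocated because nothing at this level knows the filesystem no longer
needs it.

```shell
# Sketch only: assumes GNU coreutils; "deleted data" is simulated with random bytes.
workdir=$(mktemp -d)
img=$workdir/disk.img

{
    head -c 1048576 /dev/urandom     # 1 MiB of live data             (X)
    head -c 1048576 /dev/urandom     # 1 MiB of deleted-but-left data (D)
    head -c 14680064 /dev/zero       # 14 MiB of explicitly written zeroes
} > "$img"

# Zero detection: runs of zeroes become holes, everything else is copied.
cp --sparse=always "$img" "$workdir/sparse.img"
sync

before=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))
after=$(( $(stat -c %b "$workdir/sparse.img") * $(stat -c %B "$workdir/sparse.img") ))
identical=no
cmp -s "$img" "$workdir/sparse.img" && identical=yes
echo "allocated before=$before after=$after identical=$identical"
rm -rf "$workdir"
```

The result should still have about 2 MiB allocated, not 1 MiB.
Reclaiming the second megabyte needs something that understands the
filesystem inside the image, which is what fstrim and virt-sparsify do.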
 170
 171 Sparsification part 2
 172 ----------------------------------------------------------------------
 173
 174          ssh
 175   file -------> snapshot  <------ virt-sparsify
 176                           ------> qemu-img convert
 177                   ^
 178                   |
 179       zero clusters are saved in here
 180
 181 I'm going to take the same scenario as before, but use
 182 sparsification before doing the copy.
 183
 184 (Run these commands and show the output and ls of the snapshot)
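
The commands for this demo are not written out in these notes.  Going
by the other demos in this talk, they are probably along the lines of
the sketch below.  The helper names (run, sparsify_and_copy) and the
DRY_RUN switch are my own invention so that the plan can be printed
without qemu or the remote host being available.

```shell
# run: print a command, and execute it unless DRY_RUN=1.
run() {
    printf '+ %s\n' "$*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# sparsify_and_copy SRC_URL OUTPUT [SNAPSHOT]
# Put a qcow2 snapshot over the source, sparsify the snapshot (the
# source is never written to), then copy out only the remaining data.
sparsify_and_copy() {
    src_url=$1
    out=$2
    snap=${3:-snapshot.qcow2}
    run qemu-img create -f qcow2 -b "$src_url" -F raw "$snap" &&
    run virt-sparsify --inplace "$snap" &&
    run ls -lsh "$snap" &&
    run qemu-img convert -p -f qcow2 -O raw "$snap" "$out"
}

# Print the plan only; drop DRY_RUN=1 to really run it.
plan=$(DRY_RUN=1 sparsify_and_copy ssh://kool/mnt/scratch/pipes/fedora-33.img local.img)
printf '%s\n' "$plan"
```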
 185
 186
 187
 188 Benchmark A
 189 ----------------------------------------------------------------------
 190
 191 You might think this is all a bit obscure, so how does it apply to
 192 copying disk images?  In this first benchmark, I've compared
 193 copying a disk in several different ways to see which is fastest.  All
 194 of the copying happens between two idle machines, over a slow network.
 195
 196 The full methodology is in the background notes that accompany this
 197 talk, which I'll link at the end.
 198
 199   scp             scp remote:fedora-33.img local.img
 200
 201                        ssh
 202   sparsify        file -> qcow2 snapshot <- virt-sparsify
 203                                          -> qemu-img convert
 204
 205   without         (as above but without sparsifying)
 206   sparsify
 207
 208                        ssh
 209   nbdcopy         file -> nbdkit cow filter <- virt-sparsify
 210                                             -> nbdcopy
 211
 212 Which do you think will be faster?
 213
 214
 215 Benchmark A results
 216 ----------------------------------------------------------------------
 217
 218 (Same slides with timings added)
 219
 220 Lock contention in the cow filter is thought to be the
 221 reason for the poor performance of nbdkit + nbdcopy.
 222
 223
 224 Opening OVA files
 225 ----------------------------------------------------------------------
 226
 227 DIAGRAM:
 228
 229   guest.ova -----> tar-filter <- virt-sparsify
 230                               -> qemu-img convert
 231
 232   guest.ova------------+
 233   | guest.ovf          |
 234   | disk1.raw|vmdk     |
 235   +--------------------+
 236
 237   tar file =  header - file - header - file - ...
 238
 239 This technique isn't just useful for remote files.  Another trick we
 240 use in virt-v2v is using an nbdkit filter to unpack VMware's OVA files
 241 without any copies.  OVA files are really uncompressed tar files.  The
 242 disk inside can be in a variety of formats, often raw or VMDK.
 243
 244 We can ask the 'tar' command to give us the offset and size of the
 245 disk image within the file and simply read it out of the file
 246 directly.
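
The idea can be tried without nbdkit.  The sketch below builds a tiny
stand-in for an OVA, asks tar where the disk lives, and reads it
straight out of the archive.  It assumes GNU tar (for --block-number,
spelled -R here) and a member with a single 512-byte header, which is
why the archive is created in ustar format with short names.  The file
names are made up.

```shell
# Sketch only: assumes GNU tar and short member names (one 512-byte header each).
workdir=$(mktemp -d)

# A stand-in OVA: an uncompressed tar holding a descriptor and a "disk".
printf '<Envelope/>\n' > "$workdir/guest.ovf"
head -c 300000 /dev/urandom > "$workdir/disk1.img"
tar -C "$workdir" --format=ustar -cf "$workdir/guest.ova" guest.ovf disk1.img

# -R prints the 512-byte block at which each member's header starts;
# the member's data begins in the block after the header.
line=$(tar -tvRf "$workdir/guest.ova" | awk '$NF == "disk1.img"')
block=$(printf '%s\n' "$line" | awk '{ sub(":", "", $2); print $2 }')
size=$(printf '%s\n' "$line" | awk '{ print $5 }')
offset=$(( (block + 1) * 512 ))
echo "disk1.img: offset=$offset size=$size"

# Read the disk directly from the archive, without unpacking anything.
tail -c +"$((offset + 1))" "$workdir/guest.ova" | head -c "$size" > "$workdir/carved.img"
carved_ok=no
cmp -s "$workdir/disk1.img" "$workdir/carved.img" && carved_ok=yes
echo "carved copy matches original: $carved_ok"
rm -rf "$workdir"
```

A real pipeline would not write carved.img at all.  It would hand the
offset and size to whatever serves the disk, so the data is only read
when the consumer asks for it.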
 247
 248
 249 Benchmark B
 250 ----------------------------------------------------------------------
 251
 252   cp test.ova test2.ova
 253
 254   tar xf test.ova fedora-33.img
 255
 256
 257    nbdkit -> tar filter <- sparsify
 258                         -> qemu-img convert
 259
 260   nbdkit -f --exit-with-parent --filter=tar file test.ova tar-entry=fedora-33.img
 261   qemu-img create -f qcow2 -b nbd:localhost:10809 -F raw snapshot.qcow2
 262   virt-sparsify --inplace snapshot.qcow2
 263   qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img
 264
 265 Which is faster?
 266
 267 Benchmark B results
 268 ----------------------------------------------------------------------
 269
 270 (Same as above, with results)
 271
 272 The results are interesting, but if you remember what we said about
 273 the disk format and sparsification then it shouldn't be surprising.
 274
 275 The copy and tar commands have to churn through the entire
 276 disk image - zeroes and deleted files.
 277
 278 With nbdkit, sparsification and qemu-img convert we only copy a
 279 fraction of the data.
 280
 281 Note the two methods do NOT produce bit-for-bit equivalent outputs.
 282 Q: Is this a problem?
 283 A: No different from if the owner of the VM had run "fstrim".
 284
 285
 286 Modifications
 287 ----------------------------------------------------------------------
 288
 289 Virt-v2v doesn't only make efficient copies, it also modifies the disk
 290 image in flight.  Some kinds of modifications that are made:
 291
 292  - installing virtio drivers
 293
 294  - removing VMware tools
 295
 296  - modifying the bootloader
 297
 298  - rebuilding initramfs
 299
 300  - changing device names in /etc files
 301
 302  - changing the Windows registry
 303
 304  - (and much more)
 305
 306 These are significant modifications, and they happen entirely during
 307 the transfer, without touching the source and without making large
 308 temporary copies.
 309
 310 I'm not going to talk about this in great detail because it's a very
 311 complex topic.  Instead I will show you a simple demonstration of a
 312 similar technique.
 313
 314   qemu-img create -f qcow2 -b ssh://kool/mnt/scratch/pipes/fedora-33.img -F raw snapshot.qcow2
 315   virt-sparsify --inplace snapshot.qcow2
 316   guestfish -a snapshot.qcow2 -i write /etc/motd 'HEY, IT WORKED!'
 317   ls -lh snapshot.qcow2
 318   qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img -p
 319   virt-cat -a local.img /etc/motd
 320
 321 (Show this as a demo.  Show original untouched)
 322
 323
 324 Complete virt-v2v paths
 325 ----------------------------------------------------------------------
 326
 327 DIAGRAM:
 328
 329          proprietary
 330          transport
 331   VMware -----> nbdkit ----> nbdkit ----> qcow2
 332   ESXi          vddk         rate         snapshot
 333                 plugin       filter
 334
 335   qcow2    <---- sparsify
 336   snapshot <---- install drivers
 337            -----> qemu-img convert
 338
 339                     nbd          HTTPS
 340   qemu-img convert ----> nbdkit  -----> imageio
 341                          python
 342                          plugin
 343
 344 Discuss:
 345
 346  - separate input and output sides
 347
 348  - NBD used extensively
 349
 350  - very efficient and no large temporary copies
 351
 352  - rate filter
 353
 354  - many other tricks used
 355
 356
 357 Streaming and modifying a compressed image
 358 ----------------------------------------------------------------------
 359
 360 DIAGRAM:
 361
 362   (Screenshot from https://alt.fedoraproject.org/cloud/)
 363
 364   HTTPS
 365   -----> nbdkit-curl-plugin --> xz filter --> qcow2 snapshot
 366      <-- sparsify
 367      <-- deactivate cloud-init
 368      <-- write a file
 369      --> qemu-img convert
 370
 371 DEMO:
 372
 373   nbdkit curl https://download.fedoraproject.org/pub/fedora/linux/releases/33/Cloud/x86_64/images/Fedora-Cloud-Base-33-1.2.x86_64.raw.xz --filter=xz
 374   qemu-img create -f qcow2 -b nbd://localhost -F raw snapshot.qcow2
 375   virt-sparsify --inplace snapshot.qcow2
 376   virt-customize -a snapshot.qcow2 \
 377                  --run-command 'systemctl disable cloud-init' \
 378                  --write /hello:HELLO
 379   ls -lsh snapshot.qcow2
 380   qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img -p
 381   guestfish --ro -a local.img -i ll /
 382
 383
 384 Conclusions
 385 ----------------------------------------------------------------------
 386
 387 Disk image pipelines:
 388
 389  - efficient
 390
 391  - flexible
 392
 393  - avoid local copies
 394
 395  - avoid copying zeroes/sparseness/deleted data
 396
 397  - sparsification
 398
 399  - modifications in flight
 400
 401
 402 Future work / other topics
 403 ----------------------------------------------------------------------
 404
 405 nbdcopy vs qemu-img convert
 406
 407 copy-on-read, bounded caches
 408
 409 block size adjustment
 410
 411 reading from containers
 412
 413 stop using gzip!
 414
 415
 416 References
 417 ----------------------------------------------------------------------
 418
 419 http://git.annexia.org/?p=libguestfs-talks.git;a=tree;f=2021-pipelines
 420
 421 https://gitlab.com/nbdkit
 422
 423 https://libguestfs.org
 424
 425 https://libguestfs.org/virt-v2v.1.html