Disk Image Pipelines
February 15th 2021
======================================================================

Introduction
----------------------------------------------------------------------

Virt-v2v is a project for "lifting and shifting" workloads from
proprietary VMware systems to open source management platforms like
RHV/oVirt, OpenStack and CNV/KubeVirt.  To do this we have to copy
vast amounts of data quickly, modifying it in flight.

Nearly everything we have to copy is a virtual machine disk image of
some kind, and there are particular techniques you can use to copy
these very efficiently:

 - without copying zeroes or deleted data

 - without making temporary copies

 - modifying the contents in flight

 - without touching the originals

To those working in the virtualization space, all the techniques I'm
going to describe will be quite well-known and obvious.  But I see
other projects making the same mistakes over and over.


Simple copying
----------------------------------------------------------------------

Let's start with something trivial: booting a virtual machine from a
local disk image.  Normally I'd say use libvirt, but for this demo
I'm going to keep it very simple and run qemu directly.

DIAGRAM:

  local file -> qemu

COMMAND:

  qemu-system-x86_64 -machine accel=kvm:tcg -cpu host \
                     -m 2048 -display none \
                     -drive file=fedora-33.img,format=raw,if=virtio \
                     -serial stdio


DIAGRAM:

         ssh
  file -------> qemu

A lesser-known fact about qemu is that it contains an SSH client so
you can boot from a remote file over SSH:

COMMAND:

  qemu-system-x86_64 -machine accel=kvm:tcg -cpu host \
                     -m 2048 -display none \
                     -drive file=ssh://kool/mnt/scratch/pipes/fedora-33.img,format=raw,if=virtio \
                     -serial stdio


Snapshots
----------------------------------------------------------------------

DIAGRAM:

         ssh
  file -------> snapshot ------> qemu

The command I just showed you opened the remote file for writes.  If
we want to prevent modifications to the remote file, we can place a
snapshot into the path.  A snapshot in this case is a qcow2 file with
the backing file set to the SSH URL.  Any modifications we make are
saved into the snapshot.  The original disk is untouched.

COMMAND:

  qemu-img create -f qcow2 -b ssh://kool/mnt/scratch/pipes/fedora-33.img -F raw snapshot.qcow2
  qemu-system-x86_64 -machine accel=kvm:tcg -cpu host \
                     -m 2048 -display none \
                     -drive file=snapshot.qcow2,format=qcow2,if=virtio \
                     -serial stdio
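
The copy-on-write behaviour can be checked locally.  The following is
a minimal sketch and not part of the talk: it assumes qemu-img and
qemu-io are installed, and base.img and overlay.qcow2 are scratch
names made up for the example.  A small local raw file stands in for
the SSH URL.

```shell
# Skip if the qemu tools are missing (assumption: both are in $PATH).
if command -v qemu-img >/dev/null 2>&1 && command -v qemu-io >/dev/null 2>&1; then
    # A small raw disk stands in for the remote original.
    qemu-img create -q -f raw base.img 16M
    before=$(sha256sum < base.img)

    # The snapshot is a qcow2 file whose backing file is the original.
    qemu-img create -q -f qcow2 -b base.img -F raw overlay.qcow2

    # Write 64k of 0xff through the snapshot, as a guest might.
    qemu-io -f qcow2 -c 'write -P 0xff 0 64k' overlay.qcow2 >/dev/null

    # The checksum of the original should not have changed.
    after=$(sha256sum < base.img)
    if [ "$before" = "$after" ]; then
        echo "original untouched"
    fi

    # Show the chain: overlay.qcow2 on top of base.img.
    qemu-img info --backing-chain overlay.qcow2
else
    echo "qemu-img or qemu-io not found, skipping"
fi
```

The write lands in overlay.qcow2 only, which is why the snapshot can
be thrown away afterwards without any effect on the original.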


Copying
----------------------------------------------------------------------

Instead of booting the disk, let's make a full local copy:

COMMAND:

  qemu-img create -f qcow2 -b ssh://kool/mnt/scratch/pipes/fedora-33.img -F raw snapshot.qcow2
  qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img -p


Disk images
----------------------------------------------------------------------

DIAGRAM:

  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]

Now let's take a side-step to talk about what's inside disk images.

Firstly disk images are often HUGE.  A Blu-ray movie is 50 gigabytes,
but that's really nothing compared to the larger disk images that we
move about when we move to KVM.  Those can be hundreds of gigabytes or
terabytes, and we move hundreds of them in a single batch.

But the good news is that these disk images are often quite sparse.
They may contain much less actual data than the virtual size.
A lot of the disk may be filled with zeroes.

But the bad news is that virtual machines that have been running for a
long time accumulate lots of deleted files and other stuff that isn't
needed by the operating system but also isn't zeroes.


Disk metadata
----------------------------------------------------------------------

DIAGRAM:

  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
   < allocated >    < allocated         > < hole     >
              < hole >

What a lot of people don't know about disk images is there's another
part to them - the metadata.  This records which parts of the disk
image are allocated, and which parts are "holes".

Because less-experienced system administrators don't know about this,
the metadata often gets lost when files are copied around.
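
Losing the metadata is easy to reproduce.  This is a small sketch
assuming GNU coreutils on Linux and a filesystem that supports holes;
the file names are made up for the example.  A plain dd stands in for
any tool that streams every byte, and cp --sparse=always stands in for
a copy that is aware of holes.

```shell
# A 32 MiB file with 1 MiB of data in the middle; the rest is a hole.
truncate -s 32M sparse.img
dd if=/dev/urandom of=sparse.img bs=1M seek=8 count=1 conv=notrunc status=none

# A byte-for-byte copy reads the holes as zeroes and writes them all out.
dd if=sparse.img of=dense-copy.img bs=1M status=none

# A sparse-aware copy recreates the holes in the destination.
cp --sparse=always sparse.img sparse-copy.img

# First column: blocks actually allocated.  Size column: the same for all.
ls -ls sparse.img dense-copy.img sparse-copy.img
```

All three files have identical contents.  Only the allocation differs,
which is exactly the information that gets lost.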

For long-running virtual machines, deleted data may often still be
allocated (although this depends on how the VM is set up).

Some tools you can use to study the metadata of files:

  ls -lsh
  filefrag
  qemu-img map
  nbdinfo --map
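
As a rough illustration of what these tools report, the sketch below
builds a mostly empty image and prints its virtual and allocated
sizes.  It assumes GNU coreutils; disk.img is a made-up scratch name,
and the qemu-img step only runs if qemu-img happens to be installed.

```shell
# 64 MiB virtual size, 1 MiB of real data at offset 16 MiB.
truncate -s 64M disk.img
dd if=/dev/urandom of=disk.img bs=1M seek=16 count=1 conv=notrunc status=none

# ls -lsh: the first column is allocated space, the size column is
# the virtual size.
ls -lsh disk.img

# The same two numbers in bytes, via GNU stat.
virtual=$(stat -c %s disk.img)
allocated=$(( $(stat -c %b disk.img) * $(stat -c %B disk.img) ))
echo "virtual=$virtual allocated=$allocated"

# qemu-img map lists the extents that contain data.
if command -v qemu-img >/dev/null 2>&1; then
    qemu-img map -f raw --output=json disk.img
fi
```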


Sparsification
----------------------------------------------------------------------

DIAGRAM:

  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
   < allocated >    < allocated         > < hole     >
              < hole >

           |
           v

  [ XXXXXXXXXX .... DDDDDDDD XXXXXXXXXXXX .......... ]
   < allocated >             < allocated >< hole     >
              < hole         >


We can cope with both lost holes and leftover deleted data.  The
technique is called "sparsification".  Some tools you can use to
sparsify a disk are:

  fstrim
  virt-sparsify --in-place
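
Both tools need a real guest filesystem to work on.  As a much
simpler stand-in, the sketch below uses fallocate --dig-holes from
util-linux, which only turns runs of zero bytes into holes.  Unlike
fstrim and virt-sparsify it knows nothing about filesystems, so it
cannot find deleted data.  It assumes Linux and a filesystem that
supports hole punching; zeroes.img is a made-up scratch name.

```shell
# 8 MiB of explicitly written zeroes followed by 1 MiB of data.
dd if=/dev/zero of=zeroes.img bs=1M count=8 status=none
dd if=/dev/urandom of=zeroes.img bs=1M seek=8 count=1 conv=notrunc status=none
sum_before=$(sha256sum < zeroes.img)
ls -ls zeroes.img

# Turn runs of zero bytes into holes, in place.
fallocate --dig-holes zeroes.img || echo "hole punching not supported here"

# The contents are unchanged; only the allocation (first column) differs.
sum_after=$(sha256sum < zeroes.img)
ls -ls zeroes.img
```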


Sparsification part 2
----------------------------------------------------------------------

DIAGRAM:

         ssh
  file -------> snapshot  <------ virt-sparsify
                          ------> qemu-img convert
                  ^
                  |
      zero clusters are saved in here

I'm going to take the same scenario as before, but use
sparsification before doing the copy.

(Run these commands and show the output and ls of the snapshot)
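
The commands themselves are not written out in these notes.  Going by
the diagram and by the commands used elsewhere in the talk, the
sequence is probably close to the sketch below.  The function name
sparsify_and_copy is made up for the example, and "-F raw" assumes
the original is a raw image.

```shell
# Hypothetical helper: sparsify_and_copy BACKING_URL OUTPUT
sparsify_and_copy() {
    backing=$1
    output=$2

    # Put a snapshot on top of the (possibly remote) original.
    qemu-img create -f qcow2 -b "$backing" -F raw snapshot.qcow2 || return 1

    # Trimmed blocks are recorded as zero clusters in the snapshot.
    virt-sparsify --in-place snapshot.qcow2 || return 1

    # Show how little the snapshot itself had to store.
    ls -lsh snapshot.qcow2 || return 1

    # Copy through the snapshot, skipping everything that is now zero.
    qemu-img convert -f qcow2 snapshot.qcow2 -O raw "$output" -p
}
```

For the scenario above it would be called as:

  sparsify_and_copy ssh://kool/mnt/scratch/pipes/fedora-33.img local.img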


Benchmark A
----------------------------------------------------------------------

Now you might think this is all a bit obscure, but is it any good?
In this first benchmark, I've compared copying a disk in several
different ways to see which is fastest.  All of the copying happens
between two idle machines, over a slow network.

The full methodology is in the background notes that accompany this
talk, which I'll link at the end.

  scp             scp remote:fedora-33.img local.img

                       ssh
  sparsify        file -> qcow2 snapshot <- virt-sparsify
                                         -> qemu-img convert

  without         (as above but without sparsifying)
  sparsify

                       ssh
  nbdcopy         file -> nbdkit cow filter <- virt-sparsify
                                            -> nbdcopy

Which do you think will be faster?


Benchmark A results
----------------------------------------------------------------------

(Same slides with timings added)

Lock contention in the cow filter is thought to be the
reason for the poor performance of nbdkit + nbdcopy.


Opening OVA files
----------------------------------------------------------------------

DIAGRAM:

  guest.ova -----> tar-filter <- virt-sparsify
                              -> qemu-img convert

  guest.ova------------+
  | guest.ovf          |
  | disk1.raw|vmdk     |
  +--------------------+

  tar file =  header - file - header - file - ...

This technique isn't just useful for remote files.  Another trick we
use in virt-v2v is using an nbdkit filter to unpack VMware's OVA files
without any copies.  OVA files are really uncompressed tar files.  The
disk inside can be in a variety of formats, often raw or VMDK.

We can ask the 'tar' command to give us the offset and size of the
disk image within the file and simply read it out of the file
directly.
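
The offset trick can be tried with nothing more than GNU tar.  This is
only an illustration of the idea under that assumption, not the code
virt-v2v or nbdkit use.  It builds a tiny stand-in OVA using the file
names from the diagram, asks tar for the block number of the disk's
header, and reads the disk back out at the computed offset.

```shell
# Build a tiny stand-in OVA: an OVF descriptor plus a 64 KiB "disk".
printf '<Envelope/>\n' > guest.ovf
dd if=/dev/urandom of=disk1.raw bs=64k count=1 status=none
tar -cf guest.ova guest.ovf disk1.raw

# GNU tar -R prints the 512-byte block number of each member's header.
# The member's data starts one block after its header.
entry=$(tar --numeric-owner -tRvf guest.ova |
        awk '$NF == "disk1.raw" { sub(":", "", $2); print ($2 + 1) * 512, $5 }')
offset=${entry% *}
size=${entry#* }
echo "disk1.raw: offset=$offset size=$size"

# Read the disk straight out of the tar file, without unpacking it.
tail -c +"$((offset + 1))" guest.ova | head -c "$size" > extracted.raw
cmp disk1.raw extracted.raw && echo "extracted copy matches"
```

Once the offset and size are known, nothing needs to be unpacked: a
reader only has to present that byte range of the OVA as the disk.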


Benchmark B
----------------------------------------------------------------------

  cp test.ova test2.ova

  tar xf test.ova fedora-33.img


   nbdkit -> tar filter <- sparsify
                        -> qemu-img convert

  nbdkit -f --exit-with-parent --filter=tar file test.ova tar-entry=fedora-33.img
  qemu-img create -f qcow2 -b nbd:localhost:10809 -F raw snapshot.qcow2
  virt-sparsify --inplace snapshot.qcow2
  qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img

Which is faster?


Benchmark B results
----------------------------------------------------------------------

(Same as above, with results)

The results are interesting, but if you remember what we said about
the disk format and sparsification then it shouldn't be surprising.

The copy and tar commands have to churn through the entire
disk image - zeroes and deleted files.

With nbdkit, sparsification and qemu-img convert we only copy a
fraction of the data.

Note the two methods do NOT produce bit-for-bit equivalent outputs.
Q: Is this a problem?
A: No different from if the owner of the VM had run "fstrim".


Modifications
----------------------------------------------------------------------

Virt-v2v doesn't only make efficient copies, it also modifies the disk
image in flight.  Some kinds of modifications that are made:

 - installing virtio drivers

 - removing VMware tools

 - modifying the bootloader

 - rebuilding initramfs

 - changing device names in /etc files

 - changing the Windows registry

 - (and much more)

These are significant modifications, and they happen entirely during
the transfer, without touching the source and without making large
temporary copies.

I'm not going to talk about this in great detail because it's a very
complex topic.  Instead I will show you a simple demonstration of a
similar technique.

DIAGRAM:

  (Screenshot from https://alt.fedoraproject.org/cloud/)

  HTTPS
  -----> nbdkit-curl-plugin --> xz filter --> qcow2 snapshot
     <-- sparsify
     <-- deactivate cloud-init
     <-- write a file
     --> qemu-img convert

DEMO:

  nbdkit curl https://download.fedoraproject.org/pub/fedora/linux/releases/33/Cloud/x86_64/images/Fedora-Cloud-Base-33-1.2.x86_64.raw.xz --filter=xz
  qemu-img create -f qcow2 -b nbd://localhost -F raw snapshot.qcow2
  virt-sparsify --inplace snapshot.qcow2
  virt-customize -a snapshot.qcow2 \
                 --run-command 'systemctl disable cloud-init' \
                 --write /hello:HELLO
  ls -lsh snapshot.qcow2
  qemu-img convert -f qcow2 snapshot.qcow2 -O raw local.img -p
  guestfish --ro -a local.img -i ll /


Complete virt-v2v paths
----------------------------------------------------------------------

DIAGRAM:

         proprietary
         transport
  VMware -----> nbdkit ----> nbdkit ----> qcow2
  ESXi          vddk         rate         snapshot
                plugin       filter

  qcow2    <---- sparsify
  snapshot <---- install drivers
           -----> qemu-img convert

                    nbd          HTTPS
  qemu-img convert ----> nbdkit  -----> imageio
                         python
                         plugin

Discuss:

 - separate input and output sides

 - NBD used extensively

 - very efficient and no large temporary copies

 - rate filter

 - many other tricks used


Conclusions
----------------------------------------------------------------------

Disk image pipelines:

 - efficient

 - flexible

 - avoid local copies

 - avoid copying zeroes/sparseness/deleted data

 - sparsification

 - modifications in flight


Future work / other topics
----------------------------------------------------------------------

nbdcopy vs qemu-img convert

copy-on-read, bounded caches

block size adjustment

reading from containers

stop using gzip!


References
----------------------------------------------------------------------

http://git.annexia.org/?p=libguestfs-talks.git;a=tree;f=2021-pipelines

https://gitlab.com/nbdkit

https://libguestfs.org

https://libguestfs.org/virt-v2v.1.html