2016-eng-talk/talk.txt

   1 Good afternoon everyone.
   2
   3 My name is Richard Jones and I work for Red Hat on a suite of tools we
   4 call the "virt tools" and libguestfs for manipulating disk images and
   5 virtual machines.
   6
   7 Today I'm going to talk about how long it takes to boot up and shut
   8 down virtual machines.
   9
  10 It's "common knowledge" (in quotes) that full virtualization is slow
  11 and heavyweight, whereas containers (which are just a chroot, really)
  12 are fast and lightweight.
  13
  14   SLIDE: Start and stop times
  15
  16 Here is a slide which I clipped from a random presentation on the
  17 internet.  The times are nonsense and not backed up by any evidence in
  18 the rest of the presentation, but it's typical of the sort of
  19 information circulating.
  20
  21   SLIDE: Clear containers logo
  22
  23 This all changed when Intel announced their project called "Clear
  24 Containers".  They had a downloadable demonstration which showed a
  25 full VM booting to a login prompt in 150ms, and using about 20MB of
  26 memory.
  27
  28 Today I'll try to persuade you that:
  29
  30   SLIDE: Performance brings security
  31
  32 performance on a par with Intel Clear Containers brings security, and
  33 new opportunities (new places) where we can use full virtualization.
  34 If we can wrap Docker containers in VMs, we can make them secure.  But
  35 we can only do that if the overhead is low enough that we don't lose
  36 the density and performance advantages of containers.  And there are
  37 new areas where high performance full virtualization makes sense,
  38 particularly sandboxing individual applications, and desktop
  39 environments similar to Qubes.
  40
  41 Intel's Clear Containers demo is an existence proof that it can be
  42 done.
  43
  44 However ... there are shortcomings at the moment.  The original demo
  45 used kvmtool not QEMU.  And it used a heavily customized Linux kernel.
  46 Can we do the same thing with QEMU and a stock Linux distribution
  47 kernel?
  48
  49   SLIDE: No
  50
  51   SLIDE: No, but we can do quite well
  52
  53 It should be possible to bring boot times down to around 500-600ms
  54 without any gross hacks and without an excessive amount of effort.
  55
  56 The first step to curing the problem is measuring it.
  57
  58   SLIDE: qemu -kernel boot
  59
  60 This is how a Linux appliance boots.  I should say at this point I'm
  61 only considering ordinary QEMU with the "-kernel" option, and only on
  62 x86-64 hardware.
  63
  64 Some of the things on this slide you might not think of as part of the
  65 boot process, but they are all overhead as far as wrapping a container
  66 in a VM is concerned.  The important thing is that not everything here
  67 is "inside" QEMU or the guest, and so QEMU-based or guest-based
  68 tracing tools do not help.
  69
  70 What I did eventually was to connect a serial port to QEMU and
  71 timestamp all the messages printed by various stages of the boot
  72 sequence, and then write a couple of programs to analyze these
  73 timestamped messages using string matching and regular expressions, to
  74 produce boot benchmarks and charts.
  75
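To make the idea concrete, here is a minimal Python sketch of the
timestamp-and-match approach.  It is not the code of the actual tools.
The stage names and the regular expressions are made up for the
example, and the function names are hypothetical.

```python
import re
import time

# Hypothetical stage markers: a stage starts at the first message that
# matches its pattern.  The real tools use their own patterns.
STAGES = [
    ("bios", re.compile(r"SeaBIOS")),
    ("kernel", re.compile(r"Linux version")),
    ("userspace", re.compile(r"init: starting")),
]


def timestamp_lines(stream):
    """Yield (seconds since start, line) for each line read from stream."""
    start = time.monotonic()
    for line in stream:
        yield time.monotonic() - start, line.rstrip("\n")


def stage_durations(messages):
    """Return [(stage, seconds)] from a list of (timestamp, text) pairs.

    Each stage lasts until the next one starts; the last stage ends at
    the final message.
    """
    starts = {}
    for ts, text in messages:
        for name, pattern in STAGES:
            if name not in starts and pattern.search(text):
                starts[name] = ts
    found = sorted(starts.items(), key=lambda item: item[1])
    end = messages[-1][0]
    durations = []
    for i, (name, start) in enumerate(found):
        stop = found[i + 1][1] if i + 1 < len(found) else end
        durations.append((name, stop - start))
    return durations
```

In use, timestamp_lines() would wrap whatever file object is reading
the serial port, and stage_durations() would run on the collected list
once the guest has shut down.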
  76   SLIDE: boot-benchmark
  77
  78 The boot-benchmark tool produces simple timings averaged over 10 runs.
  79
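Averaging over a fixed number of runs can be sketched like this.  This
is an illustration only: boot_once is a hypothetical callable that
boots the appliance and returns when it has shut down, and reporting
the standard deviation is my addition for the example.

```python
import statistics
import time


def benchmark(boot_once, runs=10):
    """Call boot_once() `runs` times; return (mean, stdev) in milliseconds.

    Needs runs >= 2, because the sample standard deviation is undefined
    for a single measurement.
    """
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        boot_once()
        samples.append((time.monotonic() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)
```

A monotonic clock is used so that adjustments to the host's wall-clock
time during a run cannot distort the measurement.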
  80   SLIDE: boot-analysis 1
  81
  82 The more complex boot-analysis tool produces boot charts showing
  83 each stage of booting ...
  84
  85   SLIDE: boot-analysis 2
  86
  87 ... and which activities took the longest time.
  88
  89 The tools are based on the libguestfs framework and are surprisingly
  90 accessible.  With a recent Linux distro anyone should be able to run
  91 them.  You can download these screenshots and get links to the tools
  92 in the PDF paper which accompanies this talk.
  93
  94   SLIDE: progress graph
  95
  96 Here's an interesting graph of my progress on this problem over time,
  97 with the appliance boot time in milliseconds up the left.  I spent
  98 the first couple of weeks in March exploring different ways to trace
  99 QEMU, and finally writing those tools.  Once I had the tools giving me
 100 visibility into what was really going on, I got the time down from
 101 3500ms to 1200ms in the space of a few days.
 102
 103 It's worth noting that the libguestfs appliance had been booting in 3
 104 to 4 seconds for literally half a decade before this.
 105
 106 But there's also a long tail with diminishing returns.
 107
 108 Let's have a look at a few delays which are avoidable.  I like to
 109 classify these according to their root cause.
 110
 111   SLIDE: 16550A UART probing
 112
 113 When the Linux kernel boots it spends 25ms probing the serial port to
 114 check it's really a working 16550A.  Hypervisors never export broken
 115 serial ports to their guests, so this is useless.  The kernel
 116 maintainers' argument is that you might pass through a real serial port
 117 to a guest, so just checking that the kernel is running under KVM
 118 isn't sufficient to bypass this test.  I haven't managed to get a good
 119 solution to this, but it'll probably involve some kind of ACPI
 120 description to say "yes really this is a working UART from KVM".
 121
 122   SLIDE: misaligned goals
 123
 124 My root cause for that problem is, I think, misaligned goals.  The
 125 goals of booting appliances very fast don't match the correctness
 126 goals of the kernel.
 127
 128   SLIDE: hwclock
 129
 130 We ran hwclock in the init script.  The tools told us this was taking
 131 300ms.  It's not even necessary since we always have kvmclock
 132 available in the appliance.
 133
 134   SLIDE: carelessness
 135
 136 I'm going to put that one down to carelessness.  Because I didn't have
 137 the tools before to analyze the boot process, commands crept into the
 138 init script which looked innocent but actually took a huge amount of
 139 time to run.  With better tools we should be able to avoid this
 140 happening in future.
 141
 142   SLIDE: kernel debug messages
 143
 144 The largest single saving was realizing that we shouldn't print out
 145 all the kernel debug messages to the serial console unless we're
 146 operating in debug mode.  In non-debug mode, the messages are just
 147 thrown away.  The solution was to add a statement saying "if we're not
 148 in verbose mode, add the quiet option to the kernel command line", and
 149 that saved 1000 milliseconds.
 150
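The fix amounts to a single conditional.  A minimal sketch, assuming
the command line is built from a list of arguments (the function name
and the example arguments are hypothetical; "quiet" is the standard
kernel parameter):

```python
def kernel_cmdline(base_args, verbose=False):
    """Return the kernel command line, adding "quiet" unless verbose."""
    args = list(base_args)
    if not verbose and "quiet" not in args:
        args.append("quiet")
    return " ".join(args)
```

For example, kernel_cmdline(["console=ttyS0", "panic=1"]) gives
"console=ttyS0 panic=1 quiet", while passing verbose=True leaves the
arguments unchanged.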
 151   SLIDE: stupidity
 152
 153 Let's now look at examples of problems that cannot be solved
 154 easily.
 155
 156   SLIDE: ELF loader
 157
 158 QEMU in Fedora is linked to 170 libraries, and just doing "qemu
 159 -version" on my laptop takes about 60ms.  We can't reduce the number
 160 of libraries very easily, especially when we want to probe the
 161 features of QEMU which necessarily means loading all the libraries.
 162 And we can't simplify the ELF loader either because it is operating to
 163 a complex standard with lots of obscure features like "symbol
 164 interposition".
 165
 166   SLIDE: standards
 167
 168 Standards mean there are some things you just have to work around.
 169 In the case of QEMU probing, that means aggressive caching.
 170
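One way such caching could work is to key the cached probe output on
the size and modification time of the QEMU binary, so that the slow
probe only runs again after QEMU has been upgraded.  The sketch below
is an assumption about how this might be done, not a description of
any real implementation.  The cache file format and the function names
are invented for the example.

```python
import json
import os
import subprocess


def run_version(path):
    """Assumed probe: run "<binary> -version" and return its output."""
    result = subprocess.run([path, "-version"], capture_output=True,
                            text=True, check=True)
    return result.stdout


def cached_probe(binary, cache_file, probe=run_version):
    """Return the probe output for `binary`, re-running the probe only
    when the binary's size or mtime differs from the cached entry."""
    st = os.stat(binary)
    key = [st.st_size, st.st_mtime_ns]
    try:
        with open(cache_file) as f:
            entry = json.load(f)
        if entry["key"] == key:
            return entry["output"]
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or unreadable cache: fall through and probe
    output = probe(binary)
    with open(cache_file, "w") as f:
        json.dump({"key": key, "output": output}, f)
    return output
```

The stat call is far cheaper than starting QEMU, so a cache hit avoids
the cost of loading all the libraries.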
 171   SLIDE: initcalls
 172
 173 The kernel performs a large number of "initcalls" -- initializer
 174 functions.  It does them serially on a single processor.  And every
 175 subsystem which is enabled seems to have an initcall.  Individual
 176 initcalls are often very short, but the sheer number of them is a
 177 problem.
 178
 179 One solution would be parallelization, but this has been rejected
 180 upstream.
 181
 182 Another would be to use a custom kernel build which chops out
 183 any subsystem that doesn't apply to our special case virtual
 184 machine.  In fact I tried this approach and it's possible to
 185 reduce the time spent in initcalls before reaching userspace
 186 by 60%.
 187
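One way to see where initcall time goes is to boot the kernel with the
"initcall_debug" parameter, which logs a line for each initcall along
with its duration in microseconds.  The talk does not say this is how
the numbers were obtained, so treat the following as an independent
sketch.  It totals the time and lists the slowest initcalls from such
a log; the function name is hypothetical.

```python
import re

# Matches lines such as:
#   initcall pci_arch_init+0x0/0x6b returned 0 after 12 usecs
INITCALL_RE = re.compile(
    r"initcall (\S+).*? returned (-?\d+) after (\d+) usecs"
)


def slowest_initcalls(log_lines, top=5):
    """Return (total_usecs, [(function, usecs), ...]) for the slowest."""
    calls = []
    for line in log_lines:
        m = INITCALL_RE.search(line)
        if m:
            # Drop the "+0x0/0x6b" offset suffix from the symbol name.
            calls.append((m.group(1).split("+")[0], int(m.group(3))))
    total = sum(usecs for _, usecs in calls)
    calls.sort(key=lambda call: call[1], reverse=True)
    return total, calls[:top]
```

Comparing the total against the top few entries shows whether the
time is dominated by a handful of slow initcalls or spread thinly
across many short ones.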
 188   SLIDE: maintainability
 189
 190 BUT for distro kernels we usually want a single kernel image that can
 191 be used everywhere: bare metal and virtual machines.  That's because
 192 building and maintaining multiple kernel images, fixing bugs, tracking
 193 CVEs, and so on, is too difficult.  Experience with Xen and on ARM
 194 taught us that.
 195
 196 So Intel's solution, which was to build a custom cut-down kernel, is
 197 not one that most distros will be able to use.
 198
 199 I have more numbers to support this in the accompanying paper.
 200
 201 Last one ...
 202
 203   SLIDE: udev
 204
 205 At the moment, after the kernel initcalls, udev is the slowest
 206 component.  udev takes 130ms to populate /dev.
 207
 208   SLIDE: ?
 209
 210 The cause here may be the inflexible and indivisible configuration of
 211 udev.  We cannot and don't want to maintain our own simplified
 212 configuration and it's impossible to split up the existing config.
 213 But maybe that's not the problem.  It might be the kernel sending udev
 214 events serially, or udev processing them slowly, or the number of
 215 external shell scripts that udev runs, or who knows.
 216
 217   SLIDE: Conclusions
 218
 219
 220
 221
 222 And finally a plea from the heart ...
 223
 224   SLIDE: Please think
 225