2016-eng-talk/talk.txt

Good afternoon everyone.

My name is Richard Jones and I work for Red Hat on a suite of tools we
call the "virt tools" and libguestfs for manipulating disk images and
virtual machines.

Today I'm going to talk about how long it takes to boot up and shut
down virtual machines.

  SLIDE: Intel talk

Originally I had planned to give this talk at the KVM Forum in August,
but Intel submitted a similar talk, and as I'll discuss in a minute,
they have had a large team working on this for over two years so they
are much further ahead.  These are the details of the Intel talk, and
you should definitely go and see it if you're at KVM Forum in
August, or watch the online video afterwards.

  - - - -

It's "common knowledge" (in quotes) that full virtualization is slow
and heavyweight, whereas containers (which are just a chroot, really)
are fast and lightweight.

  SLIDE: Start and stop times

Here is a slide which I clipped from a random presentation on the
internet.  The times are nonsense and not backed up by any evidence in
the rest of the presentation, but it's typical of the sort of
incorrect information circulating.

  SLIDE: Clear containers logo

This all changed when Intel announced their project called "Clear
Containers".  They had a downloadable demonstration which showed a
full VM booting to a login prompt in 150ms, and using about 20MB of
memory.

Today I'll try to persuade you that:

  SLIDE: Performance brings security

performance on a par with Intel Clear Containers brings security, and
new opportunities (new places) where we can use full virtualization.
If we can wrap Docker containers in VMs, we can make them secure.  But
we can only do that if the overhead is low enough that we don't lose
the density and performance advantages of containers.  And there are
new areas where high performance full virtualization makes sense,
particularly sandboxing individual applications, and desktop
environments similar to Qubes.

Intel's Clear Containers demo is an existence proof that it can be
done.

However ... there are shortcomings at the moment.  The original demo
used kvmtool not QEMU.  And it used a heavily customized Linux kernel.
Can we do the same thing with QEMU and a stock Linux distribution
kernel?

  SLIDE: No

  SLIDE: No, but we can do quite well

It should be possible to bring boot times down to around 500-600ms
without any gross hacks and without an excessive amount of effort.

The first step to curing a problem is measuring the problem.

  SLIDE: qemu -kernel boot

This is how a Linux appliance boots.  I should say at this point I'm
only considering ordinary QEMU with the "-kernel" option, and only on
x86-64 hardware.

Some of the things on this slide you might not think of as part of the
boot process, but they are all overhead as far as wrapping a container
in a VM is concerned.  The important thing is that not everything here
is "inside" QEMU or the guest, and so QEMU-based or guest-based
tracing tools do not help.

What I did eventually was to connect a serial port to QEMU and
timestamp all the messages printed by various stages of the boot
sequence, and then write a couple of programs to analyze these
timestamped messages using string matching and regular expressions, to
produce boot benchmarks and charts.

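As a rough illustration of the timestamp-and-match idea (this is not
the actual boot-analysis code), here is a minimal Python sketch.  It
assumes each serial line has already been prefixed with a timestamp in
seconds.  The marker patterns, stage names and function names are all
made up for the example.

```python
import re

# Hypothetical markers: each regex identifies the first message of a stage.
STAGE_MARKERS = [
    ("bios", re.compile(r"SeaBIOS")),
    ("kernel", re.compile(r"Linux version")),
    ("userspace", re.compile(r"init script started")),
]

LINE_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s+(.*)")


def parse_line(line):
    """Split '<seconds> <message>' into (float, str); None if malformed."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return float(m.group(1)), m.group(2)


def stage_durations_ms(lines):
    """Return {stage: duration in ms}.

    A stage starts at the first message matching its marker and ends
    when the next stage starts (or at the last timestamp in the log).
    """
    starts = {}
    last_stamp = 0.0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # ignore lines without a timestamp
        stamp, message = parsed
        last_stamp = max(last_stamp, stamp)
        for name, pattern in STAGE_MARKERS:
            if name not in starts and pattern.search(message):
                starts[name] = stamp

    ordered = sorted(starts.items(), key=lambda item: item[1])
    durations = {}
    for i, (name, start) in enumerate(ordered):
        stop = ordered[i + 1][1] if i + 1 < len(ordered) else last_stamp
        durations[name] = round((stop - start) * 1000.0, 3)
    return durations
```

Given a log whose three markers appear at 0.000, 0.120 and 0.500
seconds and whose last line is stamped 0.650, this reports 120ms,
380ms and 150ms for the three stages.
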
  SLIDE: boot-analysis 1

The more complex boot-analysis tool produces boot charts showing
each stage of booting ...

  SLIDE: boot-analysis 2

... and which activities took the longest time.

  SLIDE: boot-benchmark

The boot-benchmark tool produces simple timings averaged over 10 runs.

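The averaging itself is simple.  Here is a minimal sketch, assuming
the thing being timed is a command which boots the appliance and then
exits.  The function names and the optional warm-up run are
assumptions of this sketch, not a description of how boot-benchmark
is written.

```python
import statistics
import subprocess
import sys
import time


def time_command_ms(argv):
    """Run argv to completion and return wall-clock time in milliseconds."""
    start = time.monotonic()
    subprocess.run(argv, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return (time.monotonic() - start) * 1000.0


def benchmark(argv, runs=10, warmup=1):
    """Time argv `runs` times, after `warmup` untimed runs to fill caches."""
    for _ in range(warmup):
        time_command_ms(argv)
    samples = [time_command_ms(argv) for _ in range(runs)]
    return {
        "runs": runs,
        "mean_ms": statistics.mean(samples),
        # The standard deviation needs at least two samples.
        "stdev_ms": statistics.stdev(samples) if runs > 1 else 0.0,
    }


if __name__ == "__main__":
    # Stand-in command; a real run would launch the appliance instead.
    result = benchmark([sys.executable, "-c", "pass"])
    print("%.1fms ±%.1fms over %d runs"
          % (result["mean_ms"], result["stdev_ms"], result["runs"]))
```

Reporting the spread alongside the mean matters here, because once
boot times get down to a few hundred milliseconds the run-to-run noise
can be as large as the saving you are trying to measure.
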
The tools are based on the libguestfs framework and are quite easy to
run.  With a recent Linux distro anyone should be able to run them.
You can download these screenshots and get links to the tools in the
PDF paper which accompanies this talk.

  SLIDE: progress graph

Here's an interesting graph of my progress on this problem over time,
with the appliance boot time in milliseconds up the left.  I spent
the first couple of weeks in March exploring different ways to trace
QEMU, and finally writing those tools.  Once I had tools giving me
visibility into what was really going on, I got the time down from
3500ms to 1200ms in the space of a few days.

It's worth noting that the libguestfs appliance had been taking 3 or 4
seconds to start for literally half a decade before this.

But there's also a long tail with diminishing returns.

It would be tempting at this point for me to describe every single
problem and fix.  But that would be boring and if you're really
interested in that, it's all described in the paper that accompanies
this talk.  Instead, I wanted to try to classify the delays according
to their root cause.  That way we should be able to see, I hope, that
some problems are easily fixed, whereas others are "institutionally"
or "organizationally" very difficult to deal with.

Let's have a look at a few delays which are avoidable, each with its
root cause.

  SLIDE: 16550A UART probing

When the Linux kernel boots it spends 25ms probing the serial port to
check it's really a working 16550A.  Hypervisors never export broken
serial ports to their guests, so this is useless.  The kernel
maintainers' argument is that you might pass through a real serial
port to a guest, so just checking that the kernel is running under KVM
isn't sufficient to bypass this test.  I haven't found a good
solution to this, but it'll probably involve some kind of ACPI
description to say "yes really this is a working UART from KVM".

  SLIDE: misaligned goals

My root cause for that problem is, I think, misaligned goals.  The
goal of booting appliances very fast doesn't match the correctness
goals of the kernel.

  SLIDE: hwclock

We ran hwclock in the init script.  The tools told us this was taking
300ms.  It's not even necessary since we always have kvmclock
available in the appliance.

  SLIDE: carelessness

I'm going to put that one down to carelessness.  Because I didn't have
the tools before to analyze the boot process, commands crept into the
init script which looked innocent but actually took a huge amount of
time to run.  With better tools we should be able to avoid this
happening in future.

  SLIDE: kernel debug messages

The largest single saving was realizing that we shouldn't print out
all the kernel debug messages to the serial console unless we're
operating in debug mode.  In non-debug mode, the messages fed to the
serial port are thrown away.  The solution was to add a statement
saying "if we're not in debug mode, add the quiet option to the kernel
command line", and that saved 1000 milliseconds.

  SLIDE: stupidity

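The kernel debug messages fix really is that small.  Here is a sketch
of the logic in Python, purely for illustration: the function name,
the debug flag and the example base arguments are hypothetical, and
only the "quiet" kernel option itself comes from the talk.

```python
def kernel_cmdline(base_args, debug=False):
    """Build the kernel command line, silencing the console unless debugging.

    base_args is a list of kernel arguments (hypothetical example:
    ["console=ttyS0", "panic=1"]).
    """
    args = list(base_args)  # do not modify the caller's list
    if not debug and "quiet" not in args:
        # Nobody reads the messages in non-debug mode, so do not pay
        # for pushing them through the emulated serial port.
        args.append("quiet")
    return " ".join(args)
```

With debug=False this appends "quiet" exactly once.  With debug=True
the command line is passed through unchanged, so the full boot log is
still there when you need it.
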
Let's now look at examples of problems that cannot be solved
easily.

  SLIDE: ELF loader

QEMU in Fedora is linked to 170 libraries, and just doing "qemu
-version" on my laptop takes about 60ms.  We can't reduce the number
of libraries very easily, especially when we want to probe the
features of QEMU which necessarily means loading all the libraries.
And we can't simplify the ELF loader either because it is operating to
a complex standard with lots of obscure features like "symbol
interposition".

  SLIDE: standards

Standards mean there are some things you just have to work around.
In the case of QEMU probing, we work around it by aggressive caching.

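To make "aggressive caching" concrete, here is one possible shape for
it, sketched in Python.  This is an assumption about how such a cache
could work, not the code libguestfs uses: the probe output is stored
in a JSON file and reused for as long as the binary's path, size and
modification time are unchanged.  All names here are hypothetical.

```python
import json
import os
import subprocess


def probe_cached(binary, probe_args, cache_path, run=subprocess.run):
    """Return the stdout of `binary probe_args`, reusing a cached copy
    when the binary has not changed since the cache was written."""
    st = os.stat(binary)
    key = {
        "path": os.path.abspath(binary),
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,
        "args": list(probe_args),
    }

    try:
        with open(cache_path) as f:
            cached = json.load(f)
        if isinstance(cached, dict) and cached.get("key") == key:
            return cached["output"]  # cache hit: no process is started
    except (OSError, ValueError, KeyError):
        pass  # missing or corrupt cache: fall through and probe again

    result = run([binary, *probe_args],
                 capture_output=True, text=True, check=True)
    with open(cache_path, "w") as f:
        json.dump({"key": key, "output": result.stdout}, f)
    return result.stdout
```

Keying on size and modification time means the cache invalidates
itself when the package is upgraded, so only the first run after an
upgrade pays the full start-up cost.
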
  SLIDE: initcalls

The kernel performs a large number of "initcalls" -- initializer
functions.  It does them serially on a single processor.  And every
subsystem which is enabled seems to have an initcall.  Individual
initcalls are often very short, but the sheer number of them is a
problem.

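One way to see where the initcall time goes is to boot with the
kernel's "initcall_debug" parameter, which logs a line for each
initcall along with how long it took in microseconds.  The sketch
below is an illustration only and is not one of the tools from this
talk.  It totals those lines from a saved kernel log, and the function
names are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   initcall foo_init+0x0/0x10 returned 0 after 12 usecs
INITCALL_RE = re.compile(
    r"initcall (\S+?)(?:\+0x[0-9a-f]+/0x[0-9a-f]+)?(?: \[\w+\])?"
    r" returned -?\d+ after (\d+) usecs"
)


def initcall_times(lines):
    """Return a Counter mapping initcall name to total microseconds."""
    times = Counter()
    for line in lines:
        m = INITCALL_RE.search(line)
        if m:
            times[m.group(1)] += int(m.group(2))
    return times


def summarize(lines, top=5):
    """Count the initcalls, total their time, and list the slowest."""
    times = initcall_times(lines)
    return {
        "count": len(times),
        "total_usecs": sum(times.values()),
        "slowest": times.most_common(top),
    }
```

The count and the total are the interesting numbers here: a long list
of initcalls that each take only a few microseconds still adds up.
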
One solution would be parallelization, but this has been rejected
upstream.

Another would be to use a custom kernel build which chops out
any subsystem that doesn't apply to our special case virtual
machine.  In fact I tried this approach and it's possible to
reduce the time spent in initcalls before reaching userspace
by 60%.

  SLIDE: maintainability

BUT for distro kernels we usually want a single kernel image that can
be used everywhere, on bare metal and in virtual machines.  That's
because building and maintaining multiple kernel images, fixing bugs,
tracking CVEs, and so on is too difficult.  Experience with Xen and
on ARM taught us that.

So Intel's solution, which was to build a custom cut-down kernel, is
not one that most distros will be able to use.

I have more numbers to support this in the accompanying paper.

Last one ...

  SLIDE: udev

At the moment, after the kernel initcalls, udev is the slowest
component.  udev takes 130ms to populate /dev.

  SLIDE: ?

The cause here may be the inflexible and indivisible configuration of
udev.  We cannot and don't want to maintain our own simplified
configuration and it's impossible to split up the existing config.
But maybe that's not the problem.  It might be the kernel sending udev
events serially, or udev processing them slowly, or the number of
external shell scripts that udev runs, or who knows.

  SLIDE: Conclusions

  SLIDE: Conclusions 2

And finally a plea from the heart ...

  SLIDE: Please think