Good afternoon everyone. My name is Richard Jones and I work for Red Hat on a suite of tools we call the "virt tools", and on libguestfs, for manipulating disk images and virtual machines. Today I'm going to talk about how long it takes to boot up and shut down virtual machines.

SLIDE: Intel talk

Originally I had planned to give this talk at the KVM Forum in August, but Intel submitted a similar talk, and as I'll discuss in a minute, they have had a large team working on this for over two years, so they are much further ahead. These are the details of the Intel talk, and you should definitely go and see it if you're at KVM Forum in August, or watch the video online afterwards.

- - - -

It's "common knowledge" (in quotes) that full virtualization is slow and heavyweight, whereas containers (which are really just a chroot) are fast and lightweight.

SLIDE: Start and stop times

Here is a slide which I clipped from a random presentation on the internet. The times are nonsense and not backed up by any evidence in the rest of that presentation, but it's typical of the sort of incorrect information circulating.

SLIDE: Clear containers logo

This all changed when Intel announced their project called "Clear Containers". They had a downloadable demonstration which showed a full VM booting to a login prompt in 150ms and using about 20MB of memory.

Today I'll try to persuade you that:

SLIDE: Performance brings security

performance on a par with Intel Clear Containers brings security, and new opportunities (new places) where we can use full virtualization.

If we can wrap Docker containers in VMs, we can make them secure. But we can only do that if the overhead is low enough that we don't lose the density and performance advantages of containers. And there are new areas where high-performance full virtualization makes sense, particularly sandboxing individual applications, and desktop environments similar to Qubes.

Intel's Clear Containers demo is an existence proof that it can be done. However ... there are shortcomings at the moment. The original demo used kvmtool, not QEMU, and it used a heavily customized Linux kernel. Can we do the same thing with QEMU and a stock Linux distribution kernel?

SLIDE: No

SLIDE: No, but we can do quite well

It should be possible to bring boot times down to around 500-600ms without any gross hacks and without an excessive amount of effort.

The first step to curing a problem is measuring the problem.

SLIDE: qemu -kernel boot

This is how a Linux appliance boots. I should say at this point that I'm only considering ordinary QEMU with the "-kernel" option, and only on x86-64 hardware. Some of the things on this slide you might not think of as part of the boot process, but they are all overhead as far as wrapping a container in a VM is concerned.

The important thing is that not everything here is "inside" QEMU or the guest, so QEMU-based or guest-based tracing tools do not help. What I did eventually was to connect a serial port to QEMU and timestamp all the messages printed by the various stages of the boot sequence, and then write a couple of programs which analyze these timestamped messages using string matching and regular expressions, producing boot benchmarks and charts.

SLIDE: boot-analysis 1

The more complex boot-analysis tool produces boot charts showing each stage of booting ...

SLIDE: boot-analysis 2

... and which activities took the longest time.

SLIDE: boot-benchmark

The boot-benchmark tool produces simple timings averaged over 10 runs.
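To make that concrete, here is a minimal sketch, in Python, of the timestamp-and-regex approach I've just described. This is not the real boot-analysis or boot-benchmark code; the milestone patterns, the input format and the function names are assumptions, chosen only to illustrate the idea.

#!/usr/bin/env python3
# Minimal sketch of the timestamp-and-regex idea behind the benchmarks.
# Not the real libguestfs tools: the milestone regexes below and the
# (timestamp, message) input format are illustrative assumptions.
import re
import statistics

# Hypothetical milestones: a regex marking when each boot stage first
# appears in the timestamped serial console output.
MILESTONES = {
    "kernel start":    re.compile(r"Linux version"),
    "userspace start": re.compile(r"Run /init as init process"),
    "guest ready":     re.compile(r"login:"),
}

def parse_run(lines):
    """Take (timestamp_in_seconds, message) pairs from one boot and
    return each milestone's offset from the first message, in ms."""
    if not lines:
        return {}
    t0 = lines[0][0]
    offsets = {}
    for t, msg in lines:
        for name, pattern in MILESTONES.items():
            if name not in offsets and pattern.search(msg):
                offsets[name] = (t - t0) * 1000.0
    return offsets

def benchmark(runs):
    """Average each milestone over several runs, the way boot-benchmark
    averages its timings over 10 runs."""
    for name in MILESTONES:
        samples = [r[name] for r in runs if name in r]
        if samples:
            print("%-16s %8.1f ms  (+/- %.1f ms over %d runs)"
                  % (name, statistics.mean(samples),
                     statistics.pstdev(samples), len(samples)))

# Usage: collect timestamped serial output for each boot (for example by
# reading QEMU's serial port and calling time.monotonic() per line), then:
#   benchmark([parse_run(run1_lines), parse_run(run2_lines), ...])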
The tools are based on the libguestfs framework and are quite easy to run; with a recent Linux distro anyone should be able to run them. You can download these screenshots and get links to the tools in the PDF paper which accompanies this talk.

SLIDE: progress graph

Here's an interesting graph of my progress on this problem over time, with the appliance boot time in milliseconds up the left. I spent the first couple of weeks in March exploring different ways to trace QEMU, and finally writing those tools. Once I had tools giving me visibility into what was really going on, I got the time down from 3500ms to 1200ms in the space of a few days. It's worth noting that the libguestfs appliance had been taking 3 or 4 seconds to start for literally half a decade before this. But there's also a long tail of diminishing returns.

It would be tempting at this point to describe every single problem and fix, but that would be boring, and if you're really interested it's all described in the paper that accompanies this talk. Instead, I want to classify the delays according to their root cause. We should be able to see, I hope, that some problems are easily fixed, whereas others are "institutionally" or "organizationally" very difficult to deal with.

Let's have a look at a few delays which are avoidable. I like to classify these according to their root cause.

SLIDE: 16550A UART probing

When the Linux kernel boots, it spends 25ms probing the serial port to check it's really a working 16550A. Hypervisors never export broken serial ports to their guests, so this is useless. The kernel maintainers' argument is that you might pass through a real serial port to a guest, so just checking that the kernel is running under KVM isn't sufficient to bypass this test. I haven't managed to find a good solution to this, but it'll probably involve some kind of ACPI description saying "yes, really, this is a working UART provided by KVM".

SLIDE: misaligned goals

My root cause for that problem is, I think, misaligned goals. The goal of booting appliances very fast doesn't match the correctness goals of the kernel.

SLIDE: hwclock

We ran hwclock in the init script. The tools told us this was taking 300ms. It's not even necessary, since we always have kvmclock available in the appliance.

SLIDE: carelessness

I'm going to put that one down to carelessness. Because I didn't previously have the tools to analyze the boot process, commands crept into the init script which looked innocent but actually took a huge amount of time to run. With better tools we should be able to avoid this happening in future.

SLIDE: kernel debug messages

The largest single saving was realizing that we shouldn't print all the kernel debug messages to the serial console unless we're operating in debug mode. In non-debug mode the messages fed to the serial port are simply thrown away. The solution was to add a statement saying "if we're not in debug mode, add the quiet option to the kernel command line", and that saved 1000 milliseconds.

SLIDE: stupidity

That one I can only put down to stupidity.

Let's now look at examples of problems that cannot be solved easily.

SLIDE: ELF loader

QEMU in Fedora is linked to 170 libraries, and just running "qemu -version" on my laptop takes about 60ms. We can't reduce the number of libraries very easily, especially when we want to probe the features of QEMU, which necessarily means loading all the libraries. And we can't simplify the ELF loader either, because it implements a complex standard with lots of obscure features like "symbol interposition".

SLIDE: standards

Standards mean there are some things you just have to work around. In the case of QEMU probing, that means aggressive caching.
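To give a feel for what that caching looks like, here is a hedged sketch in Python: it runs the expensive probe once and reuses the result until the QEMU binary changes. This is not how libguestfs actually implements its cache; the cache file location, the cache key, and the use of plain "-help" output are assumptions made for illustration.

#!/usr/bin/env python3
# Illustrative sketch of aggressive caching of a QEMU feature probe.
# The real libguestfs caching is more involved; the cache path and key
# below are assumptions, not its actual on-disk format.
import json, os, subprocess

CACHE = os.path.expanduser("~/.cache/qemu-probe.json")   # assumed location

def qemu_key(binary):
    """Invalidate the cache whenever the QEMU binary changes."""
    st = os.stat(binary)
    return "%s:%d:%d" % (binary, st.st_size, int(st.st_mtime))

def probe_qemu(binary="/usr/bin/qemu-system-x86_64"):
    key = qemu_key(binary)
    try:
        with open(CACHE) as f:
            cache = json.load(f)
        if cache.get("key") == key:
            return cache["help"]        # fast path: no dynamic linking at all
    except (OSError, ValueError):
        pass
    # Slow path: run QEMU once, paying the ELF loader cost, and remember
    # the result until the binary is upgraded.
    out = subprocess.run([binary, "-help"], capture_output=True,
                         text=True, check=True).stdout
    os.makedirs(os.path.dirname(CACHE), exist_ok=True)
    with open(CACHE, "w") as f:
        json.dump({"key": key, "help": out}, f)
    return out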
SLIDE: initcalls

The kernel performs a large number of "initcalls" (initializer functions). It runs them serially on a single processor, and every subsystem which is enabled seems to have an initcall. Individual initcalls are often very short, but the sheer number of them is a problem. One solution would be parallelization, but this has been rejected upstream. Another would be to use a custom kernel build which chops out any subsystem that doesn't apply to our special-case virtual machine. In fact I tried this approach, and it's possible to reduce the time spent in initcalls before reaching userspace by 60%.

SLIDE: maintainability

BUT for distro kernels we usually want a single kernel image that can be used everywhere: baremetal and virtual machines. That's because building and maintaining multiple kernel images, fixing bugs, tracking CVEs, and so on is too difficult with multiple images; experience with Xen and on ARM taught us that. So Intel's solution, which was to build a custom cut-down kernel, is not one that most distros will be able to use. I have more numbers to support this in the accompanying paper.

Last one ...

SLIDE: udev

At the moment, after the kernel initcalls, udev is the slowest component: udev takes 130ms to populate /dev.

SLIDE: ?

The cause here may be the inflexible and indivisible configuration of udev. We cannot, and don't want to, maintain our own simplified configuration, and it's impossible to split up the existing config. But maybe that's not the problem. It might be the kernel sending udev events serially, or udev processing them slowly, or the number of external shell scripts that udev runs, or who knows.

SLIDE: Conclusions

SLIDE: Conclusions 2

And finally, a plea from the heart ...

SLIDE: Please think