2016-kvm-forum/paper.tex

   1 \documentclass[12pt,a4paper]{article}
   2 \usepackage[utf8x]{inputenc}
   3 \usepackage{parskip}
   4 \usepackage{hyperref}
   5 \usepackage{xcolor}
   6 \hypersetup{
   7     colorlinks,
   8     linkcolor={red!50!black},
   9     citecolor={blue!50!black},
  10     urlcolor={blue!80!black}
  11 }
  12 \usepackage{abstract}
  13 \usepackage{graphicx}
  14 \DeclareGraphicsExtensions{.pdf,.png,.jpg}
  15 \usepackage{float}
  16 \floatstyle{boxed}
  17 \restylefloat{figure}
  18 \usepackage{fancyhdr}
  19   \pagestyle{fancy}
  20   %\fancyhead{}
  21   %\fancyfoot{}
  22
  23 \title{Optimizing QEMU boot time}
  24 \author{
  25 \large
  26 Richard W.M. Jones
  27 \normalsize Red Hat Inc.
  28 \normalsize \href{mailto:rjones@redhat.com}{rjones@redhat.com}
  29 }
  30 \date{}
  31
  32 \begin{document}
  33 \maketitle
  34
  35 \begin{abstract}
  36 Everyone knows that containers are really fast and lightweight, and
  37 full virtualization is slow and heavyweight ... Or that's what we
  38 thought, until Intel demonstrated full Linux virtual machines booting
  39 as fast as containers and using as little memory.  Intel's work used
  40 kvmtool and a customized, cut down guest kernel.  Can we do the same
  41 using libvirt, QEMU, SeaBIOS, and an off the shelf Linux distro
  42 kernel?  The short answer is \textit{no}, but we can get pretty close,
  43 and it was an exciting journey learning about unexpected performance
  44 roadblocks, developing tools to measure the boot process, and shaving
  45 off milliseconds all over the place.  The work has practical
  46 significance because it will allow us to deploy secure containers,
  47 protected by hardware virtualization.  Even if you never plan to use
  48 containers, you're still benefiting from a faster QEMU experience.
  49 \end{abstract}
  50
  51 \section{Intel Clear Linux}
  52
  53 Intel's Clear Linux means a lot of different things to different
  54 people.  I'm only going to talk about a narrow aspect of it, usually
  55 known as ``Clear Containers'', but if other people talk about Intel
  56 Clear Linux they might be talking about a Linux distribution,
  57 OpenStack or graphics technologies.
  58
  59 LWN has a useful and relatively recent introduction to Clear
  60 Containers \url{https://lwn.net/Articles/644675/}.
  61
  62 If you download the Intel Clear Containers demo
  63 (\url{https://download.clearlinux.org/demos/containers/clear-containers-demo.tar.xz}),
  64 unpack it and run \texttt{bash~./boot.sh}, it will boot into a
  65 full Linux VM in about 150ms, using 20~MB of RAM.
  66
  67 Intel are using this technology along with a customized Docker driver
  68 to run Docker containers safely inside a VM.  The overhead (150ms /
  69 20~MB) is very attractive since it doesn't impact on the density that
  70 containers give you.  It's also aligned with Intel's interests, since
  71 they are selling chips with VT, VT-d, EPT, VPID and so on and they
  72 need people to use those features.
  73
  74 The Clear Containers demo uses \texttt{kvmtool} with several
  75 non-upstream patches such as for DAX and 64 bit guests.  Since first
  76 demonstrating Clear Containers, Intel has worked on getting vNVDIMM
  77 (needed for DAX) into QEMU.
  78
  79 The Clear Containers demo from last year uses a patched Linux kernel.
  80 There are many non-upstream patches.  More importantly they use a
  81 custom, cut down configuration where many subsystems not used by VMs
  82 are cut out entirely.
  83
  84 \section{Real Linux distros use QEMU}
  85
  86 Can we do the same sort of thing in our Linux distros?  Let's talk
  87 about some things that constrain us in Fedora.
  88
  89 We'd prefer to use QEMU over kvmtool.  QEMU isn't really ``bloated''.
  90 It's featureful, but (generally) if you're not using those features
  91 they don't slow things down.
  92
  93 We \textit{can't} use the heavily patched and customized kernel.
  94 Fedora is strictly ``upstream first''.  Fedora also ships a single
  95 kernel image for baremetal, virtual machines and all other uses, since
  96 building and maintaining multiple kernels is a huge pain.
  97
  98 \section{Stating the problem}
  99
 100 What we want to do is to boot up and shut down a modern Linux kernel
 101 in a KVM virtual machine on a modern Linux host.  Inside the virtual
 102 machine we will eventually want to run our Docker container.  However
 103 I am just concentrating on the overhead of the boot and shutdown.
 104
 105 \begin{samepage}
 106 Conveniently -- and also the reason I'm interested in this problem --
 107 libguestfs does almost the same thing.  It starts up and shuts down a
 108 small Linux-based appliance.  If you have \texttt{guestfish}
 109 installed, then you can try running the command below (several times
 110 so you have a warm cache).  Add \texttt{-v~-x} to the command line to
 111 see what's really going on.
 112
 113 \begin{verbatim}
 114 $ guestfish -a /dev/null run
 115 \end{verbatim}
 116 \end{samepage}
 117
 118 \section{Measurements}
 119
 120 The first step to improving the situation is to build tools that can
 121 accurately measure the time taken for each step in the boot process.
 122
 123 Booting a Linux kernel under QEMU using the \texttt{-kernel} option
 124 goes through the steps listed in table~\ref{tab:kernel-steps}.
 125
 126 \begin{table}[h]
 127 \caption{Steps run when you use QEMU \texttt{-kernel}}
 128 \centering
 129 \begin{tabular}{l}
 130   query QEMU's capabilities \\
 131   \hline
 132   run QEMU \\
 133   \hline
 134   run SeaBIOS \\
 135   \hline
 136   run the kernel \\
 137   \hline
 138   run the initramfs \\
 139   \hline
 140   load kernel modules \\
 141   \hline
 142   mount and pivot to the root filesystem \\
 143   \hline
 144   run \texttt{/init}, \texttt{udevd} etc \\
 145   \hline
 146   perform the desired task \\
 147   \hline
 148   shutdown \\
 149   \hline
 150   exit QEMU
 151 \end{tabular}
 152 \label{tab:kernel-steps}
 153 \end{table}
 154
 155 How do you know when SeaBIOS starts or various kernel events happen?
 156
 157 I started out looking at various QEMU tracing options, but ended up
 158 using a very simple technique: Attach a serial console to QEMU,
 159 timestamp the messages as they arrive, and use regular expression
 160 string matches to find significant events.
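
A minimal sketch of this technique in Python follows.  It is an
illustration only, not the code of the real tools: the event names and
regular expressions are assumptions chosen for the example, and
\texttt{timestamp\_events} is a hypothetical helper.

\begin{verbatim}
import re
import time

# Hypothetical event table: these patterns are illustrative guesses at
# messages seen on the serial console, not the ones the real tools use.
EVENT_PATTERNS = {
    "seabios_start": re.compile(r"SeaBIOS \(version"),
    "kernel_start": re.compile(r"Linux version"),
}

def timestamp_events(lines, clock=time.monotonic):
    """Map each event to the time (in seconds, relative to the first
    line read) at which its pattern first matched a console line."""
    start = None
    events = {}
    for line in lines:      # e.g. a file object reading the console
        now = clock()       # timestamp the line as it arrives
        if start is None:
            start = now
        for name, pattern in EVENT_PATTERNS.items():
            if name not in events and pattern.search(line):
                events[name] = now - start
    return events
\end{verbatim}

The clock is passed as a parameter so that the function can be
exercised with recorded timestamps as well as with a live console.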
 161
 162 The three programs I wrote (two in C and one in Perl) use libguestfs
 163 as a convenient framework, since libguestfs has the machinery already
 164 for creating VMs, initramfses, capturing serial console output etc.
 165 They are:
 166
 167 \begin{itemize}
 168 \item \texttt{boot-benchmark}
 169
 170 \texttt{boot-benchmark} runs the boot up sequence repeatedly, throwing
 171 away the first few runs (to warm the cache) and collecting the mean
 172 test time and standard deviation.
 173
 174 \begin{verbatim}
 175 $ ./boot-benchmark
 176 Warming up the libguestfs cache ...
 177 Running the tests ...
 178
 179 test version: libguestfs 1.33.29
 180  test passes: 10
 181 host version: Linux moo.home.annexia.org 4.4.4-301.fc23.x86_64 #1 SMP
 182     host CPU: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
 183      backend: direct               [to change set $LIBGUESTFS_BACKEND]
 184         qemu: /home/rjones/d/qemu/x86_64-softmmu/qemu-system-x86_64
 185 qemu version: QEMU emulator version 2.6.50, Copyright (c) 2003-2008
 186          smp: 1                    [to change use --smp option]
 187      memsize: 500                  [to change use --memsize option]
 188       append:                      [to change use --append option]
 189
 190 Result: 568.2ms ±8.7ms
 191 \end{verbatim}
 192
 193 \item \texttt{boot-benchmark-range.pl}
 194
 195 \texttt{boot-benchmark-range.pl} is a wrapper script around
 196 \texttt{boot-benchmark} which lets you benchmark across a range of
 197 commits from some other project (eg. QEMU or the kernel).  You can
 198 easily see which commits are causing or solving performance problems
 199 as in the example below:
 200
 201 \begin{verbatim}
 202 $ ./boot-benchmark-range.pl ~/d/qemu 3123bd8^..8e86aa8
 203 da34fed hw/ppc/spapr: Fix crash when specifying bad[...]
 204         1666.8ms ±2.5ms
 205
 206 3123bd8 Merge remote-tracking branch 'remotes/dgibson/[...]
 207         1658.8ms ±4.2ms
 208
 209 f419a62 (origin/master, origin/HEAD, master) usb/uhci: move[...]
 210         1671.3ms ±17.0ms
 211
 212 8e86aa8 Add optionrom compatible with fw_cfg DMA version
 213         1013.7ms ±3.0ms ↑ improves performance by 64.9%
 214 \end{verbatim}
 215
 216 \item \texttt{boot-analysis}
 217
 218 \begin{figure}[h]
 219 \caption{boot-analysis timeline}
 220 \includegraphics[width=0.9\textwidth]{boot-analysis-screenshot}
 221 \label{fig:ba-timeline}
 222 \end{figure}
 223
 224 \texttt{boot-analysis} performs multiple runs of the boot sequence.
 225 It enables the QEMU serial console (and other events from libguestfs),
 226 timestamps the events, and then presents the sequence graphically as
 227 shown in figure~\ref{fig:ba-timeline}.  Also shown are mean times and standard
 228 deviations and percentage of the total run time.
 229
 230 \begin{figure}[h]
 231 \caption{boot-analysis longest activities}
 232 \includegraphics[width=0.9\textwidth]{boot-analysis-screenshot-2}
 233 \label{fig:ba-longest}
 234 \end{figure}
 235
 236 This tool also prints which activities took the longest time; see
 237 figure~\ref{fig:ba-longest}.
 238
 239 \end{itemize}
 240
 241 The source for these tools is here:
 242 \url{https://github.com/libguestfs/libguestfs/tree/master/utils}.
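
The statistics reported by \texttt{boot-benchmark} and
\texttt{boot-benchmark-range.pl} can be approximated with the Python
sketch below.  The function names are hypothetical, and the use of the
sample (rather than population) standard deviation is an assumption.
Computing the percentage relative to the new, faster time is also an
assumption, although it is consistent with the 64.9\% shown in the
example output above.

\begin{verbatim}
import statistics

def summarize(times_ms, warmup=0):
    """Discard the first `warmup` runs (cache warming) and return the
    (mean, sample standard deviation) of the remaining times."""
    kept = times_ms[warmup:]
    return statistics.mean(kept), statistics.stdev(kept)

def improvement_percent(old_ms, new_ms):
    """Time saved by a commit, as a percentage of the new time."""
    return (old_ms - new_ms) / new_ms * 100.0
\end{verbatim}

For example, \texttt{improvement\_percent(1671.3, 1013.7)} evaluates to
roughly 64.9.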
 243
 244 Only now that we have the right tools to hand can we work out what
 245 activities take time.
 246
 247 For consistency, all times displayed by the tool are in milliseconds
 248 (ms), and I try to use the same convention in this paper.
 249
 250 In this paper I'm using times based on my laptop, an
 251 Intel\textregistered Core\texttrademark i7-5600U CPU @ 2.60GHz
 252 (Broadwell~U).  This does of course mean that these results won't be
 253 exactly reproducible, but it is hoped that with similar hardware you
 254 will get times that differ only by a scale factor.
 255
 256 \section{glibc}
 257
 258 Surprisingly the first problem is glibc.  QEMU links to over 170
 259 libraries, and that number keeps growing.  A simple
 260 \texttt{qemu~-version} takes up to 60ms, and examining this with
 261 \texttt{perf} showed two things:
 262
 263 \begin{itemize}
 264 \item Ceph had a bug where it ran some \texttt{rdtsc} benchmarks in a
 265   constructor function.  This is now fixed.
 266 \item The glibc link loader is really slow when presented with lots of
 267   libraries and lots of symbols.
 268 \end{itemize}
 269
 270 The second problem is intractable.  We can't link to fewer libraries,
 271 because each of those libraries represents some feature that someone
 272 wants, like Ceph, or Gtk support (though if you remove the Gtk
 273 dependency the link time reduces substantially).  And the link loader
 274 is bound by all sorts of obscure ELF rules (eg. symbol interposition)
 275 which we don't need but cannot avoid, and which make it slow.
 276
 277 When I said earlier that QEMU features don't slow things down, this is
 278 an exception.
 279
 280 We can run QEMU fewer times.  There are several places where we need
 281 to run QEMU.  Obviously one place is where we start the virtual
 282 machine, and the overhead there cannot be avoided.  But we also
 283 perform QEMU feature detection by running commands like
 284 \texttt{qemu~-help} and \texttt{qemu~-device~\textbackslash?}, and
 285 libguestfs now caches that output.
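
The idea of caching feature-detection output can be sketched as
follows.  This is a hedged illustration in Python, not libguestfs's
implementation: the function names, the on-disk layout and the choice
to key the cache on the binary's size and modification time are all
assumptions made for the example.

\begin{verbatim}
import hashlib
import os
import subprocess

def run_command(cmd):
    """Run cmd and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout

def cached_output(binary, args, cache_dir, run=run_command):
    """Return the output of `binary args`, reusing a cached copy
    unless the binary's size or modification time has changed."""
    st = os.stat(binary)
    key = "\0".join([binary, *args,
                     str(st.st_size), str(st.st_mtime_ns)])
    name = hashlib.sha256(key.encode()).hexdigest()
    path = os.path.join(cache_dir, name)
    try:
        with open(path) as f:
            return f.read()      # cache hit: QEMU is not run
    except FileNotFoundError:
        pass
    output = run([binary, *args])
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        f.write(output)
    return output
\end{verbatim}

Because the key includes the size and modification time, upgrading or
rebuilding QEMU invalidates the cached output automatically.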
 286
 287 \section{QEMU}
 288
 289 Libguestfs, Intel Clear Containers, and any future Docker container
 290 support we build will use \texttt{-kernel} and \texttt{-initrd} or
 291 their equivalent.  In QEMU up to 2.6 on x86-64 this was implemented
 292 using an interface called \texttt{fw\_cfg} and a PIO loop, and that is
 293 very slow.  To load the kernel and very small initrd used by
 294 libguestfs takes around 700ms.  In QEMU 2.7 we have added a pseudo-DMA
 295 mode which makes this step almost instant.
 296
 297 To see debugging messages from the kernel and to collect our benchmark
 298 results, we have to use an emulated 16550A UART (serial port).
 299 Virtio-console exists but isn't a good replacement because it can't be
 300 used to get BIOS and very early kernel messages.  The UART is slow.
 301 It takes about 4µs per character, or approximately 1ms for 3 lines of
 302 text.  Enabling debugging changes the results subtly.
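
As a rough check of these figures, assuming console lines of about 80
characters (the line length is an assumption, not a measurement):

\[
  3 \times 80 \times 4\,\mu\mathrm{s} = 960\,\mu\mathrm{s}
  \approx 1\,\mathrm{ms}
\]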
 303
 304 To get serial console output from the BIOS, we use a
 305 Google-contributed option ROM called SGABIOS.  It quickly became clear
 306 that SGABIOS introduced a 260ms boot delay.  This happened because it
 307 expects to be talking to a real serial terminal, so it sends control
 308 sequences to query the width and height of this ``terminal''.  These
 309 weren't being answered by the actual reader (libguestfs simply reads).
 310 The solution was to modify libguestfs to respond to the control
 311 sequence with a dummy reply, which reduced the delay to almost
 312 nothing.
 313
 314 \section{libvirt}
 315
 316 Libguestfs can optionally use libvirt to manage the QEMU process.
 317 When I did this it was obvious that libvirt was adding a (precisely)
 318 200ms delay.  I tracked this down to a poorly implemented polling loop
 319 in libvirt, waiting for the QEMU monitor socket to be created by QEMU.
 320 I fixed it by changing the loop to use exponential backoff.  A better
 321 fix would involve passing pre-created file descriptors to QEMU.
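
A polling loop with exponential backoff can be sketched as below.
This is Python for illustration only, not the libvirt code: the
function name, the 1ms starting delay, the 100ms cap and the timeout
are all assumptions.

\begin{verbatim}
import time

def wait_until(ready, first_delay=0.001, max_delay=0.1,
               timeout=30.0, sleep=time.sleep,
               clock=time.monotonic):
    """Poll ready() with exponentially growing sleeps.  Return True
    once ready() succeeds, or False if the timeout expires."""
    deadline = clock() + timeout
    delay = first_delay
    while not ready():
        if clock() >= deadline:
            return False
        sleep(delay)
        delay = min(delay * 2, max_delay)   # double, up to the cap
    return True
\end{verbatim}

A caller might use
\texttt{wait\_until(lambda: os.path.exists(socket\_path))}, where
\texttt{socket\_path} is a hypothetical variable holding the monitor
socket path.  With a fixed 200ms sleep, a socket that appears after
5ms is noticed only at 200ms.  With the doubling delays above it is
noticed after about 7ms.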
 322
 323 \section{SeaBIOS}
 324
 325 SeaBIOS wastes time probing for boot devices even though we will use
 326 the \texttt{linuxboot} option ROM to boot (via \texttt{-kernel}).  By
 327 building a \texttt{bios-fast.bin} variant of SeaBIOS with many unused
 328 features disabled we can reduce the time spent inside the BIOS from
 329 about 63ms to about 19ms.
 330
 331 \section{kernel}
 332
 333 PCI probing is slow, taking around 95ms for a guest with just two
 334 virtio-scsi drives.  It turns out that it's not the scanning of the
 335 PCI device space which is slow, but the initialization of each device
 336 as it is found.  QEMU's i440fx machine model exports some legacy
 337 devices which cannot be switched off, and that is unhelpful.
 338
 339 I implemented experimental support for parallel PCI probing using the
 340 kernel ``async'' feature.  With 1~vCPU this slows things down very
 341 slightly as expected.  With 4~vCPUs performance improved by about
 342 30\%.  Unfortunately we can't use it because of the next point.
 343
 344 You would think multiple vCPUs would be better and faster than 1~vCPU,
 345 but that is not the case.  It actually has a large negative impact on
 346 performance.  Switching from 1 to 4~vCPUs increases the boot time by
 347 over 200ms.  About 25ms is spent starting each secondary CPU (in
 348 \texttt{check\_tsc\_sync\_target}).  This can be avoided by setting
 349 \texttt{tsc=reliable} but no one can tell me if this is safe.  But
 350 most of the extra time just disappears between the cracks -- for
 351 example, PCI probing just slows down, but for no readily apparent
 352 reason.  It seems as if the overhead of spinlocks or RCU or whatever
 353 hurts general performance.  Or perhaps there is some scheduling
 354 problem on the host since it only has 4 logical CPUs.
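
Putting rough numbers on this: with 4~vCPUs there are 3 secondary
CPUs, so starting them accounts for about

\[
  3 \times 25\,\mathrm{ms} = 75\,\mathrm{ms}
\]

of the extra boot time, which leaves at least

\[
  200\,\mathrm{ms} - 75\,\mathrm{ms} = 125\,\mathrm{ms}
\]

unexplained.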
 355
 356 When the kernel runs, it does some BIOS stuff, and there's a long
 357 delay (about 80ms) before \texttt{start\_kernel} is entered.
 358
 359 Another unavoidable overhead is \texttt{ftrace} which must modify
 360 every function in the kernel.  This takes 20ms.  You can't disable
 361 ftrace at run time; the only option is to compile it out, but that
 362 breaks so many useful features that we'd never persuade a distro
 363 kernel to do that.
 364
 365 If your kernel has crypto functions, then it will spend 18ms testing
 366 them at boot.  Herbert Xu accepted my patch to add a
 367 \texttt{cryptomgr.notests} flag which bypasses this.
 368
 369 As we are presenting an emulated 16550A UART,
 370 \texttt{serial\_8250\_init} runs, and this spends 25ms checking that
 371 the UART is really a 16550A (does it work, does it have a FIFO?), and
 372 (unsurprisingly) yes it is.  This is a totally useless waste of time,
 373 but I have not managed to come up with a patch or even with an
 374 approach for how to avoid this that is acceptable upstream.
 375
 376 But the main problem is none of the above.  It's simply the small
 377 amount of time taken to run many many initcalls.  For a distro kernel
 378 this can be around 690ms (with serial debugging enabled which
 379 exaggerates the effect somewhat).  One way to avoid that would be to
 380 compile some sort of custom kernel, and even though this approach is
 381 not acceptable for Fedora I did explore this, trying both a cut down
 382 distro kernel, and also a super-minimal kernel.
 383
 384 \begin{itemize}
 385 \item The cut down distro kernel works by removing any subsystem that
 386   has a $>$~1ms initcall overhead.  These include:
 387   \begin{itemize}
 388   \item auditing
 389   \item big\_key
 390   \item ftrace
 391   \item hugetlbfs
 392   \item input\_leds
 393   \item joydev
 394   \item joysticks
 395   \item keyboards
 396   \item kprobes
 397   \item libata
 398   \item mice
 399   \item microcode
 400   \item netlabel
 401   \item profiling support
 402   \item quota
 403   \item rtc\_drv\_cmos
 404   \item sound card support
 405   \item tablets
 406   \item touchscreens
 407   \item USB
 408   \item zbud
 409   \item zswap
 410   \end{itemize}
 411   That reduces the time taken running initcalls before userspace by
 412   about 20\%.  There is some scope for reducing this a bit more by
 413   going even further down the ``long tail'' of subsystems.
 414 \item For my second test I started with an absolutely minimal kernel
 415   config (\texttt{allnoconfig}), and built up the configuration until
 416   I got something that booted.  That reduces the time taken running
 417   initcalls before userspace by about 60\% (down to 288ms).
 418 \end{itemize}
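
One possible way to find initcalls above a threshold is sketched below
in Python.  It assumes the guest is booted with the kernel's
\texttt{initcall\_debug} parameter and that the resulting messages are
captured from the console.  The text above does not say how its
measurements were made, so the use of \texttt{initcall\_debug}, the
message format matched by the regular expression and the helper name
are all assumptions.

\begin{verbatim}
import re

# Assumed format of an initcall_debug completion message, e.g.
#   initcall foo_init+0x0/0x10 returned 0 after 1234 usecs
INITCALL_RE = re.compile(
    r"initcall (\w+)\S* (?:\[\w+\] )?"
    r"returned -?\d+ after (\d+) usecs")

def slow_initcalls(log_lines, threshold_usecs=1000):
    """Return (name, usecs) for every initcall slower than the
    threshold, slowest first."""
    found = []
    for line in log_lines:
        m = INITCALL_RE.search(line)
        if m and int(m.group(2)) > threshold_usecs:
            found.append((m.group(1), int(m.group(2))))
    return sorted(found, key=lambda item: item[1], reverse=True)
\end{verbatim}

The default threshold of 1000 microseconds corresponds to the 1ms
cut-off used for the cut down distro kernel.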
 419
 420 With a minimal kernel, we can get total boot times down to the
 421 500--600ms range, but not any lower.
 422
 423 \section{udev}
 424
 425 udevd takes about 130ms to populate \texttt{/dev}.
 426
 427 The rules are monolithic, entwined together and resist modification,
 428 and starting a new set of rules from scratch looks like it would be a
 429 constant game of catch up.
 430
 431 \section{initrd}
 432
 433 We use a program called supermin
 434 (\url{http://libguestfs.org/supermin.1.html}) to construct the initrd
 435 which is responsible for loading enough kmods to mount the real root
 436 filesystem and pivoting into it.
 437
 438 Because of PIO loading of the initrd in earlier versions of QEMU, it
 439 was very important to construct as small an initrd as possible, and
 440 supermin was not doing a very good job of that.  However once I
 441 started to analyze the situation there were some easy wins (now all
 442 upstream):
 443
 444 \begin{itemize}
 445 \item We were adding all virtio kmods to the appliance plus any
 446   dependencies, with the starting set being constructed using the
 447   wildcard ``\texttt{virtio*.ko}''.  The wildcard pulls in
 448   \texttt{virtio-gpu.ko} which depends on \texttt{drm.ko} and both are
 449   quite large.  Since we are only interested in non-graphical VMs, I
 450   was able to blacklist \texttt{virtio-gpu.ko} and that reduced the
 451   total size of the initrd.
 452 \item We use a small C init program to load the kmods and mount
 453   the root filesystem, and this must be statically linked so we
 454   don't have to include a separate libc in the initrd.  However
 455   glibc produces enormous static binaries (800KB+).  Switching to using
 456   dietlibc allows us to build the same program to a
 457   22KB binary, about $\frac{1}{40}$th of the size.
 458 \item We initially used xz-compressed kmods.  These are smaller,
 459   reducing PIO loading time (but making not a lot of difference to
 460   DMA) but they are very slow to decompress.  Switching to using
 461   uncompressed kmods produced a small reduction in boot time, and
 462   simplified the init code.
 463 \item Stripping kmods (with \texttt{strip~-g}) is very important for
 464   reducing the size of the initrd.
 465 \end{itemize}
 466
 467 The resulting initrd is about 126KB for the minimal kernel, or 347KB
 468 for the standard Fedora kernel.
 469
 470 \section{libguestfs}
 471
 472 Finally there is libguestfs itself which glues everything together and
 473 provides the initial \texttt{/init} script.  There were several
 474 savings to be made:
 475
 476 \begin{itemize}
 477 \item Even when not debugging, we were still reading the verbose
 478   kernel output over the slow UART, and then throwing it away.  The
 479   solution was to add the \texttt{quiet} option.  That reduced boot
 480   time by about 1,000ms, the single largest saving.
 481 \item We used to run the \texttt{hwclock} command.  With kvmclock it
 482   turns out this is not necessary, and removing it saved 300ms.
 483 \item We used to run \texttt{qemu~-help} and \texttt{qemu~-version}.
 484   Drew Jones pointed out the obvious: the help output contains the
 485   version number, so that reduces the number of times we need to run
 486   QEMU and suffer the glibc slow link loader overhead (and in the
 487   final version of libguestfs we also memoize QEMU output, reducing it
 488   further).
 489 \item We used to run SGABIOS unconditionally, but it is only necessary
 490   to use it when debugging.  When we're not debugging we can omit it
 491   and save loading it at all.
 492 \item Running \texttt{ldconfig} in the appliance to update the link
 493   loader cache took 100ms, but we found a way to avoid running it
 494   at all.
 495 \end{itemize}
 496
 497 \section{Memory usage and DAX}
 498
 499 I was pleasantly surprised to find that Intel had implemented a
 500 virtual NVDIMM and that ext4 + DAX also works in modern kernels, so it
 501 was a relatively trivial job to implement DAX.
 502
 503 However I'm not certain that the benefits are clear, nor that I'm
 504 measuring things correctly.
 505
 506 Inside the guest you can run \texttt{free~-m} with and without DAX:
 507
 508 \begin{verbatim}
 509               total    used    free  shared buff/cache available
 510 Without DAX:    485       3     451       1         30       465
 511 With DAX:       485       3     469       1         12       467
 512 \end{verbatim}
 513
 514 The MaxRSS of QEMU reduces by about 5~MB when DAX is enabled.
 515
 516 \section{Conclusions}
 517
 518 \begin{minipage}{\textwidth}
 519 This graph is just for a bit of fun:
 520
 521 \includegraphics[width=0.8\textwidth]{progress}
 522 \end{minipage}
 523
 524 There were a few false starts at the beginning of March (2016) where I
 525 was exploring how we might benchmark QEMU.  But once I had written the
 526 right tools to analyze the boot process, two quick wins brought the
 527 time down from 3.5~seconds to 1.2~seconds in the space of a few days.
 528 It's worth noting that the libguestfs appliance had been booting in
 529 approx.~3-4~seconds for literally half a decade.
 530
 531 Getting the time under 600ms took a few weeks longer, and without some
 532 breakthrough in the kernel or udev, I cannot see us getting the time
 533 under 500ms.
 534
 535 Performance is everyone's job, but it sometimes feels like few people
 536 care about a use case which is considered esoteric.  Yet this does
 537 affect everyone:
 538
 539 \begin{itemize}
 540 \item If we can use virtualization as an extra layer of security
 541   around operations, whether that is Docker, or Qubes, or
 542   libvirt-sandbox, or libguestfs, that benefits everyone.
 543 \item The same concerns about boot speed are raised over and over
 544   again by the embedded community.  If your digital camera is slow to
 545   switch on, it might be running initcalls for subsystems that it will
 546   never use.  (Many references here:
 547   \url{http://elinux.org/Boot_Time})
 548 \end{itemize}
 549
 550 Hopefully this paper will persuade developers to think twice before
 551 adding an unnecessary delay loop, inserting a useless boot splash
 552 screen, or creating another initcall.
 553
 554 \end{document}