\documentclass[12pt,a4paper]{article}
\usepackage[utf8x]{inputenc}
\usepackage{parskip}
\usepackage{hyperref}
\usepackage{xcolor}
\hypersetup{
    colorlinks,
    linkcolor={red!50!black},
    citecolor={blue!50!black},
    urlcolor={blue!80!black}
}
\usepackage{abstract}
\usepackage{graphicx}
\DeclareGraphicsExtensions{.pdf,.png,.jpg}
\usepackage{float}
\floatstyle{boxed}
\restylefloat{figure}
\usepackage{fancyhdr}
  \pagestyle{fancy}
  %\fancyhead{}
  %\fancyfoot{}

\title{Optimizing QEMU boot time}
\author{
\large
Richard W.M. Jones
\normalsize Red Hat Inc.
\normalsize \href{mailto:rjones@redhat.com}{rjones@redhat.com}
}
\date{}

\begin{document}
\maketitle

\begin{abstract}
Everyone knows that containers are really fast and lightweight, and
full virtualization is slow and heavyweight ... Or that's what we
thought, until Intel demonstrated full Linux virtual machines booting
as fast as containers and using as little memory.  Intel's work used
kvmtool and a customized, cut-down guest kernel.  Can we do the same
using libvirt, QEMU, SeaBIOS, and an off-the-shelf Linux distro
kernel?  The short answer is \textit{no}, but we can get pretty close,
and it was an exciting journey learning about unexpected performance
roadblocks, developing tools to measure the boot process, and shaving
off milliseconds all over the place.  The work has practical
significance because it will allow us to deploy secure containers,
protected by hardware virtualization.  Even if you never plan to use
containers, you still benefit from a faster QEMU experience.
\end{abstract}

\section{Intel Clear Linux}

Intel's Clear Linux means a lot of different things to different
people.  I'm only going to talk about a narrow aspect of it, usually
known as ``Clear Containers'', but if other people talk about Intel
Clear Linux they might be talking about a Linux distribution,
OpenStack or graphics technologies.

LWN has a useful and relatively recent introduction to Clear
Containers \url{https://lwn.net/Articles/644675/}.

Until recently Intel hosted a Clear Containers demo.  If you
downloaded it and ran \texttt{bash~./boot.sh} then it booted into a
full Linux VM in about 150ms, using 20~MB of RAM.

Intel are using this technology along with a customized Docker driver
to run Docker containers safely inside a VM.  The overhead (150ms /
20~MB) is very attractive since it doesn't reduce the density that
containers give you.  It's also aligned with Intel's interests, since
they are selling chips with VT, VT-d, EPT, VPID and so on and they
need people to use those features.

The Clear Containers demo uses \texttt{kvmtool} with several
non-upstream patches such as for DAX and 64-bit guests.  Since first
demonstrating Clear Containers, Intel has worked on getting vNVDIMM
(needed for DAX) into QEMU.

The Clear Containers demo from last year uses a patched Linux kernel.
There are many non-upstream patches.  More importantly they use a
custom, cut-down configuration where many subsystems not used by VMs
are cut out entirely.

\section{Real Linux distros use QEMU}

Can we do the same sort of thing in our Linux distros?  Let's talk
about some things that constrain us in Fedora.

We'd prefer to use QEMU over kvmtool.  QEMU isn't really ``bloated''.
It's featureful, but (generally) if you're not using those features
they don't slow things down.

We \textit{can't} use the heavily patched and customized kernel.
Fedora is strictly ``upstream first''.  Fedora also ships a single
kernel image for bare metal, virtual machines and all other uses, since
building and maintaining multiple kernels is a huge pain.

\section{Stating the problem}

What we want to do is to boot up and shut down a modern Linux kernel
in a KVM virtual machine on a modern Linux host.  Inside the virtual
machine we will eventually want to run our Docker container.  However
I am just concentrating on the overhead of the boot and shutdown.

\begin{samepage}
Conveniently -- and also the reason I'm interested in this problem --
libguestfs does almost the same thing.  It starts up and shuts down a
small Linux-based appliance.  If you have \texttt{guestfish}
installed, then you can try running the command below (several times
so you have a warm cache).  Add \texttt{-v~-x} to the command line to
see what's really going on.

\begin{verbatim}
$ guestfish -a /dev/null run
\end{verbatim}
\end{samepage}

\section{Measurements}

The first step to improving the situation is to build tools that can
accurately measure the time taken for each step in the boot process.

Booting a Linux kernel under QEMU using the \texttt{-kernel} option
goes through the steps listed in table~\ref{tab:kernel-steps}.

\begin{table}[h]
\caption{Steps run when you use QEMU \texttt{-kernel}}
\centering
\begin{tabular}{l}
  query QEMU's capabilities \\
  \hline
  run QEMU \\
  \hline
  run SeaBIOS \\
  \hline
  run the kernel \\
  \hline
  run the initramfs \\
  \hline
  load kernel modules \\
  \hline
  mount and pivot to the root filesystem \\
  \hline
  run \texttt{/init}, \texttt{udevd} etc. \\
  \hline
  perform the desired task \\
  \hline
  shutdown \\
  \hline
  exit QEMU
\end{tabular}
\label{tab:kernel-steps}
\end{table}

How do you know when SeaBIOS starts or various kernel events happen?

I started out looking at various QEMU tracing options, but ended up
using a very simple technique: attach a serial console to QEMU,
timestamp the messages as they arrive, and use regular expression
string matches to find significant events.
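
The technique can be sketched in a few lines.  The following is only
an illustration in Python (the real tools are written in C and Perl),
and the event names and regular expressions are hypothetical choices,
not the ones the tools actually use:

\begin{verbatim}
import re
import time

# Hypothetical patterns marking significant events in console output.
EVENT_PATTERNS = {
    "bios_start": re.compile(r"SeaBIOS \(version"),
    "kernel_start": re.compile(r"Linux version \d"),
}

def timestamp_events(lines, clock=time.monotonic):
    """Timestamp each console line as it arrives and return
    [(event_name, ms_since_first_line)] for lines matching a pattern."""
    start = None
    events = []
    for line in lines:
        now = clock()            # time at which this line arrived
        if start is None:
            start = now
        for name, pattern in EVENT_PATTERNS.items():
            if pattern.search(line):
                events.append((name, (now - start) * 1000.0))
    return events
\end{verbatim}

In practice \texttt{lines} would be an iterator over the serial
console stream, so that each timestamp is taken when the line arrives.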

The three programs I wrote (two in C and one in Perl) use libguestfs
as a convenient framework, since libguestfs has the machinery already
for creating VMs, initramfses, capturing serial console output etc.
They are:

\begin{itemize}
\item \texttt{boot-benchmark}

\texttt{boot-benchmark} runs the boot up sequence repeatedly, throwing
away the first few runs (to warm the cache) and collecting the mean
test time and standard deviation.

\begin{verbatim}
$ ./boot-benchmark
Warming up the libguestfs cache ...
Running the tests ...

test version: libguestfs 1.33.29
 test passes: 10
host version: Linux moo.home.annexia.org 4.4.4-301.fc23.x86_64 #1 SMP
    host CPU: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
     backend: direct               [to change set $LIBGUESTFS_BACKEND]
        qemu: /home/rjones/d/qemu/x86_64-softmmu/qemu-system-x86_64
qemu version: QEMU emulator version 2.6.50, Copyright (c) 2003-2008
         smp: 1                    [to change use --smp option]
     memsize: 500                  [to change use --memsize option]
      append:                      [to change use --append option]

Result: 568.2ms ±8.7ms
\end{verbatim}

\item \texttt{boot-benchmark-range.pl}

\texttt{boot-benchmark-range.pl} is a wrapper script around
\texttt{boot-benchmark} which lets you benchmark across a range of
commits from some other project (e.g. QEMU or the kernel).  You can
easily see which commits are causing or solving performance problems
as in the example below:

\begin{verbatim}
$ ./boot-benchmark-range.pl ~/d/qemu 3123bd8^..8e86aa8
da34fed hw/ppc/spapr: Fix crash when specifying bad[...]
        1666.8ms ±2.5ms

3123bd8 Merge remote-tracking branch 'remotes/dgibson/[...]
        1658.8ms ±4.2ms

f419a62 (origin/master, origin/HEAD, master) usb/uhci: move[...]
        1671.3ms ±17.0ms

8e86aa8 Add optionrom compatible with fw_cfg DMA version
        1013.7ms ±3.0ms ↑ improves performance by 64.9%
\end{verbatim}

\item \texttt{boot-analysis}

\begin{figure}[h]
\caption{boot-analysis timeline}
\includegraphics[width=0.9\textwidth]{boot-analysis-screenshot}
\label{fig:ba-timeline}
\end{figure}

\texttt{boot-analysis} performs multiple runs of the boot sequence.
It enables the QEMU serial console (and other events from libguestfs),
timestamps the events, and then presents the sequence graphically as
shown in figure~\ref{fig:ba-timeline}.  Also shown are mean times and standard
deviations and percentage of the total run time.

\begin{figure}[h]
\caption{boot-analysis longest activities}
\includegraphics[width=0.9\textwidth]{boot-analysis-screenshot-2}
\label{fig:ba-longest}
\end{figure}

The tool also prints which activities took the longest time; see
figure~\ref{fig:ba-longest}.

\end{itemize}
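
The percentage in the last entry of the
\texttt{boot-benchmark-range.pl} example is worth a second look.
Judging only from the numbers shown (I am inferring this, not quoting
the script), it is the time saved divided by the new, faster time:
\[
  \frac{1671.3 - 1013.7}{1013.7} = \frac{657.6}{1013.7} \approx 0.649 .
\]
Measured against the old time instead, the same commit makes the boot
about 39\% shorter:
\[
  \frac{657.6}{1671.3} \approx 0.393 .
\]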

The source for these tools is here:
\url{https://github.com/libguestfs/libguestfs/tree/master/utils}.

Only now that we have the right tools to hand can we work out what
activities take time.

For consistency, all times displayed by the tools are in milliseconds
(ms), and I try to use the same convention in this paper.

In this paper I'm using times based on my laptop, an
Intel\textregistered{} Core\texttrademark{} i7-5600U CPU @ 2.60GHz
(Broadwell~U).  This does of course mean that these results won't be
exactly reproducible, but I hope that with similar hardware you
will get times that differ only by a scale factor.

\section{glibc}

Surprisingly, the first problem is glibc.  QEMU links to over 170
libraries, and that number keeps growing.  A simple
\texttt{qemu~-version} takes up to 60ms, and examining this with
\texttt{perf} showed two things:

\begin{itemize}
\item Ceph had a bug where it ran some \texttt{rdtsc} benchmarks in a
  constructor function.  This is now fixed.
\item The glibc link loader is really slow when presented with lots of
  libraries and lots of symbols.
\end{itemize}

The second problem is intractable.  We can't link to fewer libraries,
because each of those libraries represents some feature that someone
wants, like Ceph, or Gtk support (though if you remove the Gtk
dependency the link time reduces substantially).  And the link loader
is bound by all sorts of obscure ELF rules (e.g. symbol interposition)
which we don't need but cannot avoid, and which make things slow.

This is the exception to what I said earlier about QEMU features not
slowing things down.

What we can do is run QEMU fewer times.  There are several places
where we need to run QEMU.  Obviously one place is where we start the
virtual machine, and the overhead there cannot be avoided.  But we
also perform QEMU feature detection by running commands like
\texttt{qemu~-help} and \texttt{qemu~-device~\textbackslash?}, and
libguestfs now caches that output.
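
Caching of this kind can be sketched as follows.  This is a minimal
Python illustration, not the libguestfs implementation: the function
name is hypothetical, and keying the cache on the size and
modification time of the binary (so the cache is invalidated when QEMU
is upgraded) is my assumption for the example.

\begin{verbatim}
import os
import subprocess

_cache = {}

def cached_output(binary, args, run=subprocess.check_output):
    """Run `binary args` once and reuse the output until the
    binary on disk changes (size or modification time)."""
    st = os.stat(binary)
    key = (binary, tuple(args), st.st_size, st.st_mtime_ns)
    if key not in _cache:
        # Only here do we pay the process start-up cost.
        _cache[key] = run([binary, *args])
    return _cache[key]
\end{verbatim}

A persistent cache would store the same key and output in a file so
that the saving carries over between separate program runs.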

\section{QEMU}

Libguestfs, Intel Clear Containers, and any future Docker container
support we build will use \texttt{-kernel} and \texttt{-initrd} or
their equivalent.  In QEMU up to 2.6 on x86-64 this was implemented
using an interface called \texttt{fw\_cfg} and a PIO loop, and that is
very slow.  Loading the kernel and the very small initrd used by
libguestfs takes around 700ms.  In QEMU 2.7 we have added a pseudo-DMA
mode which makes this step almost instant.

To see debugging messages from the kernel and to collect our benchmark
results, we have to use an emulated 16550A UART (serial port).
Virtio-console exists but isn't a good replacement because it can't be
used to get BIOS and very early kernel messages.  The UART is slow.
It takes about 4µs per character, or approximately 1ms for 3 lines of
text.  Enabling debugging therefore changes the results subtly.
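
The two UART figures are consistent with each other.  At about 4µs
per character, one millisecond of console time carries
\[
  \frac{1\,\mathrm{ms}}{4\,\mu\mathrm{s}} = 250 \textrm{ characters},
\]
which is three lines of roughly 83 characters each.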

To get serial console output from the BIOS, we use a
Google-contributed option ROM called SGABIOS.  It quickly became clear
that SGABIOS introduced a 260ms boot delay.  This happened because it
expects to be talking to a real serial terminal, so it sends control
sequences to query the width and height of this ``terminal''.  These
queries were never answered, because the actual reader (libguestfs)
simply reads.  The solution was to modify libguestfs to respond to the
control sequence with a dummy reply, which reduced the delay to almost
nothing.

\section{libvirt}

Libguestfs can optionally use libvirt to manage the QEMU process.
When I did this it was obvious that libvirt was adding a delay of
precisely 200ms.  I tracked this down to a poorly implemented polling
loop in libvirt, waiting for the QEMU monitor socket to be created by
QEMU.  I fixed it by changing the loop to use exponential backoff.  A
better fix would involve passing pre-created file descriptors to QEMU.
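
The idea behind the fix can be sketched as below.  This is a Python
illustration of polling with exponential backoff, not the libvirt
code (which is C); the function name and the delay values are
assumptions chosen for the example.

\begin{verbatim}
import os
import time

def wait_for_path(path, timeout=5.0, first_delay=0.001, max_delay=0.1,
                  exists=os.path.exists, sleep=time.sleep):
    """Poll until `path` exists, doubling the sleep after each miss.
    Returns the total time slept, or None if the timeout is reached."""
    waited = 0.0
    delay = first_delay
    while not exists(path):
        if waited >= timeout:
            return None
        sleep(delay)
        waited += delay
        # Start with short sleeps so a quick start-up is noticed
        # quickly, but cap the delay so we never sleep for too long.
        delay = min(delay * 2, max_delay)
    return waited
\end{verbatim}

With a fixed polling interval the caller always pays at least one full
interval if the socket is not there on the first check; starting small
and doubling keeps the wasted time proportional to how long QEMU
actually takes.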

\section{SeaBIOS}

SeaBIOS wastes time probing for boot devices even though we will use
the \texttt{linuxboot} option ROM to boot (via \texttt{-kernel}).  By
building a \texttt{bios-fast.bin} variant of SeaBIOS with many unused
features disabled we can reduce the time spent inside the BIOS from
about 63ms to about 19ms.

\section{kernel}

PCI probing is slow, taking around 95ms for a guest with just two
virtio-scsi drives.  It turns out that it's not the scanning of the
PCI device space which is slow, but the initialization of each device
as it is found.  QEMU's i440fx machine model exports some legacy
devices which cannot be switched off, and that is unhelpful.

I implemented experimental support for parallel PCI probing using the
kernel ``async'' feature.  With 1~vCPU this slows things down very
slightly as expected.  With 4~vCPUs performance improved by about
30\%.  Unfortunately we can't use it because of the next point.

You would think multiple vCPUs would be better and faster than 1~vCPU,
but that is not the case.  It actually has a large negative impact on
performance.  Switching from 1 to 4~vCPUs increases the boot time by
over 200ms.  About 25ms is spent starting each secondary CPU (in
\texttt{check\_tsc\_sync\_target}).  This can be avoided by setting
\texttt{tsc=reliable} but no one can tell me if this is safe.  But
most of the extra time just disappears between the cracks -- for
example, PCI probing just slows down, but for no readily apparent
reason.  It seems as if the overhead of spinlocks or RCU or whatever
hurts general performance.  Or perhaps there is some scheduling
problem on the host since it only has 4 logical CPUs.

When the kernel runs, it does some BIOS stuff, and there's a long
delay (about 80ms) before \texttt{start\_kernel} is entered.

Another unavoidable overhead is \texttt{ftrace}, which must modify
every function in the kernel.  This takes 20ms.  You can't disable
ftrace at run time; the only option is to compile it out, but that
breaks so many useful features that we'd never persuade a distro
kernel to do that.

If your kernel has crypto functions, then it will spend 18ms testing
them at boot.  Herbert Xu accepted my patch to add a
\texttt{cryptomgr.notests} flag which bypasses this.

As we are presenting an emulated 16550A UART,
\texttt{serial\_8250\_init} runs, and this spends 25ms checking that
the UART is really a 16550A (does it work, does it have a FIFO?), and
(unsurprisingly) yes it is.  This is a totally useless waste of time,
but I have not managed to come up with a patch, or even an approach,
for avoiding this that is acceptable upstream.

But the main problem is none of the above.  It's simply the small
amount of time taken to run many, many initcalls.  For a distro kernel
this can be around 690ms (with serial debugging enabled which
exaggerates the effect somewhat).  One way to avoid that would be to
compile some sort of custom kernel, and even though this approach is
not acceptable for Fedora I did explore this, trying both a cut-down
distro kernel, and also a super-minimal kernel.

\begin{itemize}
\item The cut-down distro kernel works by removing any subsystem that
  has a $>$~1ms initcall overhead.  These include:
  \begin{itemize}
  \item auditing
  \item big\_key
  \item ftrace
  \item hugetlbfs
  \item input\_leds
  \item joydev
  \item joysticks
  \item keyboards
  \item kprobes
  \item libata
  \item mice
  \item microcode
  \item netlabel
  \item profiling support
  \item quota
  \item rtc\_drv\_cmos
  \item sound card support
  \item tablets
  \item touchscreens
  \item USB
  \item zbud
  \item zswap
  \end{itemize}
  That reduces the time taken running initcalls before userspace by
  about 20\%.  There is some scope for reducing this a bit more by
  going even further down the ``long tail'' of subsystems.
\item For my second test I started with an absolutely minimal kernel
  config (\texttt{allnoconfig}), and built up the configuration until
  I got something that booted.  That reduces the time taken running
  initcalls before userspace by about 60\% (down to 288ms).
\end{itemize}
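
As a rough cross-check of the two results above, assume that both
percentages are taken against the roughly 690ms distro kernel figure
(the measurements were not necessarily made under identical
conditions, so treat this as approximate):
\[
  690\,\mathrm{ms} \times (1 - 0.20) \approx 552\,\mathrm{ms},
  \qquad
  1 - \frac{288}{690} \approx 0.58 .
\]
So the cut-down distro kernel would still spend around 550ms in
initcalls, and the 288ms measured for the minimal kernel is a
reduction of about 58\%, in line with the ``about 60\%'' quoted.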

With a minimal kernel, we can get total boot times down to the
500--600ms range, but not any lower.

\section{udev}

udevd takes about 130ms to populate \texttt{/dev}.

The rules are monolithic, entwined together and resist modification,
and starting a new set of rules from scratch looks like it would be a
constant game of catch-up.

\section{initrd}

We use a program called supermin
(\url{http://libguestfs.org/supermin.1.html}) to construct the initrd
which is responsible for loading enough kmods to mount the real root
filesystem and pivoting into it.

Because of PIO loading of the initrd in earlier versions of QEMU, it
was very important to construct as small an initrd as possible, and
supermin was not doing a very good job of that.  However once I
started to analyze the situation there were some easy wins (now all
upstream):

\begin{itemize}
\item We were adding all virtio kmods to the appliance plus any
  dependencies, with the starting set being constructed using the
  wildcard ``\texttt{virtio*.ko}''.  The wildcard pulls in
  \texttt{virtio-gpu.ko} which depends on \texttt{drm.ko} and both are
  quite large.  Since we are only interested in non-graphical VMs, I
  was able to blacklist \texttt{virtio-gpu.ko} and that reduced the
  total size of the initrd.
\item We use a small C init program to load the kmods and mount
  the root filesystem, and this must be statically linked so we
  don't have to include a separate libc in the initrd.  However
  glibc produces enormous static binaries (800KB+).  Switching to
  dietlibc allows us to build the same program to a
  22KB binary, about $\frac{1}{40}$th of the size.
\item We initially used xz-compressed kmods.  These are smaller,
  reducing PIO loading time (but not making a lot of difference to
  DMA) but they are very slow to decompress.  Switching to
  uncompressed kmods produced a small reduction in boot time, and
  simplified the init code.
\item Stripping kmods (with \texttt{strip~-g}) is very important for
  reducing the size of the initrd.
\end{itemize}

The resulting initrd is about 126KB for the minimal kernel, or 347KB
for the standard Fedora kernel.

\section{libguestfs}

Finally there is libguestfs itself which glues everything together and
provides the initial \texttt{/init} script.  There were several
savings to be made:

\begin{itemize}
\item Even when we were not debugging, we were still reading the
  verbose kernel output over the slow UART, and then throwing it away.
  The solution was to add the \texttt{quiet} option.  That reduced boot
  time by about 1,000ms, the single largest saving.
\item We used to run the \texttt{hwclock} command.  With kvmclock it
  turns out this is not necessary, and removing it saved 300ms.
\item We used to run \texttt{qemu~-help} and \texttt{qemu~-version}.
  Drew Jones pointed out the obvious: the help output contains the
  version number, so that reduces the number of times we need to run
  QEMU and suffer the glibc slow link loader overhead (and in the
  final version of libguestfs we also memoize QEMU output, reducing it
  further).
\item We used to run SGABIOS unconditionally, but it is only necessary
  to use it when debugging.  When we're not debugging we can omit it
  and avoid loading it at all.
\item Running \texttt{ldconfig} in the appliance to update the link
  loader cache took 100ms, but we found a way that we don't need to
  run it at all.
\end{itemize}
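
To make the kernel command line savings concrete, here is a sketch of
how the \texttt{quiet} option and the \texttt{cryptomgr.notests} flag
mentioned earlier could be passed through QEMU's \texttt{-append}
option.  This is not the command line libguestfs actually generates:
the paths are placeholders, and the disks and appliance-specific
arguments are omitted.

\begin{verbatim}
$ qemu-system-x86_64 \
    -enable-kvm -nodefaults -display none \
    -smp 1 -m 500 \
    -kernel /path/to/vmlinuz \
    -initrd /path/to/initrd \
    -append "quiet cryptomgr.notests"
\end{verbatim}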

\section{Memory usage and DAX}

I was pleasantly surprised to find that Intel had implemented a virtual
NVDIMM and that ext4 + DAX works in modern kernels, so it was a
relatively trivial job to implement DAX.

However I'm not certain that the benefits are clear, nor that I'm
measuring things correctly.

Inside the guest you can run \texttt{free~-m} with and without DAX:

\begin{verbatim}
              total    used    free  shared buff/cache available
Without DAX:    485       3     451       1         30       465
With DAX:       485       3     469       1         12       467
\end{verbatim}

The MaxRSS of QEMU reduces by about 5~MB when DAX is enabled.

\section{Conclusions}

\begin{minipage}{\textwidth}
This graph is just for a bit of fun:

\includegraphics[width=0.8\textwidth]{progress}
\end{minipage}

There were a few false starts at the beginning of March (2016) where I
was exploring how we might benchmark QEMU.  But once I had written the
right tools to analyze the boot process, two quick wins brought the
time down from 3.5~seconds to 1.2~seconds in the space of a few days.
It's worth noting that the libguestfs appliance had been booting in
approx.~3--4~seconds for literally half a decade.

Getting the time under 600ms took a few weeks longer, and without some
breakthrough in the kernel or udev, I cannot see us getting the time
under 500ms.

Performance is everyone's job, but it sometimes feels like few people
care about a use case which is considered esoteric.  Yet this does
affect everyone:

\begin{itemize}
\item If we can use virtualization as an extra layer of security
  around operations, whether that is Docker, or Qubes, or
  libvirt-sandbox, or libguestfs, that benefits everyone.
\item The same concerns about boot speed are raised over and over
  again by the embedded community.  If your digital camera is slow to
  switch on, it might be running initcalls for subsystems that it will
  never use.  (Many references here:
  \url{http://elinux.org/Boot_Time})
\end{itemize}

Hopefully this paper will persuade developers to think twice before
adding an unnecessary delay loop, inserting a useless boot splash
screen, or creating another initcall.

\end{document}