Tools:
- boot-benchmark
- boot-benchmark-range
- boot-analysis
- measurement of memory in the guest (free -m) and outside (qemu maxrss)

Findings (glibc):
- link-loading is very slow: qemu -version alone takes 60ms

Findings (qemu):
- loading the -kernel and -initrd takes 700ms; using "DMA" makes it
  almost instant
- the UART is slow (4µs / char, ~3 lines of text / ms), so enabling
  debugging changes the results
- SGABIOS adds a 260ms delay, fixed by emulating a serial terminal
- "feature detection" takes 99ms (but most of that is glibc's slow
  link-loader)
  * implemented memoization in libguestfs to reduce this to 0

Findings (libvirt):
- 200ms delay waiting for the qemu monitor

Findings (kernel):
- PCI probing is slow:
    pci_subsys_init [acpi=off] takes 95ms
    insmod virtio_pci takes 51ms
    initcall virtio_pci_driver takes 22ms
  * no real solutions here, accessing PCI config space is simply slow
  * it is NOT scanning the bus which takes time
  * it IS probing and initializing the extant devices which is slow
  * qemu exports legacy devices which are unhelpful
  * using ACPI moves the initialization to acpi_init, but it still
    takes the same amount of time
- I implemented parallel PCI probing (per bus, but there's only 1 bus)
  using kernel/async.c
  * with 1 vCPU it very slightly slows things down, as expected
  * with 4 vCPUs it improves performance 66ms -> 39ms
  * but the overhead of enabling multiple vCPUs totally destroys any
    benefit (see below)
- you would think that multiple vCPUs would be better than a single
  vCPU, but it actually has a massive negative impact
  * switching 1 -> 4 vCPUs increases boot time by over 200ms
  * about 25ms is spent starting each CPU (in check_tsc_sync_target);
    setting tsc.reliable=1 skips this check
  * a lot more time just goes ... somewhere, eg. PCI probing gets
    slower, but for no particular reason.  Because of the overhead of
    locks?
- entering the kernel takes 80ms; it's very unclear exactly what it's
  doing
- acpi=off saved 190ms
  * except that I wasn't able to reproduce this slowdown in later
    versions of qemu, so I can now enable ACPI, excellent!
- ftrace initialization is slow and unavoidable (20ms)
- kernel.aesni_init takes 18ms (fixed upstream with "cryptomgr.notests")
- kernel.serial_8250_init takes 25ms
- but the main problem is lots of initcalls each taking a small amount
  of time (see the measurement sketch after these kernel findings)
  * all initcalls invoked before userspace: 690ms
- compiling a custom kernel.  Not "minimally configured"; instead I
  started from a Fedora kernel and removed things which had a > 1ms
  initcall overhead and were probably not used by my appliance:
  * remove ftrace
  * remove hugetlbfs
  * remove libata
  * remove netlabel
  * remove quota
  * remove rtc_drv_cmos
  * remove sound card support
  * remove auditing
  * remove kprobes
  * remove profiling support
  * remove zbud
  * remove big_key
  * remove joydev
  * remove keyboards
  * remove mice
  * remove joysticks
  * remove tablets
  * remove touchscreens
  * remove microcode
  * remove USB
  * remove zswap
  * remove input_leds
  initcalls-before-userspace went from ~697ms -> ~567ms
- I didn't go down the "long tail" of initcalls.  There are many, many
  initcalls that take 0.5-0.7ms each.  It seems with a custom kernel
  there is plenty of scope for reducing this further.
- some things take time but can't be disabled, eg. PERF_EVENTS (3.2ms),
  HAVE_PCSPKR_PLATFORM (1.1ms)
- very minimal config
  * allnoconfig + enabling by hand just what is needed for libguestfs,
    mostly compiled into the kernel instead of using modules
  * very tedious locating and enabling the right options
  * udev requires lots of kernel features
  initcalls-before-userspace ~697ms -> ~288ms (!)
- it's clear from the total running time without debugging that the
  very minimal kernel is not much different from the custom cut-down
  kernel above.  There's a lot of overhead from debugging initcalls /
  the slow UART.
  ==> With a custom or minimal kernel, we can get total boot times
  around 500-600ms, but no lower.
- DAX (vNVDIMM + ext4 + DAX) really works!  It has a modest effect on
  memory usage (see below).  It improves boot speed by about 20-30ms.
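For reference, per-initcall timings like the ones above can be
collected roughly like this (a sketch, not the exact commands used
here; "boot.log" is a placeholder for wherever the appliance console
output was captured):

  # Boot with "initcall_debug" on the kernel command line, then list
  # the 20 slowest initcalls reported in the kernel log.
  grep 'initcall .* returned .* after' boot.log |
    sed 's/.*initcall \([^ +]*\).* after \([0-9]*\) usecs.*/\2 \1/' |
    sort -rn | head -20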
Findings (udev):
- udev spends about 130ms
  * it runs lots of external commands & shell scripts.  Possibly we
    could use a cut-down udev, but the rules are complex and
    intertwined, and of course we want to use the distro udev as
    unmodified as possible.
  * the time is actually taken when we run 'udevadm settle', because
    we need the /dev directory to be populated before we begin other
    operations.  systemd might help ...

Findings (seabios):
- building a "bios-fast.bin" variant with many features disabled
  reduced the time spent inside the BIOS from 63ms -> 19ms (saves 44ms)
- you can enable debugging, but it slows everything down (because of
  the slow UART)
- SeaBIOS probes the PCI bus for boot disks which we will never use to
  boot

Findings (libguestfs):
- using the kernel "quiet" option reduced time from 3500ms -> 2500ms
  (the UART is slow)
- the hwclock command added 300ms
- no need to run qemu -help & qemu -version
- only use SGABIOS when verbose
- running ldconfig takes 100ms (now fixed by copying ld.so.cache into
  the appliance)

Findings (supermin):
- we were adding "virtio*.ko" to the appliance, and automatic kmod
  dependencies pulled in drm.ko (via virtio-gpu.ko); adding a
  blacklist of modules allowed us to drop these unnecessary
  dependencies, reducing the size of the initramfs and thus making
  things faster
- use dietlibc instead of glibc: initramfs 2.6M -> 1.8M
- use uncompressed kmods instead of .ko.xz; a small increase in the
  size of the initrd in return for a small reduction in boot time
- stripping kmods (with 'strip -g') helps to reduce the size of the
  initramfs, to around 126K for a minimal kernel, or 347K for a distro
  kernel

----------------------------------------------------------------------
Memory usage:

Memory usage is the other side of the coin.

DAX (vNVDIMM + ext4 + DAX) really works!

With DAX:

> free -m
              total        used        free      shared  buff/cache   available
Mem:            485           3         469           1          12         467
Swap:             0           0           0

Without DAX:

> free -m
              total        used        free      shared  buff/cache   available
Mem:            485           3         451           1          30         465
Swap:             0           0           0
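For reference, the vNVDIMM + ext4 + DAX configuration compared above
is wired up roughly as follows (a sketch only; the option spellings
vary between qemu versions, and "appliance.img" and the sizes are
placeholders):

  # Host side: expose the appliance root filesystem as an emulated NVDIMM.
  qemu-system-x86_64 \
      -machine pc,nvdimm=on \
      -m 500M,slots=2,maxmem=1G \
      -object memory-backend-file,id=mem1,share=on,mem-path=appliance.img,size=256M \
      -device nvdimm,id=nvdimm1,memdev=mem1 \
      ...

  # Guest side: the NVDIMM appears as /dev/pmem0; mounting with -o dax
  # bypasses the guest page cache.
  mount -t ext4 -o dax /dev/pmem0 /sysroot

With DAX the guest maps the filesystem directly from the host file
instead of copying it through its own page cache, which is why the
buff/cache figure above is smaller.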
I added a patch to allow us to capture the MAXRSS of the qemu process.
This is doing 'make quickcheck', which performs some typical libguestfs
activity like creating partitions and a filesystem and mounting it:

With DAX:    228240K / 235628K / 238104K  (avg: 234M)
Without DAX: 245188K / 231500K / 240208K  (avg: 239M)

----------------------------------------------------------------------
UART calculation:

#277: +416560057 [appliance] "[    0.000000] e820: BIOS-provided physical RAM map:"
#278: +416560896 [appliance] "[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009f7ff] usable"
#279: +416561538 [appliance] "[    0.000000] BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved"
#280: +417698326 [appliance] "[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved"
#281: +417699181 [appliance] "[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000001f3fbfff] usable"
#282: +418835791 [appliance] "[    0.000000] BIOS-e820: [mem 0x000000001f3fc000-0x000000001f3fffff] reserved"
#283: +418836781 [appliance] "[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved"
#284: +418837152 [appliance] "[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved"

In the kernel source this is printed by a simple loop, so there's
almost no kernel-side overhead.  Notice the unevenness of the
timestamps, probably caused by buffering inside qemu or libguestfs.
Anyway, averaged over all the lines, we can estimate about 4µs / char
(Kevin O'Connor estimates 2.5µs / char on a different machine and a
different test framework).  So printing out 3 full lines of text takes
~ 1ms.
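The per-character estimate can be reproduced mechanically from the
boot-analysis events (a sketch; it assumes the events above were saved
to a file, here called "boot-analysis.log", and that the "+" field is
a nanosecond timestamp):

  grep '^#[0-9]*: +' boot-analysis.log |
  awk -F'"' '
    { split($1, a, "+"); t = a[2] + 0          # event timestamp in ns
      if (!seen) { first = t; seen = 1 }
      last = t
      chars += length($2) + 2 }                # message plus CR/LF
    END { printf "%.1f us/char\n", (last - first) / chars / 1000 }'

Averaged over the whole log rather than the handful of lines quoted
above, this is where the ~4µs / char figure comes from.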