Tools:
- boot-benchmark
- boot-benchmark-range
- boot-analysis
- measurement of memory in the guest (free -m) and outside (qemu maxrss)

Findings (glibc):
- link-loading is very slow: qemu -version alone takes 60ms

Findings (qemu):
- loading the -kernel and -initrd takes 700ms; using "DMA" makes it
  almost instant
- the UART is slow (4µs / char, ~3 lines of text / ms), so enabling
  debugging changes the results
- SGABIOS adds a 260ms delay, fixed by emulating a serial terminal
- "feature detection" takes 99ms (but most of that is glibc's slow
  link-loader)
  * implemented memoization in libguestfs to reduce this to 0

Findings (libvirt):
- 200ms delay waiting for the qemu monitor

Findings (kernel):
- PCI probing is slow:
    pci_subsys_init [acpi=off] takes 95ms
    insmod virtio_pci takes 51ms
    initcall virtio_pci_driver takes 22ms
  * no real solutions here, accessing PCI config space is simply slow
  * it is NOT scanning the bus which takes time
  * it IS probing and initializing the extant devices which is slow
  * qemu exports legacy devices which are unhelpful
  * using ACPI moves the initialization to acpi_init, but it still
    takes the same amount of time
- I implemented parallel PCI probing (per bus, but there's only 1 bus)
  using kernel/async.c
  * with 1 vCPU it very slightly slows things down, as expected
  * with 4 vCPUs it improves performance 66ms -> 39ms
  * but the overhead of enabling multiple vCPUs totally destroys any
    benefit (see below)
- you would think that multiple vCPUs would be better than a single
  vCPU, but it actually has a massive negative impact
  * switching 1 -> 4 vCPUs increases boot time by over 200ms
  * about 25ms is spent starting each CPU (in check_tsc_sync_target);
    setting tsc.reliable=1 skips this check
  * a lot more time just goes ... somewhere, eg. PCI probing gets
    slower, but for no particular reason.  Because of the overhead of
    locks?
- entering the kernel takes 80ms; it's very unclear exactly what it's
  doing
- acpi=off saved 190ms
  * except that I wasn't able to reproduce this slowdown in later
    versions of qemu, so I can now enable ACPI, excellent!
- ftrace initialization is slow and unavoidable (20ms)
- kernel.aesni_init takes 18ms (fixed upstream with "cryptomgr.notests")
- kernel.serial_8250_init takes 25ms
- but the main problem is lots of initcalls each taking a small amount
  of time (see the measurement sketch after these kernel findings)
  * all initcalls invoked before userspace: 690ms
- compiling a custom kernel.  Not "minimally configured"; instead I
  started from a Fedora kernel and removed things which had a > 1ms
  initcall overhead and were probably not used by my appliance:
  * remove ftrace
  * remove hugetlbfs
  * remove libata
  * remove netlabel
  * remove quota
  * remove rtc_drv_cmos
  * remove sound card support
  * remove auditing
  * remove kprobes
  * remove profiling support
  * remove zbud
  * remove big_key
  * remove joydev
  * remove keyboards
  * remove mice
  * remove joysticks
  * remove tablets
  * remove touchscreens
  * remove microcode
  * remove USB
  * remove zswap
  * remove input_leds
  initcalls-before-userspace went from ~697ms -> ~567ms
- I didn't go down the "long tail" of initcalls.  There are many, many
  initcalls that take 0.5-0.7ms each.  It seems with a custom kernel
  there is plenty of scope for reducing this further.
- some things take time but can't be disabled, eg. PERF_EVENTS (3.2ms),
  HAVE_PCSPKR_PLATFORM (1.1ms)
- very minimal config
  * allnoconfig + enabling by hand just what is needed for libguestfs,
    mostly compiled into the kernel instead of using modules
  * very tedious locating and enabling the right options
  * udev requires lots of kernel features
  initcalls-before-userspace ~697ms -> ~288ms (!)
- it's clear from the total running time without debugging that the
  very minimal kernel is not much different from the custom cut-down
  kernel above.  There's a lot of overhead from debugging initcalls /
  the slow UART.
  ==> With a custom or minimal kernel, we can get total boot times
  around 500-600ms, but no lower.
- DAX (vNVDIMM + ext4 + DAX) really works!  It has a modest effect on
  memory usage (see below).  It improves boot speed by about 20-30ms.
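For reference, per-initcall timings like the ones above can be
collected roughly like this (a sketch, not the exact commands used
here; "boot.log" is a placeholder for wherever the appliance console
output was captured):

  # Boot with "initcall_debug" on the kernel command line, then list
  # the 20 slowest initcalls reported in the kernel log.
  grep 'initcall .* returned .* after' boot.log |
    sed 's/.*initcall \([^ +]*\).* after \([0-9]*\) usecs.*/\2 \1/' |
    sort -rn | head -20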
Findings (udev):
- udev spends about 130ms
  * it runs lots of external commands & shell scripts.  Possibly we
    could use a cut-down udev, but the rules are complex and
    intertwined, and of course we want to use the distro udev as
    unmodified as possible.
  * the time is actually taken when we run 'udevadm settle', because
    we need the /dev directory to be populated before we begin other
    operations.  systemd might help ...

Findings (seabios):
- building a "bios-fast.bin" variant with many features disabled
  reduced the time spent inside the BIOS from 63ms -> 19ms (saves 44ms)
- you can enable debugging, but it slows everything down (because of
  the slow UART)
- SeaBIOS probes the PCI bus for boot disks which we will never use to
  boot

Findings (libguestfs):
- using the kernel "quiet" option reduced time from 3500ms -> 2500ms
  (the UART is slow)
- the hwclock command added 300ms
- no need to run qemu -help & qemu -version
- only use SGABIOS when verbose
- running ldconfig takes 100ms (now fixed by copying ld.so.cache into
  the appliance)

Findings (supermin):
- we were adding "virtio*.ko" to the appliance, and automatic kmod
  dependencies pulled in drm.ko (via virtio-gpu.ko); adding a
  blacklist of modules allowed us to drop these unnecessary
  dependencies, reducing the size of the initramfs and thus making
  things faster
- use dietlibc instead of glibc: initramfs 2.6M -> 1.8M
- use uncompressed kmods instead of .ko.xz; a small increase in the
  size of the initrd in return for a small reduction in boot time
- stripping kmods (with 'strip -g') helps to reduce the size of the
  initramfs, to around 126K for a minimal kernel, or 347K for a distro
  kernel

----------------------------------------------------------------------
Memory usage:

Memory usage is the other side of the coin.

DAX (vNVDIMM + ext4 + DAX) really works!

With DAX:

> free -m
              total        used        free      shared  buff/cache   available
Mem:            485           3         469           1          12         467
Swap:             0           0           0

Without DAX:

> free -m
              total        used        free      shared  buff/cache   available
Mem:            485           3         451           1          30         465
Swap:             0           0           0
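For reference, the vNVDIMM + ext4 + DAX configuration compared above
is wired up roughly as follows (a sketch only; the option spellings
vary between qemu versions, and "appliance.img" and the sizes are
placeholders):

  # Host side: expose the appliance root filesystem as an emulated NVDIMM.
  qemu-system-x86_64 \
      -machine pc,nvdimm=on \
      -m 500M,slots=2,maxmem=1G \
      -object memory-backend-file,id=mem1,share=on,mem-path=appliance.img,size=256M \
      -device nvdimm,id=nvdimm1,memdev=mem1 \
      ...

  # Guest side: the NVDIMM appears as /dev/pmem0; mounting with -o dax
  # bypasses the guest page cache.
  mount -t ext4 -o dax /dev/pmem0 /sysroot

With DAX the guest maps the filesystem directly from the host file
instead of copying it through its own page cache, which is why the
buff/cache figure above is smaller.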
I added a patch to allow us to capture the MAXRSS of the qemu process.
This is doing 'make quickcheck', which performs some typical libguestfs
activity like creating partitions and a filesystem and mounting it:

With DAX:    228240K / 235628K / 238104K  (avg: 234M)
Without DAX: 245188K / 231500K / 240208K  (avg: 239M)

----------------------------------------------------------------------
UART calculation:

#277: +416560057 [appliance] "[    0.000000] e820: BIOS-provided physical RAM map:"
#278: +416560896 [appliance] "[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009f7ff] usable"
#279: +416561538 [appliance] "[    0.000000] BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved"
#280: +417698326 [appliance] "[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved"
#281: +417699181 [appliance] "[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000001f3fbfff] usable"
#282: +418835791 [appliance] "[    0.000000] BIOS-e820: [mem 0x000000001f3fc000-0x000000001f3fffff] reserved"
#283: +418836781 [appliance] "[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved"
#284: +418837152 [appliance] "[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved"

In the kernel source this is printed by a simple loop, so there's
almost no kernel-side overhead.  Notice the unevenness of the
timestamps, probably caused by buffering inside qemu or libguestfs.
Anyway, averaged over all the lines, we can estimate about 4µs / char
(Kevin O'Connor estimates 2.5µs / char on a different machine and a
different test framework).  So printing out 3 full lines of text takes
~ 1ms.
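The per-character estimate can be reproduced mechanically from the
boot-analysis events (a sketch; it assumes the events above were saved
to a file, here called "boot-analysis.log", and that the "+" field is
a nanosecond timestamp):

  grep '^#[0-9]*: +' boot-analysis.log |
  awk -F'"' '
    { split($1, a, "+"); t = a[2] + 0          # event timestamp in ns
      if (!seen) { first = t; seen = 1 }
      last = t
      chars += length($2) + 2 }                # message plus CR/LF
    END { printf "%.1f us/char\n", (last - first) / chars / 1000 }'

Averaged over the whole log rather than the handful of lines quoted
above, this is where the ~4µs / char figure comes from.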