- measurement of memory in the guest (free -m) and outside (qemu maxrss)

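  A sketch of how both numbers can be collected (the qemu command line
  below is illustrative, not the real libguestfs invocation):

      # Inside the guest, eg. from the appliance init script:
      free -m

      # Outside the guest, GNU time reports the qemu process's peak RSS:
      /usr/bin/time -v qemu-system-x86_64 -machine accel=kvm -m 512 \
          -kernel vmlinuz -initrd initrd.img -append console=ttyS0 \
          -nographic 2>&1 | grep 'Maximum resident set size'
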
- link-loading is very slow
  qemu -version takes 60ms

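  One way to see where that time goes (commands illustrative):

      # The trivial -version invocation is dominated by the dynamic
      # linker mapping and relocating qemu's shared libraries.
      time qemu-system-x86_64 -version

      # Count how many shared libraries have to be link-loaded:
      ldd $(command -v qemu-system-x86_64) | wc -l
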
- loading the -kernel and -initrd takes 700ms, using "DMA"
  makes it almost instant

- UART is slow (4µs / char) (3 lines of text / ms), so enabling
  debugging changes the results

- SGABIOS 260ms delay, fixed by emulating a serial terminal

- "feature detection" takes 99ms (but most of that is
  glibc's slow link-loader)
  * implemented memoization in libguestfs to reduce this to 0

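  A sketch of the memoization idea (illustrative only, not the actual
  libguestfs code): cache the probe output, keyed on the qemu binary
  and its mtime, so the probe only reruns when qemu is updated.

      qemu=$(command -v qemu-system-x86_64)
      key=$(stat -c '%n-%s-%Y' "$qemu" | md5sum | cut -d' ' -f1)
      cache=$HOME/.cache/qemu-features-$key
      if [ ! -f "$cache" ]; then
          "$qemu" -version >  "$cache"
          "$qemu" -help    >> "$cache"
      fi
      # ...parse "$cache" instead of running qemu again...
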
- 200ms delay waiting for qemu monitor

- PCI probing is slow, pci_subsys_init [acpi=off] takes 95ms
  insmod virtio_pci takes 51ms
  initcall virtio_pci_driver takes 22ms
  * no real solutions here, accessing PCI config space is simply slow
  * it is NOT scanning the bus which takes time
  * it IS probing and initializing extant devices which is slow
  * qemu exports legacy devices which are unhelpful
  * using ACPI moves the initialization to acpi_init, but it still
    takes the same amount of time

- I implemented parallel PCI probing (per bus, but there's only 1 bus)
  * With 1 vCPU it very slightly slows things down, as expected
  * With 4 vCPUs it improves performance 66ms -> 39ms
  * But the overhead of enabling multiple vCPUs totally destroys any
    gain (see the next point)

- you would think that multiple vCPUs would be better than a single
  vCPU, but it actually has a massive negative impact
  * Switching 1 -> 4 vCPUs increases boot time by over 200ms
  * About 25ms is spent starting each CPU (in check_tsc_sync_target).
    Setting tsc.reliable=1 skips this check
  * A lot more time just goes .. somewhere, eg. PCI probing gets
    slower for no particular reason.  Perhaps because of lock
    overhead?

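  A rough way to measure the effect (command line illustrative; the
  appliance is assumed to power itself off when its init finishes):

      for smp in 1 4; do
          echo "== -smp $smp =="
          time qemu-system-x86_64 -machine accel=kvm -m 512 -smp $smp \
              -kernel vmlinuz -initrd initrd.img \
              -append "console=ttyS0 quiet" -nographic > /dev/null
      done
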
- entering the kernel takes 80ms, very unclear exactly where that
  time goes

- acpi=off saved 190ms
  * Except that I wasn't able to reproduce this slowdown in later
    versions of qemu, so I can now enable ACPI, excellent!

- ftrace initialization is slow and unavoidable (20ms)

- kernel.aesni_init takes 18ms (fixed upstream with "cryptomgr.notests")

- kernel.serial_8250_init takes 25ms

- but the main problem is lots of initcalls each taking a small
  amount of time
  * all initcalls invoked before userspace: 690ms

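  Timings like these can be collected by booting with initcall
  debugging enabled; a sketch (the qemu command line is abbreviated and
  illustrative):

      # The kernel prints one line per initcall:
      #   initcall <fn>+0x0/0x... returned 0 after <N> usecs
      qemu-system-x86_64 -machine accel=kvm -m 512 -kernel vmlinuz \
          -initrd initrd.img -nographic \
          -append "console=ttyS0 initcall_debug ignore_loglevel" \
          | tee console.log

      # List the slowest initcalls from the captured console output:
      grep 'initcall .* returned' console.log | \
          sed 's/.*after \([0-9]*\) usecs.*/\1 &/' | sort -rn | head -20
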
- compiling a custom kernel.  Not "minimally configured", but
  instead I started from a Fedora kernel and removed things
  which had a > 1ms initcall overhead and were probably not
  needed, eg:
  * remove sound card support
  * remove profiling support
  initcalls-before-userspace went from ~697ms -> ~567ms

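  A sketch of that kind of trimming (the option names below are
  examples of the sort of thing removed, not the exact list):

      # Start from the distro config, switch off unneeded subsystems,
      # and rebuild.
      cp /boot/config-$(uname -r) .config
      scripts/config --disable SOUND --disable SND --disable PROFILING
      make olddefconfig
      make -j$(nproc) bzImage modules
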
- I didn't go down the "long tail" of initcalls.  There are many
  many initcalls that take 0.5-0.7ms.  It seems with a custom kernel
  there is plenty of scope for reducing this further.

- Some things take time but can't be disabled, eg. PERF_EVENTS (3.2ms),
  HAVE_PCSPKR_PLATFORM (1.1ms)

- very minimal config
  * allnoconfig + enabling by hand just what is needed for libguestfs,
    mostly compiled into the kernel instead of using modules
  * very tedious locating and enabling the right options
  * udev requires lots of kernel features
  initcalls-before-userspace ~697ms -> ~288ms (!)

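  A sketch of that workflow (the fragment below only illustrates the
  kind of options involved; the real list is much longer):

      make allnoconfig
      # appliance.fragment contains lines like:
      #   CONFIG_PCI=y
      #   CONFIG_VIRTIO_PCI=y
      #   CONFIG_VIRTIO_BLK=y
      #   CONFIG_VIRTIO_CONSOLE=y
      #   CONFIG_EXT4_FS=y
      scripts/kconfig/merge_config.sh -m .config appliance.fragment
      make olddefconfig
      make -j$(nproc) bzImage
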
- It's clear from the total running time without debugging that the
  very minimal kernel is not much different from the custom cut down
  kernel above.  There's a lot of overhead from debugging initcalls,
  which inflates the per-initcall figures above.

==> With a custom or minimal kernel, we can get total boot times
    around 500-600ms, but no lower.

- DAX (vNVDIMM + ext4 + DAX) really works!  It has a modest
  effect on memory usage (see below).  It improves boot speed

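  A sketch of the qemu and guest side of such a setup (device names,
  sizes and paths are illustrative):

      # qemu: expose the appliance root filesystem image as an emulated
      # NVDIMM (nvdimm=on needs maxmem/slots; other options omitted).
      qemu-system-x86_64 -machine pc,accel=kvm,nvdimm=on \
          -m 512,slots=2,maxmem=4G \
          -object memory-backend-file,id=mem1,share=on,mem-path=appliance.img,size=1G \
          -device nvdimm,id=nv1,memdev=mem1

      # guest: mount the pmem device with DAX so file data is mapped
      # directly from the host file, bypassing the guest page cache.
      mount -o dax /dev/pmem0 /sysroot
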
- udev spends about 130ms
  * Runs lots of external commands & shell scripts.
    Possibly we could use a cut-down udev, but the rules are
    complex and intertwined, and of course we want to use the
    distro udev as unmodified as possible.
  * The time is actually taken when we run 'udevadm settle'
    because we need the /dev directory to be populated before
    we begin other operations.  systemd might help ...

- building a "bios-fast.bin" variant with many features
  disabled reduced time spent inside the BIOS from
  63ms -> 19ms (saves 44ms).

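  A sketch of how such a cut-down SeaBIOS might be built and used (the
  exact set of Kconfig options disabled is not listed here):

      # In a seabios source checkout:
      make menuconfig        # switch off unneeded features
      make -j$(nproc)        # produces out/bios.bin
      cp out/bios.bin bios-fast.bin

      # Point qemu at it (rest of the command line as before):
      qemu-system-x86_64 -bios bios-fast.bin ...
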
- you can enable debugging, but it slows everything down
  (because of the slow UART)

- PCI bus probing for disks that we will never use to boot

Findings (libguestfs):

- use kernel "quiet" option
  reduced time from 3500ms -> 2500ms

- hwclock command added 300ms

- no need to run qemu -help & qemu -version

- only use SGABIOS when verbose

- running ldconfig takes 100ms (now fixed by copying ld.so.cache to
  the appliance)

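  A sketch of the idea (paths illustrative, not the actual supermin /
  libguestfs implementation):

      # When building the appliance, copy in the host's linker cache
      # instead of running ldconfig inside the guest on every boot.
      cp /etc/ld.so.cache "$APPLIANCE_ROOT/etc/ld.so.cache"
      # ...and drop the ldconfig call from the appliance init script.
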
- we were adding "virtio*.ko" to the appliance, and automatic kmod
  dependencies pulled in drm.ko (via virtio-gpu.ko); adding a
  blacklist of modules allowed us to drop these unnecessary
  dependencies, reducing the size of the initramfs and thus making
  boot faster

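  One way to see where such an unwanted dependency comes from (output
  varies by kernel build):

      # Full chain of modules that loading virtio-gpu would pull in;
      # drm shows up as an indirect dependency.
      modprobe --show-depends virtio-gpu

      # Or query a module's direct dependency list:
      modinfo -F depends virtio_gpu
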
- use dietlibc instead of glibc, initramfs 2.6M -> 1.8M

- use uncompressed kmods instead of .ko.xz; small increase in
  size of the initrd in return for a small reduction in boot time

- stripping kmods (with 'strip -g') helps to reduce the size of
  the initramfs to around 126K for a minimal kernel, or 347K for
  a distro kernel

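  A sketch of the stripping step (directory layout illustrative):

      # Remove debugging symbols from every module copied into the
      # appliance; the module loader doesn't need them.
      find "$APPLIANCE_ROOT/lib/modules" -name '*.ko' -exec strip -g {} +
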
----------------------------------------------------------------------

Memory usage is the other side of the coin.

DAX (vNVDIMM + ext4 + DAX) really works!  With DAX:

              total        used        free      shared  buff/cache   available
Mem:            485           3         469           1          12         467

Without DAX:

              total        used        free      shared  buff/cache   available
Mem:            485           3         451           1          30         465

I added a patch to allow us to capture the MAXRSS of the qemu process.
This is doing 'make quickcheck', which performs typical libguestfs
activity like creating partitions and a filesystem and mounting it:

With DAX:     228240K / 235628K / 238104K  (avg: 234M)
Without DAX:  245188K / 231500K / 240208K  (avg: 239M)

----------------------------------------------------------------------

#277: +416560057 [appliance] "[ 0.000000] e820: BIOS-provided physical RAM map:"
#278: +416560896 [appliance] "[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009f7ff] usable"
#279: +416561538 [appliance] "[ 0.000000] BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved"
#280: +417698326 [appliance] "[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved"
#281: +417699181 [appliance] "[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000001f3fbfff] usable"
#282: +418835791 [appliance] "[ 0.000000] BIOS-e820: [mem 0x000000001f3fc000-0x000000001f3fffff] reserved"
#283: +418836781 [appliance] "[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved"
#284: +418837152 [appliance] "[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved"

In the kernel source this is printed by a simple loop, so there's
almost no kernel-side overhead.

Notice the unevenness of the timestamps, probably caused by
buffering inside qemu or libguestfs.

Anyway, averaged over all the lines, we can estimate about 4µs / char
(Kevin O'Connor estimates 2.5µs / char on a different machine and
different test framework).

So printing out 3 full lines of text (~80 chars each, ie. ~240 chars
at 4µs / char) takes ~ 1ms.