% Can we use nbdkit to test kernel limits?
% - max size (already mostly covered)
% - max number of partitions:
% try 128 partitions with GPT
% then modify GPT code to try > 128 partitions

\documentclass[12pt,a4paper]{article}
\usepackage[utf8x]{inputenc}
linkcolor={red!50!black},
citecolor={blue!50!black},
urlcolor={blue!80!black}
%\usepackage{graphicx}
%\DeclareGraphicsExtensions{.pdf,.png,.jpg}
\title{Better loopback mounts with NBD}
\normalsize Red Hat Inc.
\normalsize \href{mailto:rjones@redhat.com}{rjones@redhat.com}
Loopback mounts let you mount a raw file as a device. Network Block
Device with the nbdkit server takes this concept to the next level.
You can mount compressed files. Create block devices from
concatenated files. Mount esoteric formats like VMDK. NBD is also
useful for testing: you can create giant devices. Inject
errors on demand into your block devices to test error detection and
recovery. Add delays to make disks deliberately slow. I will also
show you how to write block devices using shell scripts, and do
advanced visualization of how the kernel and filesystems use block
\section{Network Block Device}

\textit{In the talk there will be an introduction to and history of
Network Block Device. I'm not reproducing that here since you can
read about the history in articles such as
\url{https://www.linuxjournal.com/article/3778}. There will also be
a short introduction to nbdkit, our pluggable, scriptable NBD
server. For now, see \url{https://github.com/libguestfs/nbdkit}.}
\section{Loopback mounts -- simple but very limited}

Loopback mounting a file is simple:

# truncate -s 10M /tmp/test.img
# mke2fs -t ext2 /tmp/test.img
# losetup -f /tmp/test.img
# blockdev --getsize64 /dev/loop0
# mount /dev/loop0 /mnt

But this talk is about all the things you \textit{cannot} do with a
loopback mount. What if the file you want to mount is compressed?
What if you want to concatenate several files? What if you want to
use another type of storage instead of a file?

You can't do those things with a loopback mount, but there is now an
alternative: a loopback Network Block Device, backed by our pluggable,
scriptable \textbf{nbdkit} server. It's just as simple to use as
loopback mounts, but far more flexible.
If you want to follow these examples on your own machine, you will
need to install the \texttt{nbd-client} package (on Fedora:
\texttt{nbd}), and the \texttt{nbdkit} server. Most examples require

Linux Network Block Device is in general very reliable, but
unfortunately the latest released version, which several Linux
distributions ship, contains a couple of bugs (both fixed upstream).

If your Linux distro ships with NBD 3.17, make sure it includes the
following post-3.17 fix for kernel timeouts:
\url{https://github.com/NetworkBlockDevice/nbd/pull/82}

If your Linux distro uses kernel $<4.17$, then upgrading to
4.17 or above is recommended.
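The requirements above can be checked before running the examples.
The following is a minimal sketch, not something shipped with nbdkit
or nbd-client: the helper name \texttt{kernel\_at\_least} is
hypothetical, and it compares only the major and minor numbers
reported by \texttt{uname -r}.

```shell
# Hypothetical helper (not part of nbdkit or nbd-client):
#   kernel_at_least MAJOR MINOR [RELEASE]
# Succeeds if RELEASE (default: output of `uname -r`) is at least
# MAJOR.MINOR.  Only the first two numeric components are compared.
kernel_at_least() {
    want_major=$1
    want_minor=$2
    release=${3:-$(uname -r)}
    major=${release%%.*}        # text before the first dot
    rest=${release#*.}          # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Report either of the two required programs that is not on the PATH.
for prog in nbd-client nbdkit; do
    command -v "$prog" >/dev/null 2>&1 || echo "missing: $prog"
done

# Warn if the running kernel predates 4.17.
kernel_at_least 4 17 || echo "kernel $(uname -r) is older than 4.17"
```

The optional third argument exists only so that the comparison can be
tried against a release string other than the running kernel's.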
You may also need to run this command once before you start:

\section{Mounting xz-compressed disks}

Loopback mounting a compressed disk will expose a block device
containing the compressed data, which is not very useful.

nbdkit has a couple of plugins for handling gzip and xz compressed
disks. The xz plugin is quite efficient, allowing read-only random
access to compressed files:

# nbdkit xz fedora-26.xz

We can make a loopback mount called \texttt{/dev/nbd0} using one

# nbd-client -b 512 localhost 10809 /dev/nbd0

Linux automatically creates block devices for each partition in the
original (Fedora 26) disk image:

nbd0 nbd0p1 nbd0p2 nbd0p3
# file -bsL /dev/nbd0p3
SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs)
# mount /dev/nbd0p3 /mnt
mount: /mnt: WARNING: device write-protected, mounted read-only.
# cat /mnt/etc/redhat-release
Fedora release 26 (Twenty Six)

# nbd-client -d /dev/nbd0
\section{Creating a huge btrfs filesystem in memory}

nbdkit is not limited to serving files, or even by the amount of disk
space you have. You can create enormous filesystems in memory:

# nbdkit memory size=$(( 2**63 - 1 ))
# nbd-client -b 512 localhost 10809 /dev/nbd0

How big is this? $2^{63}-1$ bytes is about 8.6~billion gibibytes, or
8~EiB. If you were to buy that amount of disk at retail it would cost
you \textbf{\euro~300~million}\footnote{September 2018 prices, WD Red SATA
drives bought on Amazon.fr}.
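The size figure can be checked with shell arithmetic. This sketch
builds $2^{63}-1$ from two shifts so that no intermediate value
overflows; it assumes the shell evaluates arithmetic with signed
64-bit integers, and the variable names are illustrative.

```shell
# Assumes $(( )) uses signed 64-bit integers.
# 2^63 - 1 is built as (2^62 - 1) + 2^62 so nothing overflows.
size=$(( (1 << 62) - 1 + (1 << 62) ))
gib=$(( size / (1 << 30) ))      # whole gibibytes (2^30 bytes each)

echo "bytes: $size"
echo "GiB:   $gib"
```

This prints 9223372036854775807 bytes and 8589934591 GiB, which is
just under $2^{33}$~GiB, or 8~EiB.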
We can partition and create a filesystem just like any other device:

Number  Start (sector)    End (sector)  Size       Code  Name
   1            1024  9007199254740973   8.0 EiB     8300  Linux filesystem
# mkfs.btrfs -K /dev/nbd0p1
# mount /dev/nbd0p1 /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/nbd0p1     8.0E   17M  8.0E   1% /mnt

When you unmount the NBD partition and kill nbdkit, the device is
gone, making this very useful for testing filesystems.
\section{Concatenating files into a partitioned disk}

Whereas loopback mounts are limited to a single file, there are
several nbdkit plugins for combining files. One of them is called the
``partitioning'' plugin, and it turns partitions into disk images:

$ nbdkit partitioning \

This time I'll use \texttt{guestfish} to examine this virtual disk:

$ guestfish --format=raw -a nbd://localhost -i

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: ‘help’ for help on commands
      ‘man’ to read the manual
      ‘quit’ to quit the shell

Operating system: Fedora 26 (Twenty Six)
/dev/sda3 mounted on /
/dev/sda1 mounted on /boot

><fs> list-filesystems

You can see that the NBD disk contains three
partitions\footnote{\texttt{/dev/sdX} inside libguestfs is equivalent
to \texttt{/dev/nbd0} on the host}.
\section{Mounting a VMware VMDK file}

VMware VMDK disk images are difficult to open on Linux machines.
VMware provides a proprietary library to handle them, and nbdkit has a
plugin to handle this library (the plugin is free software, but the
VMware library that it talks to is definitely not). We can use this
to loopback mount VMDK files:

# nbdkit vddk file=TestLinux-disk1.vmdk
# nbd-client -b 512 localhost 10809 /dev/nbd0

This disk image contains two partitions and several logical volumes.
The Linux kernel finds them all automatically:

# file -bsL /dev/nbd0p1
Linux rev 1.0 ext4 filesystem data, UUID=9d1d5cb7-b453-48ac-b83b-76831398232f (needs journal recovery) (extents) (huge files)
# file -bsL /dev/nbd0p2
LVM2 PV (Linux Logical Volume Manager), UUID: bIY2oM-CgAN-npqG-gItS-WY6e-wO7d-L6G3Bv, size: 8377444864
# ls /dev/vg_testlinux/

You can read and write to VMDK files this way:

# mount /dev/vg_testlinux/lv_root /mnt
\section{Testing a RAID array}

Let's make a RAID array using in-memory block devices. To test the
array we'll want a way to inject errors into those block devices.
nbdkit makes this easy with its \textit{error filter}:

# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error0 error-rate=1 -p 10810
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error1 error-rate=1 -p 10811
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error2 error-rate=1 -p 10812
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error3 error-rate=1 -p 10813
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error4 error-rate=1 -p 10814
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error5 error-rate=1 -p 10815
We can create 6 NBD devices from these:

# nbd-client localhost 10810 /dev/nbd0
# nbd-client localhost 10811 /dev/nbd1
# nbd-client localhost 10812 /dev/nbd2
# nbd-client localhost 10813 /dev/nbd3
# nbd-client localhost 10814 /dev/nbd4
# nbd-client localhost 10815 /dev/nbd5
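The six server and client invocations above differ only in the device
index and port number, so they can be generated in a loop. The sketch
below only prints the commands (a dry run) rather than executing them.
The function name \texttt{print\_raid\_setup} and its two parameters
are illustrative; the nbdkit and nbd-client arguments are copied from
the commands above.

```shell
# Dry run: print the nbdkit and nbd-client commands for COUNT
# in-memory devices starting at BASE_PORT.  print_raid_setup is a
# hypothetical helper; the arguments it prints are the ones used in
# the text above.
print_raid_setup() {
    count=$1
    base_port=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        port=$(( base_port + i ))
        echo "nbdkit --filter=error memory size=1G error-file=/tmp/error$i error-rate=1 -p $port"
        echo "nbd-client localhost $port /dev/nbd$i"
        i=$(( i + 1 ))
    done
}

print_raid_setup 6 10810
```

Printing first makes it easy to review the port and device pairing
before anything is run as root.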
And we can create a RAID 5 device on top:

# mdadm -C /dev/md0 --level=5 \
    --raid-devices=5 --spare-devices=1 \
    /dev/nbd{0,1,2,3,4,5}
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
# mkfs -t ext4 /dev/md0
# mount /dev/md0 /mnt

You can see we have 5 drives and 1 spare in the array:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 nbd4[6] nbd5[5](S) nbd3[3] nbd2[2] nbd1[1] nbd0[0]
      4186112 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
nbdkit's error filter is triggered by the presence of the error files
\texttt{/tmp/error*}. By creating these files we can inject errors
into specific devices and see how the RAID array responds.
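Because the trigger is just the existence of a file, switching error
injection on needs nothing more than \texttt{touch}, and removing the
file with \texttt{rm} should switch it off again. A minimal sketch
follows; the helper names and the \texttt{ERROR\_DIR} variable are
hypothetical conveniences, with \texttt{/tmp} as the default to match
the \texttt{error-file} paths above.

```shell
# Hypothetical helpers for the trigger files named by error-file=.
# ERROR_DIR defaults to /tmp, matching the nbdkit commands above.
ERROR_DIR=${ERROR_DIR:-/tmp}

# inject_errors N: create the trigger file for device N.
inject_errors() { touch "$ERROR_DIR/error$1"; }

# clear_errors N: remove the trigger file for device N.
clear_errors() { rm -f "$ERROR_DIR/error$1"; }

# errors_active N: succeed if the trigger file for device N exists.
errors_active() { [ -e "$ERROR_DIR/error$1" ]; }
```

With these defined, \texttt{inject\_errors 0} creates
\texttt{/tmp/error0} and \texttt{clear\_errors 0} removes it.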
First I inject errors into \texttt{/dev/nbd0}:

After a while the kernel notices:

[10804.798999] print_req_error: I/O error, dev nbd0, sector 100360
[10804.868378] md: recovery of RAID array md0
[10805.202631] md/raid:md0: read error corrected (8 sectors at 69928 on nbd0)
[10810.349550] md: md0: recovery done.

Comparing \texttt{/proc/mdstat} before and after:

-md0 : active raid5 nbd4[6] nbd5[5](S) nbd3[3] nbd2[2] nbd1[1] nbd0[0]
+md0 : active raid5 nbd4[6] nbd5[5] nbd3[3] nbd2[2] nbd1[1] nbd0[0](F)

shows that the spare drive is now in use and nbd0 is marked as Failed.
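The same check can be scripted. The sketch below picks out array
members flagged \texttt{(F)} from text in the \texttt{/proc/mdstat}
format shown above. The function name is illustrative, and the
matching rule is an assumption based only on the sample lines in this
section.

```shell
# failed_members: read /proc/mdstat-style text on stdin and print the
# names of members flagged (F), one per line.  Hypothetical helper;
# it assumes member entries look like "nbd0[0](F)" as in the sample.
failed_members() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                name = $i
                sub(/\[.*/, "", name)   # strip "[0](F)" and keep "nbd0"
                print name
            }
    }'
}

# Sample line taken from the "after" state above.
echo 'md0 : active raid5 nbd4[6] nbd5[5] nbd3[3] nbd2[2] nbd1[1] nbd0[0](F)' |
    failed_members
```

On the sample line this prints \texttt{nbd0}. On a live system the
input would come from \texttt{/proc/mdstat} instead.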
I can inject errors into a second drive:

[11039.428009] block nbd1: Other side returned error (5)
[11039.431659] print_req_error: I/O error, dev nbd1, sector 231424
[11039.448757] block nbd1: Other side returned error (5)
[11039.452367] print_req_error: I/O error, dev nbd1, sector 233280
[11084.767968] md/raid:md0: Disk failure on nbd1, disabling device.
md/raid:md0: Operation continuing on 4 devices.

and now the array is operating in a degraded state. At the filesystem
level everything is still fine.
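To see at a glance which devices the kernel has complained about, the
\texttt{print\_req\_error} lines can be tallied per device. This is a
rough sketch that assumes the message format shown in the kernel log
excerpts above. The function name is illustrative, and other kernel
versions may word the message differently.

```shell
# count_io_errors: read kernel log text on stdin and print
# "DEVICE COUNT" for every "I/O error, dev DEVICE," line.
# Hypothetical helper; assumes the log format quoted above.
count_io_errors() {
    awk '/I\/O error, dev / {
             for (i = 1; i < NF; i++)
                 if ($i == "dev") {
                     dev = $(i + 1)
                     sub(/,$/, "", dev)   # drop the trailing comma
                     seen[dev]++
                 }
         }
         END { for (dev in seen) print dev, seen[dev] }' | sort
}
```

It would typically be fed from the kernel log, for example
\verb+dmesg | count_io_errors+.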
\section{Writing a Linux block device in shell script}

\textit{nbdkit allows you to write plugins in various programming
languages, including shell script. In the talk I will demonstrate a
Linux block device being written as a shell script.}

\section{Logging and visualization}

\textit{I am planning some visualization tools that will let you see
exactly how a block device is being read and written during common
operations like filesystem creation, file allocation, fstrim, and so
on. The talk will end with a demonstration of these tools.}