\documentclass[12pt,a4paper]{article}
\usepackage[utf8x]{inputenc}
linkcolor={red!50!black},
citecolor={blue!50!black},
urlcolor={blue!80!black}
%\usepackage{graphicx}
%\DeclareGraphicsExtensions{.pdf,.png,.jpg}
\title{Better loopback mounts with NBD}
\normalsize Red Hat Inc.
\normalsize \href{mailto:rjones@redhat.com}{rjones@redhat.com}

Loopback mounts let you mount a raw file as a device. Network Block
Device with the nbdkit server takes this concept to the next level.
You can mount compressed files, create block devices from
concatenated files, and mount esoteric formats like VMDK. NBD can also
be used for testing: you can create giant devices, inject
errors on demand into your block devices to test error detection and
recovery, and add delays to make disks deliberately slow. I will also
show you how to write block devices using shell scripts, and do
advanced visualization of how the kernel and filesystems use block
devices.

\section{Network Block Device}

\textit{In the talk there will be an introduction to and history of
Network Block Device. I'm not reproducing that here since you can
read about the history in articles such as
\url{https://www.linuxjournal.com/article/3778}. There will also be
a short introduction to nbdkit, our pluggable, scriptable NBD
server. For now, see \url{https://github.com/libguestfs/nbdkit}.}

\section{Loopback mounts -- simple but very limited}

Loopback mounting a file is simple:

\begin{verbatim}
# truncate -s 10M /tmp/test.img
# mke2fs -t ext2 /tmp/test.img
# losetup -f /tmp/test.img
# blockdev --getsize64 /dev/loop0
# mount /dev/loop0 /mnt
\end{verbatim}

But this talk is about all the things you \textit{cannot} do with a
loopback mount. What if the file you want to mount is compressed?
What if you want to concatenate several files? What if you want to
use another type of storage instead of a file?

You can't do those things with a loopback mount, but there is now an
alternative: A loopback Network Block Device, backed by our pluggable,
scriptable \textbf{nbdkit} server. It's just as simple to use as
loopback mounts, but far more flexible.

If you want to follow these examples on your own machine, you will
need to install the \texttt{nbd-client} package (on Fedora:
\texttt{nbd}), and the \texttt{nbdkit} server. Most examples require

Linux Network Block Device is in general very reliable, but there were
unfortunately a couple of bugs in the latest released version that is
present in several Linux distributions (but fixed upstream).

If your Linux distro ships with NBD 3.17, make sure it includes the
following post-3.17 fix for kernel timeouts:
\url{https://github.com/NetworkBlockDevice/nbd/pull/82}

If your Linux distro uses a kernel older than 4.17, then upgrading to
4.17 or above is recommended.
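
The kernel requirement can be checked from the command line. The
following is only a minimal sketch: the helper name
\texttt{version\_ge} is made up for this example, and it assumes a
\texttt{sort} that supports version ordering with \texttt{-V} (GNU
coreutils does).

\begin{verbatim}
# Succeeds if version $1 is greater than or equal to version $2.
# Hypothetical helper; relies on "sort -V" for version ordering.
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if version_ge "$(uname -r)" 4.17; then
    echo "kernel is 4.17 or newer"
else
    echo "kernel is older than 4.17, consider upgrading"
fi
\end{verbatim}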

You may also need to run this command once before you start:

\section{Mounting xz-compressed disks}

Loopback mounting a compressed disk will expose a block device
containing the compressed data, which is not very useful.

nbdkit has a couple of plugins for handling gzip and xz compressed
disks. The xz plugin is quite efficient, allowing read-only random
access to compressed files:

\begin{verbatim}
# nbdkit xz fedora-26.xz
\end{verbatim}

We can make a loopback mount called \texttt{/dev/nbd0} using one

\begin{verbatim}
# nbd-client -b 512 localhost 10809 /dev/nbd0
\end{verbatim}

Linux automatically creates block devices for each partition in the
original (Fedora 26) disk image:

\begin{verbatim}
nbd0 nbd0p1 nbd0p2 nbd0p3
# file -bsL /dev/nbd0p3
SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs)
# mount /dev/nbd0p3 /mnt
mount: /mnt: WARNING: device write-protected, mounted read-only.
# cat /mnt/etc/redhat-release
Fedora release 26 (Twenty Six)
\end{verbatim}

\begin{verbatim}
# nbd-client -d /dev/nbd0
\end{verbatim}

\section{Creating a huge btrfs filesystem in memory}

nbdkit is not limited to serving files or even to the limits of disk
space. You can create enormous filesystems in memory:

\begin{verbatim}
# nbdkit memory size=$(( 2**63 - 1 ))
# nbd-client -b 512 localhost 10809 /dev/nbd0
\end{verbatim}

How big is this? $2^{63}-1$ bytes is about 8.6~billion gigabytes
(counting a gigabyte as $2^{30}$ bytes). If you
were to buy that amount of disk at retail it would cost you
\textbf{\euro~300~million}\footnote{September 2018 prices, WD Red SATA
drives bought on Amazon.fr}.
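
The size can be worked out with shell arithmetic. This is a small
sketch rather than part of the setup; it builds $2^{63}-1$ from two
halves so that the calculation never exceeds a signed 64-bit integer,
and it counts a gigabyte as $2^{30}$ bytes.

\begin{verbatim}
# 2^63 - 1, built from two halves so the arithmetic never overflows
# a signed 64-bit integer.
size=$(( ((1 << 62) - 1) + (1 << 62) ))
echo "$size bytes"
echo "$(( size / (1 << 30) )) GiB"
\end{verbatim}

This prints \texttt{9223372036854775807 bytes} and
\texttt{8589934591 GiB}.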

We can partition and create a filesystem just like any other device:

\begin{verbatim}
Number  Start (sector)    End (sector)  Size       Code  Name
   1            1024  9007199254740973  8.0 EiB    8300  Linux filesystem
# mkfs.btrfs -K /dev/nbd0p1
# mount /dev/nbd0p1 /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/nbd0p1     8.0E   17M  8.0E   1% /mnt
\end{verbatim}

When you unmount the NBD partition and kill nbdkit, the device is
gone, making this very useful for testing filesystems.

\section{Concatenating files into a partitioned disk}

Whereas loopback mounts are limited to a single file, there are
several nbdkit plugins for combining files. One of them is called the
``partitioning'' plugin, and it turns files, each containing a single
partition, into one partitioned disk image:

\begin{verbatim}
$ nbdkit partitioning \
\end{verbatim}

This time I'll use \texttt{guestfish} to examine this virtual disk:

\begin{verbatim}
$ guestfish --format=raw -a nbd://localhost -i

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: ‘help’ for help on commands
      ‘man’ to read the manual
      ‘quit’ to quit the shell

Operating system: Fedora 26 (Twenty Six)
/dev/sda3 mounted on /
/dev/sda1 mounted on /boot

><fs> list-filesystems
\end{verbatim}

You can see that the NBD disk contains three
partitions\footnote{\texttt{/dev/sdX} inside libguestfs is equivalent
to \texttt{/dev/nbd0} on the host}.

\section{Mounting a VMware VMDK file}

VMware VMDK disk images are difficult to open on Linux machines.
VMware provides a proprietary library to handle them, and nbdkit has a
plugin to handle this library (the plugin is free software, but the
VMware library that it talks to is definitely not). We can use this
to loopback mount VMDK files:

\begin{verbatim}
# nbdkit vddk file=TestLinux-disk1.vmdk
# nbd-client -b 512 localhost 10809 /dev/nbd0
\end{verbatim}

This disk image contains two partitions and several logical volumes.
The Linux kernel finds them all automatically:

\begin{verbatim}
# file -bsL /dev/nbd0p1
Linux rev 1.0 ext4 filesystem data, UUID=9d1d5cb7-b453-48ac-b83b-76831398232f (needs journal recovery) (extents) (huge files)
# file -bsL /dev/nbd0p2
LVM2 PV (Linux Logical Volume Manager), UUID: bIY2oM-CgAN-npqG-gItS-WY6e-wO7d-L6G3Bv, size: 8377444864
# ls /dev/vg_testlinux/
\end{verbatim}

You can read and write to VMDK files this way:

\begin{verbatim}
# mount /dev/vg_testlinux/lv_root /mnt
\end{verbatim}

\section{Testing a RAID array}

Let's make a RAID array using in-memory block devices. But to test
them we'll want a way to inject errors into those block devices.
nbdkit makes this easy with its \textit{error filter}:

\begin{verbatim}
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error0 error-rate=1 -p 10810
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error1 error-rate=1 -p 10811
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error2 error-rate=1 -p 10812
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error3 error-rate=1 -p 10813
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error4 error-rate=1 -p 10814
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error5 error-rate=1 -p 10815
\end{verbatim}

We can create 6 NBD devices from these:

\begin{verbatim}
# nbd-client localhost 10810 /dev/nbd0
# nbd-client localhost 10811 /dev/nbd1
# nbd-client localhost 10812 /dev/nbd2
# nbd-client localhost 10813 /dev/nbd3
# nbd-client localhost 10814 /dev/nbd4
# nbd-client localhost 10815 /dev/nbd5
\end{verbatim}
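
The six server and client invocations differ only in the device index
and the port, so they can be generated in a loop. The sketch below
only prints the commands, so it is safe to run without root. The
function name \texttt{raid\_test\_commands} is made up for this
example; the command lines themselves are the ones shown above.

\begin{verbatim}
# Print (do not run) the nbdkit and nbd-client commands for N
# error-injectable RAM disks.  Hypothetical helper.
raid_test_commands() {
    n=$1
    base_port=10810
    i=0
    while [ "$i" -lt "$n" ]; do
        port=$(( base_port + i ))
        echo "nbdkit --filter=error memory size=1G" \
             "error-file=/tmp/error$i error-rate=1 -p $port"
        echo "nbd-client localhost $port /dev/nbd$i"
        i=$(( i + 1 ))
    done
}

raid_test_commands 6
\end{verbatim}

Piping the output to \texttt{sh} as root would run the commands
instead of printing them.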

And we can create a RAID 5 device on top:

\begin{verbatim}
# mdadm -C /dev/md0 --level=5 \
    --raid-devices=5 --spare-devices=1 \
    /dev/nbd{0,1,2,3,4,5}
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
# mkfs -t ext4 /dev/md0
# mount /dev/md0 /mnt
\end{verbatim}

You can see we have 5 drives and 1 spare in the array:

\begin{verbatim}
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 nbd4[6] nbd5[5](S) nbd3[3] nbd2[2] nbd1[1] nbd0[0]
      4186112 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
\end{verbatim}
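
The block count in \texttt{/proc/mdstat} can be cross-checked by hand.
RAID 5 gives up one member's worth of space to parity, so the usable
capacity is the number of active members minus one, times the data
area of each member. The 2048 KiB reserved per member in the sketch
below is not taken from the mdadm documentation. It is inferred from
the output above and should be treated as an assumption.

\begin{verbatim}
members=5                      # active devices, the spare is not counted
member_kib=$(( 1024 * 1024 ))  # each nbdkit disk is 1G = 1048576 KiB
reserved_kib=2048              # assumed per-member metadata and offset
usable_kib=$(( (members - 1) * (member_kib - reserved_kib) ))
echo "$usable_kib"
\end{verbatim}

This prints \texttt{4186112}, matching the number of 1 KiB blocks
reported for \texttt{md0}.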

nbdkit's error filter is triggered by the presence of the error files
\texttt{/tmp/error*}. By creating these files we can inject errors
into specific devices and see how the RAID array responds.
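
Since the trigger is simply whether a file exists, switching errors on
and off for one device can be wrapped in two small helpers. This is a
sketch under assumptions: the names \texttt{inject\_errors},
\texttt{clear\_errors} and \texttt{ERROR\_DIR} are made up for this
example, and the directory defaults to \texttt{/tmp} to match the
\texttt{error-file} parameters used above.

\begin{verbatim}
# Hypothetical helpers; ERROR_DIR defaults to /tmp to match the
# error-file=/tmp/errorN parameters passed to nbdkit.
inject_errors() { touch "${ERROR_DIR:-/tmp}/error$1"; }
clear_errors()  { rm -f "${ERROR_DIR:-/tmp}/error$1"; }
\end{verbatim}

With the servers started as above, \texttt{inject\_errors 0} should
make the server behind \texttt{/dev/nbd0} start failing requests, and
\texttt{clear\_errors 0} should stop it again.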

First, I inject errors into \texttt{/dev/nbd0}:

After a while the kernel notices:

\begin{verbatim}
[10804.798999] print_req_error: I/O error, dev nbd0, sector 100360
[10804.868378] md: recovery of RAID array md0
[10805.202631] md/raid:md0: read error corrected (8 sectors at 69928 on nbd0)
[10810.349550] md: md0: recovery done.
\end{verbatim}

Comparing \texttt{/proc/mdstat} before and after:

\begin{verbatim}
-md0 : active raid5 nbd4[6] nbd5[5](S) nbd3[3] nbd2[2] nbd1[1] nbd0[0]
+md0 : active raid5 nbd4[6] nbd5[5] nbd3[3] nbd2[2] nbd1[1] nbd0[0](F)
\end{verbatim}

shows that the spare drive is now in use and nbd0 is marked as Failed.
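
Spotting the \texttt{(F)} flag by eye gets tedious when a test is
repeated. The following sketch lists the failed members found in
mdstat-style text on standard input. The function name
\texttt{failed\_members} is made up for this example, and it assumes a
\texttt{grep} that supports \texttt{-o} and \texttt{-E}.

\begin{verbatim}
# List member devices flagged (F) in mdstat-style text on stdin.
# Hypothetical helper.
failed_members() {
    grep -oE '[[:alnum:]]+\[[0-9]+\]\(F\)' | sed 's/\[.*//'
}

# On a live system:
#   failed_members < /proc/mdstat
\end{verbatim}

Fed the ``after'' line from the comparison above, it prints
\texttt{nbd0}.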

I can inject errors into a second drive:

\begin{verbatim}
[11039.428009] block nbd1: Other side returned error (5)
[11039.431659] print_req_error: I/O error, dev nbd1, sector 231424
[11039.448757] block nbd1: Other side returned error (5)
[11039.452367] print_req_error: I/O error, dev nbd1, sector 233280
[11084.767968] md/raid:md0: Disk failure on nbd1, disabling device.
               md/raid:md0: Operation continuing on 4 devices.
\end{verbatim}

and now the array is operating in a degraded state. At the filesystem
level everything is still fine.

\section{Writing a Linux block device in shell script}

\textit{nbdkit allows you to write plugins in various programming
languages, including shell script. In the talk I will demonstrate a
Linux block device being written as a shell script.}
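
To give a rough idea of what such a plugin can look like, here is a
minimal sketch of a read-only 1 MiB disk that reads as all zeroes. It
is not the script from the talk. It assumes the calling convention
described in the \texttt{nbdkit-sh-plugin(3)} manual page: the script
is run with the method name as its first argument, \texttt{pread}
receives a handle, a byte count and an offset and writes the data to
standard output, and exit status 2 means the method is not
implemented. The function name \texttt{plugin\_dispatch} is made up
for this example, so check the manual page before relying on any
detail.

\begin{verbatim}
#!/bin/sh
# Minimal sketch: a read-only 1 MiB disk that reads as all zeroes.
plugin_dispatch() {
    case "$1" in
        get_size)
            echo 1048576 ;;
        pread)
            # $2 = handle, $3 = byte count, $4 = offset (unused here
            # because every byte of the disk is zero)
            head -c "$3" /dev/zero ;;
        *)
            # tell nbdkit that this method is not implemented
            return 2 ;;
    esac
}

# In a real plugin script the last line would be:
#   plugin_dispatch "$@"
\end{verbatim}

Saved as an executable file with that last line enabled, the script
would be served with something like \texttt{nbdkit sh ./zero-disk.sh},
where the file name is again only an example.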

\section{Logging and visualization}

\textit{I am planning some visualization tools that will let you see
exactly how a block device is being read and written during common
operations like filesystem creation, file allocation, fstrim, and so
on. The talk will end with a demonstration of these tools.}