\documentclass[12pt,a4paper]{article}
\usepackage[utf8x]{inputenc}
\usepackage{xcolor}
\usepackage[colorlinks=true,
linkcolor={red!50!black},
citecolor={blue!50!black},
urlcolor={blue!80!black}]{hyperref}
%\usepackage{graphicx}
%\DeclareGraphicsExtensions{.pdf,.png,.jpg}
\title{Better loopback mounts with NBD}
\normalsize Red Hat Inc.
\normalsize \href{mailto:rjones@redhat.com}{rjones@redhat.com}
Loopback mounts let you mount a raw file as a device. Network Block
Device with the nbdkit server takes this concept to the next level.
You can mount compressed files, create block devices from
concatenated files, and mount esoteric formats like VMDK. NBD is also
useful for testing: you can create giant devices, inject errors on
demand into your block devices to test error detection and recovery,
and add delays to make disks deliberately slow. I will also show you
how to write block devices using shell scripts, and how to visualize
in detail the way the kernel and filesystems use block devices.
\section{Network Block Device}

\textit{In the talk there will be an introduction to, and history of,
Network Block Device. I'm not reproducing that here, since you can
read about the history in articles such as
\url{https://www.linuxjournal.com/article/3778}. There will also be
a short introduction to nbdkit, our pluggable, scriptable NBD
server. For now, see \url{https://github.com/libguestfs/nbdkit}.}
\section{Loopback mounts -- simple but very limited}

Loopback mounting a file is simple:
\begin{verbatim}
# truncate -s 10M /tmp/test.img
# mke2fs -t ext2 /tmp/test.img
# losetup -f /tmp/test.img
# blockdev --getsize64 /dev/loop0
# mount /dev/loop0 /mnt
\end{verbatim}

But this talk is about all the things you \textit{cannot} do with a
loopback mount. What if the file you want to mount is compressed?
What if you want to concatenate several files? What if you want to
use another type of storage instead of a file?

You can't do those things with a loopback mount, but there is now an
alternative: a loopback Network Block Device, backed by our pluggable,
scriptable \textbf{nbdkit} server. It is just as simple to use as a
loopback mount, but far more flexible.
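
To make that concrete, here is a shell sketch of the loopback example
above redone with nbdkit's \texttt{file} plugin. It assumes nbdkit's
default port (10809) and a free \texttt{/dev/nbd0}; \texttt{run},
\texttt{nbd\_loop\_mount} and \texttt{DRYRUN} are hypothetical helpers,
not part of nbdkit. With \texttt{DRYRUN=1} the commands are only
printed; running them for real needs root.
\begin{verbatim}
#!/bin/sh
# Hypothetical helper: run a command, or only print it if DRYRUN is set.
run() {
    if [ -n "${DRYRUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: serve an image file with nbdkit and mount it.
# $1 = image file, $2 = mount point, $3 = NBD device (default /dev/nbd0).
nbd_loop_mount() {
    dev=${3:-/dev/nbd0}
    run nbdkit file "$1"                           # serves on port 10809
    run nbd-client -b 512 localhost 10809 "$dev"   # attach the NBD device
    run mount "$dev" "$2"
}
\end{verbatim}
For example, \texttt{DRYRUN=1 nbd\_loop\_mount /tmp/test.img /mnt}
prints the three commands it would run.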

If you want to follow these examples on your own machine, you will
need to install the \texttt{nbd-client} package (on Fedora:
\texttt{nbd}) and the \texttt{nbdkit} server. Most examples require

Linux Network Block Device is in general very reliable, but the latest
released version, which several Linux distributions ship, unfortunately
has a couple of bugs (fixed upstream).

If your Linux distro ships with NBD 3.17, make sure it includes the
following post-3.17 fix for kernel timeouts:
\url{https://github.com/NetworkBlockDevice/nbd/pull/82}

If your Linux distro uses a kernel older than 4.17, upgrading to
4.17 or later is recommended.

You may also need to run this command once before you start:
\section{Mounting xz-compressed disks}

Loopback mounting a compressed disk will expose a block device
containing the compressed data, which is not very useful.

nbdkit has a couple of plugins for handling gzip- and xz-compressed
disks. The xz plugin is quite efficient, allowing read-only random
access to compressed files:
\begin{verbatim}
# nbdkit xz fedora-26.xz
\end{verbatim}
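
Random access only works well if the file is made of many independent
xz blocks: the xz format keeps an index of the sizes of its blocks, so
a reader can locate and start decompressing at the block containing
the requested offset rather than at the start of the file. A minimal
sketch for preparing an image that way, assuming xz~5.2 or later (for
\texttt{--block-size}); the function names are hypothetical, and the
16~MiB default block size is an assumption rather than a requirement
of the plugin:
\begin{verbatim}
#!/bin/sh
# Hypothetical helper: compress a raw image into independent xz blocks so
# that nbdkit's xz plugin can seek without decompressing from the start.
# $1 = raw image, $2 = block size in bytes (default 16 MiB, an assumption).
xz_for_nbdkit() {
    xz -k -f --block-size="${2:-16777216}" "$1"   # writes "$1.xz", keeps "$1"
}

# Hypothetical helper: print the number of blocks in an .xz file, taken
# from the "file" line of xz's machine-readable --list output (column 3).
xz_blocks() {
    xz --robot --list "$1" | awk -F '\t' '$1 == "file" { print $3 }'
}
\end{verbatim}
For example, \texttt{xz\_for\_nbdkit fedora-26.img} would produce
\texttt{fedora-26.img.xz} for \texttt{nbdkit xz} to serve, and
\texttt{xz\_blocks} can confirm that the result contains more than one
block.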

We can make a loopback mount called \texttt{/dev/nbd0} using one
\begin{verbatim}
# nbd-client -b 512 localhost 10809 /dev/nbd0
\end{verbatim}
Linux automatically creates block devices for each partition in the
original (Fedora 26) disk image:
\begin{verbatim}
nbd0 nbd0p1 nbd0p2 nbd0p3
# file -bsL /dev/nbd0p3
SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs)
# mount /dev/nbd0p3 /mnt
mount: /mnt: WARNING: device write-protected, mounted read-only.
# cat /mnt/etc/redhat-release
Fedora release 26 (Twenty Six)
\end{verbatim}

\begin{verbatim}
# nbd-client -d /dev/nbd0
\end{verbatim}
\section{Creating a huge btrfs filesystem in memory}

nbdkit is not limited to serving files or even to the limits of disk
space. You can create enormous filesystems in memory (the memory
plugin is sparse, so only blocks that are actually written consume
RAM):
\begin{verbatim}
# nbdkit memory size=$(( 2**63 - 1 ))
# nbd-client -b 512 localhost 10809 /dev/nbd0
\end{verbatim}
How big is this? $2^{63}-1$ bytes is about 8.6~billion gibibytes, or
just under 8~EiB. If you were to buy that amount of disk at retail, it
would cost you \textbf{\euro~300~million}\footnote{September 2018
prices, WD Red SATA drives bought on Amazon.fr}.
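
For reference, the arithmetic behind these figures is worked below,
using binary units for the size and decimal terabytes (as drives are
sold) for the price per terabyte, which is derived from the total
above rather than quoted from the footnoted source:
\[
2^{63}-1 = 9\,223\,372\,036\,854\,775\,807\,\mathrm{bytes}
\approx 8.59\times10^{9}\,\mathrm{GiB} \approx 8\,\mathrm{EiB},
\]
\[
\frac{\mbox{\euro}\,300\times10^{6}}{9.22\times10^{6}\,\mathrm{TB}}
\approx \mbox{\euro}\,33\ \mbox{per TB}.
\]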

We can partition and create a filesystem just like any other device:
\begin{verbatim}
Number  Start (sector)    End (sector)  Size       Code  Name
   1            1024  9007199254740973   8.0 EiB    8300  Linux filesystem
# mkfs.btrfs -K /dev/nbd0p1
# mount /dev/nbd0p1 /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/nbd0p1     8.0E   17M  8.0E   1% /mnt
\end{verbatim}
When you unmount the filesystem, disconnect the NBD device and kill
nbdkit, the device and everything on it are gone, which makes this
very useful for testing filesystems.
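
The teardown in the order just described can be sketched as below. It
assumes nbdkit was started with \texttt{-P /tmp/nbdkit.pid} (nbdkit's
option for writing a PID file), which the command above does not do;
\texttt{run}, \texttt{nbd\_teardown}, \texttt{PIDFILE} and
\texttt{DRYRUN} are hypothetical names, and with \texttt{DRYRUN=1} the
commands are only printed.
\begin{verbatim}
#!/bin/sh
# Hypothetical helper: run a command, or only print it if DRYRUN is set.
run() {
    if [ -n "${DRYRUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: unmount, disconnect the NBD device, stop nbdkit.
# $1 = mount point, $2 = NBD device (default /dev/nbd0).
# Assumes nbdkit was started with "-P $PIDFILE" (default /tmp/nbdkit.pid).
nbd_teardown() {
    pidfile=${PIDFILE:-/tmp/nbdkit.pid}
    run umount "$1"
    run nbd-client -d "${2:-/dev/nbd0}"
    if [ -r "$pidfile" ]; then
        run kill "$(cat "$pidfile")"   # the in-memory disk is discarded here
    fi
}
\end{verbatim}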
\section{Concatenating files into a partitioned disk}

\textit{In the talk, this section will cover creating a virtual disk
with a virtual partition table using the nbdkit ``partitioning''
plugin.}

\section{Mounting a VMware VMDK file}

\textit{In the talk, this section will cover modifying VMware VMDK
files using the nbdkit ``vddk'' plugin.}
\section{Testing a RAID array}

Let's make a RAID array using in-memory block devices. To test it, we
need a way to inject errors into those block devices. nbdkit makes
this easy with its \textit{error filter}:
\begin{verbatim}
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error0 error-rate=1 -p 10810
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error1 error-rate=1 -p 10811
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error2 error-rate=1 -p 10812
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error3 error-rate=1 -p 10813
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error4 error-rate=1 -p 10814
# nbdkit --filter=error memory size=1G \
    error-file=/tmp/error5 error-rate=1 -p 10815
\end{verbatim}
We can then attach 6 NBD devices to these servers:
\begin{verbatim}
# nbd-client localhost 10810 /dev/nbd0
# nbd-client localhost 10811 /dev/nbd1
# nbd-client localhost 10812 /dev/nbd2
# nbd-client localhost 10813 /dev/nbd3
# nbd-client localhost 10814 /dev/nbd4
# nbd-client localhost 10815 /dev/nbd5
\end{verbatim}
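
The twelve commands above follow a simple pattern: disk $i$ (for
$i=0,\ldots,5$) listens on port $10810+i$, uses the error file
\texttt{/tmp/error}$i$ and is attached as \texttt{/dev/nbd}$i$. A shell
sketch that generates them with loops; \texttt{run},
\texttt{start\_raid\_members} and \texttt{DRYRUN} are hypothetical
names, and with \texttt{DRYRUN=1} the commands are printed rather than
run:
\begin{verbatim}
#!/bin/sh
# Hypothetical helper: run a command, or only print it if DRYRUN is set.
run() {
    if [ -n "${DRYRUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start N (default 6) 1 GiB in-memory disks with the error filter on
# ports 10810, 10811, ..., then attach disk i as /dev/nbd<i>.
start_raid_members() {
    n=${1:-6}
    i=0
    while [ "$i" -lt "$n" ]; do
        run nbdkit --filter=error memory size=1G \
            error-file="/tmp/error$i" error-rate=1 -p "$((10810 + i))"
        i=$((i + 1))
    done
    i=0
    while [ "$i" -lt "$n" ]; do
        run nbd-client localhost "$((10810 + i))" "/dev/nbd$i"
        i=$((i + 1))
    done
}
\end{verbatim}
As in the transcript, all servers are started before any client
connects; the disk count is a parameter so the same sketch can build
smaller or larger arrays.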
And we can create a RAID 5 device on top:
\begin{verbatim}
# mdadm -C /dev/md0 --level=5 \
    --raid-devices=5 --spare-devices=1 \
    /dev/nbd{0,1,2,3,4,5}
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
# mkfs -t ext4 /dev/md0
# mount /dev/md0 /mnt
\end{verbatim}
\texttt{/proc/mdstat} shows 5 active drives and 1 spare (marked
\texttt{(S)}) in the array:
\begin{verbatim}
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 nbd4[6] nbd5[5](S) nbd3[3] nbd2[2] nbd1[1] nbd0[0]
      4186112 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
\end{verbatim}
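
The reported size is what RAID~5 predicts. One member's worth of
capacity goes to parity, so with $n=5$ active 1~GiB members (the spare
adds no capacity) and \texttt{/proc/mdstat} counting in 1~KiB blocks:
\[
(n-1)\times 1\,\mathrm{GiB} = 4\times 1\,048\,576\,\mathrm{KiB}
= 4\,194\,304\,\mathrm{KiB},
\qquad
4\,194\,304 - 4\,186\,112 = 8\,192\,\mathrm{KiB}.
\]
The shortfall is 2~MiB per active member, presumably space that mdadm
sets aside on each device (for example for its metadata or for
alignment) rather than using for data.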

nbdkit's error filter is triggered by the presence of the error files
\texttt{/tmp/error*}: while a file exists, requests to the
corresponding device fail. By creating these files we can inject
errors into specific devices and see how the RAID array responds.
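
Since only the existence of the file matters, turning errors on and
off is a matter of creating and deleting it. A minimal sketch of
helpers for this; \texttt{inject\_errors}, \texttt{stop\_errors},
\texttt{failing\_disks} and \texttt{ERRDIR} are hypothetical names,
with \texttt{ERRDIR} defaulting to \texttt{/tmp} to match the
\texttt{error-file=} paths above:
\begin{verbatim}
#!/bin/sh
# Directory holding the error trigger files (hypothetical variable; the
# nbdkit commands in the text use error-file=/tmp/errorN).
ERRDIR=${ERRDIR:-/tmp}

# Start failing requests to /dev/nbd<N> by creating its trigger file.
inject_errors() {
    touch "$ERRDIR/error$1"
}

# Stop injecting errors into /dev/nbd<N> by removing its trigger file.
stop_errors() {
    rm -f "$ERRDIR/error$1"
}

# List the disk numbers whose trigger files currently exist.
failing_disks() {
    for f in "$ERRDIR"/error[0-9]*; do
        if [ -e "$f" ]; then
            echo "${f##*/error}"
        fi
    done
}
\end{verbatim}
For example, \texttt{inject\_errors 0} starts failing requests to
\texttt{/dev/nbd0} and \texttt{stop\_errors 0} stops them.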

First, I inject errors into \texttt{/dev/nbd0}:

After a while the kernel notices:
\begin{verbatim}
[10804.798999] print_req_error: I/O error, dev nbd0, sector 100360
[10804.868378] md: recovery of RAID array md0
[10805.202631] md/raid:md0: read error corrected (8 sectors at 69928 on nbd0)
[10810.349550] md: md0: recovery done.
\end{verbatim}
Comparing \texttt{/proc/mdstat} before and after:
\begin{verbatim}
-md0 : active raid5 nbd4[6] nbd5[5](S) nbd3[3] nbd2[2] nbd1[1] nbd0[0]
+md0 : active raid5 nbd4[6] nbd5[5] nbd3[3] nbd2[2] nbd1[1] nbd0[0](F)
\end{verbatim}
shows that the spare drive (\texttt{nbd5}) is now in use and
\texttt{nbd0} is marked as failed (\texttt{(F)}).
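
Reading these flags by eye gets tedious with larger arrays. A minimal
sketch that lists each member of an array with its state, assuming the
member syntax shown above (\texttt{name[index]}, optionally followed by
\texttt{(S)} for a spare or \texttt{(F)} for a failed device);
\texttt{md\_members} is a hypothetical name:
\begin{verbatim}
#!/bin/sh
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# "device state" for every array member, where state is active, spare
# (marked "(S)") or failed (marked "(F)").
md_members() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # Member tokens look like nbd3[3], nbd5[5](S) or nbd0[0](F).
            if ($i ~ /^[a-z0-9]+\[[0-9]+\]/) {
                name = $i
                sub(/\[.*/, "", name)
                state = "active"
                if ($i ~ /\(F\)$/)
                    state = "failed"
                else if ($i ~ /\(S\)$/)
                    state = "spare"
                print name, state
            }
        }
    }'
}
\end{verbatim}
On a live system it could be used as
\texttt{grep '\^{}md0 ' /proc/mdstat | md\_members}.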

I can inject errors into a second drive:
\begin{verbatim}
[11039.428009] block nbd1: Other side returned error (5)
[11039.431659] print_req_error: I/O error, dev nbd1, sector 231424
[11039.448757] block nbd1: Other side returned error (5)
[11039.452367] print_req_error: I/O error, dev nbd1, sector 233280
[11084.767968] md/raid:md0: Disk failure on nbd1, disabling device.
               md/raid:md0: Operation continuing on 4 devices.
\end{verbatim}
and now the array is operating in a degraded state. At the filesystem
level everything is still fine.
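
When several devices are failing at once, it helps to tally the
kernel's I/O errors per device. A sketch that reads
\texttt{dmesg}-style lines on standard input; it assumes the
\texttt{I/O error, dev NAME,} wording seen in the messages above
(other kernel versions may phrase it differently), and
\texttt{io\_errors\_by\_dev} is a hypothetical name:
\begin{verbatim}
#!/bin/sh
# Hypothetical helper: count "I/O error, dev NAME, ..." kernel messages
# per device, printing "NAME COUNT" lines sorted by device name.
io_errors_by_dev() {
    awk '/I\/O error, dev / {
        for (i = 1; i < NF; i++) {
            if ($i == "dev") {
                d = $(i + 1)
                sub(/,$/, "", d)    # strip the trailing comma
                n[d]++
            }
        }
    }
    END { for (d in n) print d, n[d] }' | sort
}
\end{verbatim}
On a live system it could be used as \texttt{dmesg | io\_errors\_by\_dev}.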
\section{Writing a Linux block device in shell script}

\textit{nbdkit allows you to write plugins in various programming
languages, including shell script. In the talk, I will demonstrate
writing a Linux block device as a shell script.}
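
As a taste of what that looks like, here is a minimal sketch of a
64~MiB disk full of zeroes, written for nbdkit's \texttt{sh} plugin. It
assumes the calling convention described in the nbdkit-sh-plugin
documentation: the method name arrives in \texttt{\$1}, \texttt{pread}
receives the handle, byte count and offset in \texttt{\$2},
\texttt{\$3} and \texttt{\$4} and must write exactly that many bytes to
standard output, and exit status~2 means the method is not
implemented. The path \texttt{/tmp/zero-disk.sh} is hypothetical.
\begin{verbatim}
#!/bin/sh
# Write a tiny nbdkit sh plugin to a (hypothetical) path.
cat > /tmp/zero-disk.sh <<'EOF'
#!/bin/sh
# Minimal nbdkit sh plugin: a 64 MiB disk that reads back as all zeroes.
case "$1" in
    get_size)
        # Disk size in bytes.
        echo 67108864
        ;;
    pread)
        # $2 = handle, $3 = count, $4 = offset: print exactly $3 bytes.
        head -c "$3" /dev/zero
        ;;
    *)
        # Exit status 2 tells nbdkit this method is not implemented.
        exit 2
        ;;
esac
EOF
chmod +x /tmp/zero-disk.sh
\end{verbatim}
It could then be served with \texttt{nbdkit sh /tmp/zero-disk.sh} and
attached with \texttt{nbd-client} like the other examples.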
\section{Logging and visualization}

\textit{I am planning some visualization tools that will let you see
exactly how a block device is being read and written during common
operations like filesystem creation, file allocation, fstrim, and so
on. The talk will end with a demonstration of these tools.}