From 11cec0e3dd88fa2eab119aaa087c0a9f120eb267 Mon Sep 17 00:00:00 2001 From: "Richard W.M. Jones" Date: Mon, 17 Sep 2018 10:35:40 +0100 Subject: [PATCH] Proposed paper for FOSDEM 2019. --- 2019-fosdem/paper/.gitignore | 8 + 2019-fosdem/paper/Makefile | 9 + .../paper/fosdem-rjones-better-loopback-paper.tex | 309 +++++++++++++++++++++ 3 files changed, 326 insertions(+) create mode 100644 2019-fosdem/paper/.gitignore create mode 100644 2019-fosdem/paper/Makefile create mode 100644 2019-fosdem/paper/fosdem-rjones-better-loopback-paper.tex diff --git a/2019-fosdem/paper/.gitignore b/2019-fosdem/paper/.gitignore new file mode 100644 index 0000000..a8ad516 --- /dev/null +++ b/2019-fosdem/paper/.gitignore @@ -0,0 +1,8 @@ +/.~lock.*# +/*.aux +/*.dvi +/*.fdb_latexmk +/*.fls +/*.log +/*.out +/*.pdf diff --git a/2019-fosdem/paper/Makefile b/2019-fosdem/paper/Makefile new file mode 100644 index 0000000..6593ce4 --- /dev/null +++ b/2019-fosdem/paper/Makefile @@ -0,0 +1,9 @@ +paper = fosdem-rjones-better-loopback-paper + +all: $(paper).pdf + +$(paper).pdf: $(paper).tex + latexmk -pdf $< + +clean: + rm -f *.pdf *.aux *.out *.log *~ diff --git a/2019-fosdem/paper/fosdem-rjones-better-loopback-paper.tex b/2019-fosdem/paper/fosdem-rjones-better-loopback-paper.tex new file mode 100644 index 0000000..6a0e47f --- /dev/null +++ b/2019-fosdem/paper/fosdem-rjones-better-loopback-paper.tex @@ -0,0 +1,309 @@ +\documentclass[12pt,a4paper]{article} +\usepackage[utf8x]{inputenc} +\usepackage{parskip} +\usepackage{hyperref} +\usepackage{xcolor} +\hypersetup{ + colorlinks, + linkcolor={red!50!black}, + citecolor={blue!50!black}, + urlcolor={blue!80!black} +} +\usepackage{abstract} +%\usepackage{graphicx} +%\DeclareGraphicsExtensions{.pdf,.png,.jpg} +\usepackage{eurosym} +\usepackage{float} +\floatstyle{boxed} +\restylefloat{figure} +\usepackage{fancyhdr} + \pagestyle{fancy} + %\fancyhead{} + %\fancyfoot{} + +\title{Better loopback mounts with NBD} +\author{ +\large +Richard W.M. 
Jones +\normalsize Red Hat Inc. +\normalsize \href{mailto:rjones@redhat.com}{rjones@redhat.com} +} +\date{February 2019} + +\begin{document} +\maketitle + +\begin{abstract} +Loopback mounts let you mount a raw file as a device. Network Block +Device with the nbdkit server takes this concept to the next level. +You can mount compressed files. Create block devices from +concatenated files. Mount esoteric formats like VMDK. NBD can also +be used for testing: You can create giant devices for testing. Inject +errors on demand into your block devices to test error detection and +recovery. Add delays to make disks deliberately slow. I will also +show you how to write block devices using shell scripts, and do +advanced visualization of how the kernel and filesystems use block +devices. +\end{abstract} + + +\section{Network Block Device} + +\textit{In the talk there will be an introduction to and history of + Network Block Device. I'm not reproducing that here since you can + read about the history in articles such as + \url{https://www.linuxjournal.com/article/3778}. There will also be + a short introduction to nbdkit, our pluggable, scriptable NBD + server. For now, see \url{https://github.com/libguestfs/nbdkit}. } + + +\section{Loopback mounts -- simple but very limited} + +Loopback mounting a file is simple: + +\begin{verbatim} +# truncate -s 10M /tmp/test.img +# mke2fs -t ext2 /tmp/test.img +# losetup -f /tmp/test.img +# blockdev --getsize64 /dev/loop0 +10485760 +# mount /dev/loop0 /mnt +\end{verbatim} + +But this talk is about all the things you \textit{cannot} do with a +loopback mount. What if the file you want to mount is compressed? +What if you want to concatenate several files? What if you want to +use another type of storage instead of a file? + +You can't do those things with a loopback mount, but there is now an +alternative: A loopback Network Block Device, backed by our pluggable, +scriptable \textbf{nbdkit} server. 
It's just as simple to use as +loopback mounts, but far more flexible. + + +\section{Preparation} + +If you want to follow these examples on your own machine, you will +need to install the \texttt{nbd-client} package (on Fedora: +\texttt{nbd}), and the \texttt{nbdkit} server. Most examples require +nbdkit $\geq 1.7.3$. + +Linux Network Block Device is in general very reliable, but there were +unfortunately a couple of bugs in the latest released version that is +present in several Linux distributions (but fixed upstream). + +If your Linux distro ships with NBD 3.17, make sure it includes the +following post-3.17 fix for kernel timeouts: +\url{https://github.com/NetworkBlockDevice/nbd/pull/82} + +If your Linux distro uses kernel $<4.17$, then upgrading to +4.17 or above is recommended. + +You may also need to run this command once before you start: + +\begin{verbatim} +# modprobe nbd +\end{verbatim} + + +\section{Mounting xz-compressed disks} + +Loopback mounting a compressed disk will expose a block device +containing the compressed data, which is not very useful. + +nbdkit has a couple of plugins for handling gzip and xz compressed +disks. The xz plugin is quite efficient, allowing read-only random +access to compressed files: + +\begin{verbatim} +# nbdkit xz fedora-26.xz +\end{verbatim} + +We can make a loopback mount called \texttt{/dev/nbd0} using one +command: + +\begin{verbatim} +# nbd-client -b 512 localhost 10809 /dev/nbd0 +\end{verbatim} + +Linux automatically creates block devices for each partition in the +original (Fedora 26) disk image: + +\begin{verbatim} +# ll /dev/nbd0 +nbd0 nbd0p1 nbd0p2 nbd0p3 +# file -bsL /dev/nbd0p3 +SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs) +# mount /dev/nbd0p3 /mnt +mount: /mnt: WARNING: device write-protected, mounted read-only. 
+# cat /mnt/etc/redhat-release +Fedora release 26 (Twenty Six) +\end{verbatim} + +To clean up: + +\begin{verbatim} +# umount /mnt +# nbd-client -d /dev/nbd0 +# killall nbdkit +\end{verbatim} + + +\section{Creating a huge btrfs filesystem in memory} + +nbdkit is not limited to serving files or even to the limits of disk +space. You can create enormous filesystems in memory: + +\begin{verbatim} +# nbdkit memory size=$(( 2**63 - 1 )) +# nbd-client -b 512 localhost 10809 /dev/nbd0 +\end{verbatim} + +How big is this? $2^{63}-1$ is about 8.5~billion gigabytes. If you +were to buy that amount of disk at retail it would cost you +\textbf{\euro~300~million}\footnote{September 2018 prices, WD Red SATA + drives bought on Amazon.fr}. + +We can partition and create a filesystem just like any other device: + +\begin{verbatim} +# gdisk /dev/nbd0 +Number Start (sector) End (sector) Size Code Name + 1 1024 9007199254740973 8.0 EiB 8300 Linux filesystem +# mkfs.btrfs -K /dev/nbd0p1 +# mount /dev/nbd0p1 /mnt +# df -h /mnt +Filesystem Size Used Avail Use% Mounted on +/dev/nbd0p1 8.0E 17M 8.0E 1% /mnt +\end{verbatim} + +When you unmount the NBD partition and kill nbdkit, the device is +gone, making this very useful for testing filesystems. + + +\section{Concatenating files into a partitioned disk} + +\textit{In the talk this section will talk about creating a virtual + disk with a virtual partition table using the nbdkit + ``partitioning'' plugin.} + + +\section{Mounting a VMware VMDK file} + +\textit{In the talk this section will talk about modifying VMware VMDK + files using the nbdkit ``vddk'' plugin.} + + +\section{Testing a RAID array} + +Let's make a RAID array using in-memory block devices. But to test +them we'll want a way to inject errors into those block devices. 
+nbdkit makes this easy with its \textit{error filter}: + +\begin{verbatim} +# nbdkit --filter=error memory size=1G \ + error-file=/tmp/error0 error-rate=1 -p 10810 +# nbdkit --filter=error memory size=1G \ + error-file=/tmp/error1 error-rate=1 -p 10811 +# nbdkit --filter=error memory size=1G \ + error-file=/tmp/error2 error-rate=1 -p 10812 +# nbdkit --filter=error memory size=1G \ + error-file=/tmp/error3 error-rate=1 -p 10813 +# nbdkit --filter=error memory size=1G \ + error-file=/tmp/error4 error-rate=1 -p 10814 +# nbdkit --filter=error memory size=1G \ + error-file=/tmp/error5 error-rate=1 -p 10815 +\end{verbatim} + +We can create 6 NBD devices from these: + +\begin{verbatim} +# nbd-client localhost 10810 /dev/nbd0 +# nbd-client localhost 10811 /dev/nbd1 +# nbd-client localhost 10812 /dev/nbd2 +# nbd-client localhost 10813 /dev/nbd3 +# nbd-client localhost 10814 /dev/nbd4 +# nbd-client localhost 10815 /dev/nbd5 +\end{verbatim} + +And we can create a RAID 5 device on top: + +\begin{verbatim} +# mdadm -C /dev/md0 --level=5 \ + --raid-devices=5 --spare-devices=1 \ + /dev/nbd{0,1,2,3,4,5} +mdadm: Defaulting to version 1.2 metadata +mdadm: array /dev/md0 started. +# mkfs -t ext4 /dev/md0 +# mount /dev/md0 /mnt +\end{verbatim} + +You can see we have 5 drives and 1 spare in the array: + +\begin{verbatim} +# cat /proc/mdstat +Personalities : [raid6] [raid5] [raid4] +md0 : active raid5 nbd4[6] nbd5[5](S) nbd3[3] nbd2[2] nbd1[1] nbd0[0] + 4186112 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU] +\end{verbatim} + +nbdkit's error filter is triggered by the presence of the error files +\texttt{/tmp/error*}. By creating these files we can inject errors +into specific devices and see how the RAID array responds. 
+ +Firstly I inject errors into \texttt{/dev/nbd0}: + +\begin{verbatim} +# touch /tmp/error0 +\end{verbatim} + +After a while the kernel notices: + +\begin{verbatim} +[10804.798999] print_req_error: I/O error, dev nbd0, sector 100360 +[10804.868378] md: recovery of RAID array md0 +[10805.202631] md/raid:md0: read error corrected (8 sectors at 69928 on nbd0) +[10810.349550] md: md0: recovery done. +\end{verbatim} + +Comparing \texttt{/proc/mdstat} before and after: + +\begin{verbatim} +-md0 : active raid5 nbd4[6] nbd5[5](S) nbd3[3] nbd2[2] nbd1[1] nbd0[0] ++md0 : active raid5 nbd4[6] nbd5[5] nbd3[3] nbd2[2] nbd1[1] nbd0[0](F) +\end{verbatim} + +shows that the spare drive is now in use and nbd0 is marked as Failed. + +I can inject errors into a second drive: + +\begin{verbatim} +# touch /tmp/error1 +[11039.428009] block nbd1: Other side returned error (5) +[11039.431659] print_req_error: I/O error, dev nbd1, sector 231424 +[11039.448757] block nbd1: Other side returned error (5) +[11039.452367] print_req_error: I/O error, dev nbd1, sector 233280 +[11084.767968] md/raid:md0: Disk failure on nbd1, disabling device. + md/raid:md0: Operation continuing on 4 devices. +\end{verbatim} + +and now the array is operating in a degraded state. At the filesystem +level everything is still fine. + + +\section{Writing a Linux block device in shell script} + +\textit{nbdkit allows you to write plugins in various programming + languages, including shell script. In the talk I will demonstrate a + Linux block device being written as a shell script.} + + +\section{Logging and visualization} + +\textit{I am planning some visualization tools that will let you see + exactly how a block device is being read and written during common + operations like filesystem creation, file allocation, fstrim, and so + on. The talk will end with a demonstration of these tools.} + + +\end{document} -- 1.8.3.1