1. History and architecture
---------------------------

I want to explain first of all where libguestfs came from and how the
architecture works.

In about 2008 it was very clear that Red Hat had a lot of problems
reading and modifying disk images from virtual machines.  For some
disk images, you could use tools like 'losetup' and 'kpartx' to mount
them on the host kernel.  But that doesn't work in many cases, for
example:

 - The disk image is not "raw format" (eg. qcow2).

 - The disk image uses LVM (because names and UUIDs can conflict with
   LVM names/UUIDs used by the host or other VMs).

That approach also requires root privileges, which means any program
that wanted to read a disk image would need to run as root.  It's
also insecure, since malformed disk images can exploit bugs in the
host kernel to gain root on the host (this *cannot* be protected
against using UIDs or SELinux).

1.1 Architecture of libguestfs
------------------------------

Libguestfs is the solution to the above problems.  Let's see how
libguestfs works.

[SLIDE 1]

You'll be familiar with an ordinary Linux virtual machine.  The Linux
VM runs inside a host process called "qemu".  The Linux guest has a
kernel and userspace, and the qemu process translates requests from
the guest into accesses to a host disk image.  The host disk image
could be stored in an ordinary file, or in a host logical volume, and
it could be stored in several formats like raw, qcow2, VMDK and so on.

[SLIDE 2]

That's an ordinary Linux VM.  Libguestfs uses the same technique, but
with a special VM that we call the "libguestfs appliance", or just
the "appliance".  The appliance is a cut-down, much smaller Linux
operating system, running inside qemu.  It has userspace tools like
"lvm", "parted" and so on.  But it's also special because it only
runs a single process, called "guestfsd" (the guestfs daemon).  It
uses qemu to access the disk image, in exactly the same way as an
ordinary VM.

What creates the appliance, and who controls it?

[SLIDE 3]

Libguestfs is also a C library ("/usr/lib64/libguestfs.so.0").  It is
this library that creates the appliance -- just by running qemu.  The
C library also talks to the guestfs daemon over a simple command
channel, and it sends commands to it.  Commands are things like:

 - Return a list of all the partitions ("part_list").

 - Create a new filesystem ("mkfs").

 - Write this data into a file ("write").

1.2 libguestfs approach vs others
---------------------------------

Some advantages of this approach:

 - We can support every qemu feature:
   qcow2 / ceph remote access / iscsi / NBD / compressed / sparse ...

 - We can support every filesystem that the Linux kernel supports:
   ext4 / btrfs / xfs / NTFS / ...

 - We're using the same drivers as Linux (eg. ext4.ko), so all the
   filesystem features work.

 - LVM etc. "just works".

 - It doesn't need root (because you can run qemu on the host as any
   user).

 - It's secure (non-root, sVirt, libvirt containerization).

Disadvantages:

 - Architecturally complex.

 - Slower than direct mounting.

The main job of libguestfs is to:

 - Hide the complexity of the appliance.

 - Make it simple to use, fast, and reliable.

 - Offer a stable API to programs.

 - Offer useful tools on top for everyday tasks.

1.3 Example
-----------

As an example of how this works:

(1) A program linked to libguestfs calls "guestfs_part_list" (an API).

(2) The library sends the "part_list" command.

(3) The command is serialized and sent as a message from the library
    to the guestfs daemon.

(4) The daemon runs "parted -m -s -- /dev/sda unit b print" (this is
    happening inside the appliance).

(5) qemu does a lot of complicated translations - especially if the
    disk image uses qcow2.  That happens "by magic"; we don't see it
    or have to worry about it.

(6) "parted" prints out a list of partitions, which the daemon parses
    and serializes into a reply message.

(7) The reply message is sent back to the library, which unpacks the
    data and passes it back to the caller.
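For reference, here is a minimal sketch of the caller's side of this
sequence, written against the libguestfs C API.  Error handling is
kept to a minimum and the image name is just an example; the
guestfish session below does the same thing interactively.

  #include <stdio.h>
  #include <stdlib.h>
  #include <inttypes.h>
  #include <guestfs.h>

  int
  main (void)
  {
    /* Create a libguestfs handle. */
    guestfs_h *g = guestfs_create ();
    if (g == NULL)
      exit (EXIT_FAILURE);

    /* Tell libguestfs about the disk image (cf. "add" in guestfish). */
    if (guestfs_add_drive_opts (g, "centos-6.img",
                                GUESTFS_ADD_DRIVE_OPTS_READONLY, 1,
                                -1) == -1)
      exit (EXIT_FAILURE);

    /* Boot the appliance (cf. "run" in guestfish). */
    if (guestfs_launch (g) == -1)
      exit (EXIT_FAILURE);

    /* This is the call that sends "part_list" to the daemon. */
    struct guestfs_partition_list *parts =
      guestfs_part_list (g, "/dev/sda");
    if (parts == NULL)
      exit (EXIT_FAILURE);

    for (uint32_t i = 0; i < parts->len; ++i)
      printf ("part_num: %" PRIi32 "  start: %" PRIu64
              "  size: %" PRIu64 "\n",
              parts->val[i].part_num,
              parts->val[i].part_start,
              parts->val[i].part_size);

    guestfs_free_partition_list (parts);
    guestfs_close (g);
    return 0;
  }

Compile it with something like "cc example.c -o example -lguestfs".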
You can try this for yourself.  "guestfish" is a C program that links
to the libguestfs.so.0 library.  It is a very thin wrapper over the
libguestfs C API -- all it does really is parse commands and print
out replies.

  $ virt-builder centos-6
  $ guestfish

  Welcome to guestfish, the guest filesystem shell for
  editing virtual machine filesystems and disk images.

  Type: 'help' for help on commands
        'man' to read the manual
        'quit' to quit the shell

  ><fs> add centos-6.img readonly:true
  ><fs> run
  ><fs> part-list /dev/sda
  [0] = {
    part_num: 1
    part_start: 1048576
    part_end: 537919487
    part_size: 536870912
  }
  [1] = {
    part_num: 2
    part_start: 537919488
    part_end: 1611661311
    part_size: 1073741824
  }
  [2] = {
    part_num: 3
    part_start: 1611661312
    part_end: 6442450943
    part_size: 4830789632
  }
  ><fs> exit

"add" [C API: guestfs_add_drive_opts] tells libguestfs how to
construct the qemu command.  It roughly translates into:

  qemu -drive file=centos-6.img,snapshot=on

"run" [C API: guestfs_launch] is what runs the qemu command, creating
the appliance.  It also sets up the message channel between the
library and the guestfs daemon.

"part-list" [C API: guestfs_part_list] translates directly into a
message sent to the guestfs daemon.  Not all commands work like this:
some are further translated by the library, and may result in many
messages being sent to the daemon, or none at all.

1.3.1 Debugging
---------------

guestfish gives you a way to see the lower levels at work.  Just add
the "-v -x" flags ("guestfish -v -x").  "-x" traces all libguestfs
API calls.  "-v" prints out all debug output from the library and the
appliance, which includes appliance kernel messages.

Almost all commands take the "-v -x" flags (except virt-win-reg, for
obscure historical reasons).

2. More about the appliance
---------------------------

2.1 Running the appliance: direct vs libvirt
--------------------------------------------

In RHEL we try to stop people running qemu directly, and point them
towards libvirt for managing virtual machines.  Libguestfs has the
same concern: should it run the qemu command directly, or should it
use libvirt to run the qemu command?

There are pros and cons:

 - Running qemu directly gives us the most flexibility, eg. if we
   need to use a new qemu feature which libvirt doesn't support.

 - Libvirt implements extra security: SELinux (sVirt), a separate
   'qemu' UID, cgroups.

 - Libvirt is a big component with many complicated moving parts,
   which makes using libvirt less reliable.

Over time, we have added all the features we need to libvirt.  In
fact, using libvirt we can now access *more* qemu features than by
running qemu directly.  However there are still reliability issues
with libvirt.

RHEL 6: always used the 'direct' method (running qemu directly).

RHEL 7: defaults to the 'libvirt' method, but provides a fallback in
case users have reliability problems:

  export LIBGUESTFS_BACKEND=direct
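The backend can also be selected per handle from a program, rather
than through the environment variable.  A minimal sketch using the C
API calls guestfs_set_backend / guestfs_get_backend (available in
reasonably recent libguestfs):

  #include <stdio.h>
  #include <stdlib.h>
  #include <guestfs.h>

  int
  main (void)
  {
    guestfs_h *g = guestfs_create ();
    if (g == NULL)
      exit (EXIT_FAILURE);

    /* Force the 'direct' backend for this handle, roughly equivalent
     * to exporting LIBGUESTFS_BACKEND=direct before running the
     * program.
     */
    if (guestfs_set_backend (g, "direct") == -1)
      exit (EXIT_FAILURE);

    /* Print the backend this handle will use. */
    char *backend = guestfs_get_backend (g);
    if (backend) {
      printf ("backend = %s\n", backend);
      free (backend);
    }

    guestfs_close (g);
    return 0;
  }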
2.2 SELinux / sVirt
-------------------

[SLIDE 4]

In the ordinary case where you are hosting many virtual machines on a
single physical machine, libvirt runs all those virtual machines as
the same non-root user ("qemu:qemu").

Unfortunately this means that if one VM is exploited because of some
bug in qemu, it could then go on to exploit other VMs on the same
host.  This is because there is no host protection between different
processes running as the same user.

[SLIDE 5]

sVirt prevents this using SELinux.  What it does is give each VM a
different SELinux label.  It labels every resource that a VM needs
(like all its disk images) with that SELinux label.  And it adds
SELinux policies that prevent one VM from accessing another VM's
differently-labelled resources.

Libguestfs (when using the libvirt backend) uses the same mechanism.
It prevents an exploit of one disk image from possibly escalating to
other disk images, and is important for use cases like RHEV and
OpenStack where a single host user (eg. "vdsm") is using many
libguestfs handles at the same time.

2.3 Creating the appliance: supermin
------------------------------------

I didn't talk about how the appliance is built.  It's a small
Linux-based OS, but how do we make it?  Is it RHEL?  Is it Fedora?
(The answer: sort of, but not really.)

We have several constraints when building the appliance, which may
not be obvious:

 - Cannot compile our own kernel.  It wouldn't be supported by RHEL.

 - Cannot distribute a huge, binary-only blob.  It would be too large
   to download, and static linking is generally forbidden in Fedora,
   RHEL, and most other Linux distros.

 - Want to get bug/security fixes from the distro automatically.

[SLIDE 6]

The idea is that we build the appliance from the host distro.  If the
host distro is RHEL 7, then we copy the programs we need from RHEL to
make the appliance.  All of the programs and libraries ("parted",
"lvm", "libc.so.6") and the kernel and kernel modules get copied to
make the appliance.  If a program on the host gets updated (eg. to
fix a bug), we copy in the new program the next time libguestfs runs.

The appliance is created on the end user's machine, at run time.
That's why libguestfs takes longer to run the first time you run it,
or just after you've done a "yum" command (since it rebuilds the
appliance if there are upgraded binaries).

This is quite complex, but it is controlled by a command line program
called "supermin" ("supermin5" on RHEL 7), which you can try out:

  $ supermin --build /usr/lib64/guestfs/supermin.d \
      -o /tmp/appliance.d --format ext2
  supermin: open: /usr/bin/chfn: Permission denied *
  supermin: open: /usr/bin/chsh: Permission denied

  $ ls -lh /tmp/appliance.d/
  1.2M initrd
    35 kernel -> /boot/vmlinuz-4.1.6-200.fc22.x86_64
  4.0G root

"root" is the appliance (root disk).

* The "Permission denied" errors are harmless in this case.  We are
trying to get this changed in Fedora
[https://fedorahosted.org/fpc/ticket/467].

2.4 libguestfs-winsupport
-------------------------

In RHEL 7.2, you have to install an additional package called
"libguestfs-winsupport" to enable NTFS (the Windows filesystem)
support.  This relies on an upstream project called ntfs-3g which has
reverse-engineered the NTFS internals.  We don't ship ntfs-3g in
RHEL, so there are no ntfs-3g programs that can be copied from the
host.  How does it work?
  $ rpm -ql libguestfs-winsupport
  /usr/lib64/guestfs/supermin.d/zz-winsupport.tar.gz

  $ zcat /usr/lib64/guestfs/supermin.d/zz-winsupport.tar.gz | tar tf -
  ./
  ./usr/
  ./usr/lib64/
  ./usr/lib64/libntfs-3g.so
  ./usr/lib64/libntfs-3g.so.86
  ./usr/lib64/libntfs-3g.so.86.0.0
  ./usr/bin/
  ./usr/bin/ntfsck
  ./usr/bin/ntfscat
  [etc]

As well as copying files from the host, supermin can also unpack a
tarball into the appliance.  In the case of libguestfs-winsupport, we
provide a tarball containing the ntfs-3g distribution (the ntfs-3g
source is supplied in the libguestfs-winsupport source RPM).

We only want to support customers using this for v2v and a few other
virt operations (like virt-win-reg), so there are checks in
libguestfs to stop it from being used for general filesystem access.

3. Some virt tools
------------------

Libguestfs is a C library with a C API, and guestfish is quite a
low-level tool which basically offers direct access to the C API.  To
make things easier for end users, we built some higher-level virt
tools for particular tasks.  These tools link to libguestfs, and some
of them also use other libraries (libXML, libvirt directly,
"qemu-img" directly, Glance, etc.)

There are about a dozen virt tools provided by libguestfs.  Notable
tools include:

 - virt-edit: edit a single file inside a VM.

 - virt-inspector: inspect a disk image to find out if it contains an
   operating system [see below].

 - virt-builder: make a new VM.

 - virt-resize: resize a VM.

 - virt-v2v: convert a VM from VMware/Xen to run on KVM.

The virt commands use the libguestfs APIs, but often in ways that
would be hard or complicated for end users to do directly.  For
example, virt-resize does a lot of calculations to work out how the
resized partitions should be laid out, and those calculations are too
hard for most people to do by hand.

3.1 Inspection
--------------

Quite a fundamental libguestfs API operation is called "inspection".
Many of the virt tools start with inspection: eg. virt-edit,
virt-v2v.

The basic idea is that we have a disk image (eg. a qcow2 file).  The
disk image comes from a virtual machine, but we don't know what
operating system is installed inside the disk image.

Inspection lets you look at any disk image, and will try to find any
operating system(s) installed on there, and tell you interesting
things, such as:

 - The OS type, version, architecture (eg. "windows", 6.1, "x86_64").

 - The Linux distro (eg. "centos").

 - What applications are installed.

 - Windows drive letter mappings (eg. "C:" => "/dev/sda2").

Inspection is quite fundamental for v2v, because the operations we
have to perform on a guest depend on what the guest is.  If it's
Windows, we have to do a completely different set of things than for
a RHEL guest.

There is also a specific "virt-inspector" tool which just does
inspection and then presents the results as XML:

  $ virt-inspector -a /tmp/centos-6.img
  <?xml version="1.0"?>
  <operatingsystems>
    <operatingsystem>
      <name>linux</name>
      <arch>x86_64</arch>
      <distro>centos</distro>
      <product_name>CentOS release 6.6 (Final)</product_name>
      <major_version>6</major_version>
      <minor_version>6</minor_version>
      <mountpoints>
        <mountpoint dev="...">/</mountpoint>
        <mountpoint dev="...">/boot</mountpoint>
      </mountpoints>
      <applications>
        <application>
          <name>ConsoleKit</name>
          <version>0.4.1</version>
          <release>3.el6</release>
          <arch>x86_64</arch>
        </application>
        <application>
          <name>ConsoleKit-libs</name>
          <version>0.4.1</version>
          <release>3.el6</release>
          <arch>x86_64</arch>
        </application>
        etc.

3.1.1 How does inspection work?
-------------------------------

Inspection is basically a large number of heuristics.  For example:

 - If the filesystem contains a file called "/etc/centos-release",
   then set the Linux distro to "centos".

 - If the filesystem contains a binary called "/bin/bash", then look
   at the ELF header of that binary to find the OS architecture.

(But way more complicated, and handling Windows too.)
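Programs get at the same inspection data through the C API.  A
minimal sketch (error checking of the individual inspection calls is
omitted for brevity, and the image path is just an example):

  #include <stdio.h>
  #include <stdlib.h>
  #include <guestfs.h>

  int
  main (void)
  {
    guestfs_h *g = guestfs_create ();
    if (g == NULL)
      exit (EXIT_FAILURE);

    if (guestfs_add_drive_opts (g, "/tmp/centos-6.img",
                                GUESTFS_ADD_DRIVE_OPTS_READONLY, 1,
                                -1) == -1 ||
        guestfs_launch (g) == -1)
      exit (EXIT_FAILURE);

    /* Returns a list of root devices, one per operating system found. */
    char **roots = guestfs_inspect_os (g);
    if (roots == NULL)
      exit (EXIT_FAILURE);

    for (size_t i = 0; roots[i] != NULL; ++i) {
      char *type = guestfs_inspect_get_type (g, roots[i]);     /* eg. "linux" */
      char *distro = guestfs_inspect_get_distro (g, roots[i]); /* eg. "centos" */
      char *arch = guestfs_inspect_get_arch (g, roots[i]);     /* eg. "x86_64" */
      int major = guestfs_inspect_get_major_version (g, roots[i]);

      printf ("%s: %s %s %d (%s)\n", roots[i], type, distro, major, arch);

      free (type);
      free (distro);
      free (arch);
      free (roots[i]);
    }
    free (roots);

    guestfs_close (g);
    return 0;
  }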
If you want the real details of how inspection works, I suggest
running virt-inspector with the -x option:

  $ virt-inspector -x -a /tmp/centos-6.img |& less

4. Misc topics (if we have time)
--------------------------------

4.1 remote images (ceph, NBD etc)
---------------------------------

Qemu has multiple "block drivers".  Some of those are for using
different file formats like qcow2.  Others enable remote disks to be
accessed.  Because libguestfs uses qemu, we get this (almost) for
free.  To access a remote resource you can use commands like:

  guestfish -a nbd://example.com
  guestfish -a rbd://example.com/pool/disk

In RHEL, many drivers are intentionally disabled.  Also, the drivers
which are enabled are not particularly well tested at the moment.

4.2 parameters, environment variables
-------------------------------------

There are a lot of parameters which control libguestfs.  Some of
these are exposed as environment variables, allowing them to be set
easily outside programs.  Examples: LIBGUESTFS_BACKEND (the "backend"
setting), LIBGUESTFS_DEBUG / LIBGUESTFS_TRACE (enable debugging /
tracing).

Documentation on environment variables:
http://libguestfs.org/guestfs.3.html#environment-variables

A question was asked about the "program" setting.  When the C library
creates a handle, it saves the name of the current program.  You can
also read or change this setting:

  $ guestfish
  ><fs> get-program
  guestfish
  ><fs> set-program virt-foo
  ><fs> get-program
  virt-foo

In upstream libguestfs, this setting has no use.  In RHEL we use it
to enforce supportability requirements.

4.3 new features in libguestfs
------------------------------

Libguestfs upstream is mostly stable, but I am hoping to get a new
test framework upstream in the 1.32 cycle (the next 3-4 months).

https://www.redhat.com/archives/libguestfs/2015-August/msg00022.html

4.4 copy_(device|file)_to_(device|file)
---------------------------------------

There is a problem in the API, which is that there is no difference
between "/dev/sda" meaning the disk/device, and "/dev/sda" meaning a
file called "sda" in the "/dev" directory.  So instead of having a
single 'copy' function, we need to tell libguestfs whether you want
to copy between files or devices -- hence the four separate calls
that the heading above expands to.

4.5 virt-customize vs. virt-builder
-----------------------------------

Virt-customize takes an existing disk image containing a guest that
you created somehow (maybe with virt-builder, maybe some other way),
and runs a command such as 'yum install openssh' on it.

Virt-builder copies a template from
http://libguestfs.org/download/builder/ (or other places), expands
it, and then runs the 'yum install' command.

So really they are quite similar, and in fact they use exactly the
same code:

https://github.com/libguestfs/libguestfs/blob/master/customize/customize_run.ml#L96
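Returning to section 4.4 for a moment, here is a minimal sketch of
how the separate copy calls look from the C API.  It assumes 'g' is a
handle that has already been launched with a filesystem mounted, and
the device and file names are only examples:

  #include <guestfs.h>

  /* Copy a whole block device into a file inside the guest.  The
   * first argument is treated as a device because of the call used.
   */
  static int
  backup_partition (guestfs_h *g)
  {
    return guestfs_copy_device_to_file (g, "/dev/sda1", "/backup.img", -1);
  }

  /* Copy between two ordinary files inside the guest filesystem. */
  static int
  copy_config (guestfs_h *g)
  {
    return guestfs_copy_file_to_file (g, "/etc/hosts", "/etc/hosts.bak", -1);
  }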