X-Git-Url: http://git.annexia.org/?p=virt-mem.git;a=blobdiff_plain;f=HACKING;h=610d9763162274b9309bb21dd12726e4610198b3;hp=2fb11795d6624bbac87ce30157338897d987b236;hb=5465f41dc973c040cc5abd423640b2d4a118a159;hpb=5a960be177bdfbbb1d62f96490e355e3e3e54f12

diff --git a/HACKING b/HACKING
index 2fb1179..610d976 100644
--- a/HACKING
+++ b/HACKING
@@ -34,6 +34,11 @@ extract/
    subdirectories here correspond to the different Linux distributions
    and methods of getting at their kernels.
 
+extract/codegen/
+
+ - Tools to turn the kernel database into generated code which parses
+   the kernel structures.
+
 General structure of lib/virt_mem.ml
 ------------------------------------
 
@@ -45,21 +50,108 @@ which gets successively enhanced with extra data along the way:
 	process, load kernel images
 
 		|
-		|	(passes a 'Virt_mem_types.image0')
+		|
 		V
 
 	Find kernel symbols
 
 		|
-		|	(enhanced into a 'Virt_mem_types.image1')
+		|
 		V
 
 	Find kernel version (uname)
 
 		|
-		|	(enhanced into a 'Virt_mem_types.image2')
+		|
+		V
+
+	Find task_structs, net_devices, etc.
+
+		|
+		|
 		V
 
 	Call tool's "run" function.
 
-Tools can register other callbacks which get called at earlier stages.
\ No newline at end of file
+Tools can register other callbacks which get called at earlier stages.
+
+How it works
+------------
+
+(1) Getting the kernel image
+
+This is pretty easy (on QEMU/KVM anyway): There is a QEMU monitor
+command which reads out memory from the guest, and this is made
+available through the virDomainMemoryPeek call in libvirt.
+
+Kernel images are generally located at a small number of known
+addresses (eg. 0xC010_0000 on x86).
+
+(2) Getting the kernel symbols
+
+The Linux kernel contains two tables of kernel symbols - the usual
+kernel symbols used for exporting symbols to loadable modules, and
+'kallsyms' which is used for error reporting.  (Of course, specific
+Linux kernels may be compiled without one or the other of these tables.)
+
+The functions in modules lib/virt_mem_ksyms.ml and
+lib/virt_mem_kallsyms.ml deal with searching kernel memory for these
+two tables.
+
+(3) Getting the kernel version
+
+The kernel has the kernel version information compiled in at a known
+symbol address, so once we have the kernel symbols it is relatively
+straightforward to get the kernel version.
+
+See lib/virt_mem_utsname.ml.
+
+(4) Process table / memory / network info etc.
+
+At this point we have the kernel symbols and the kernel version (and
+that information is pretty reliable).
+
+If we take the process table as an example, then it consists of a
+list of 'struct task_struct', starting at the symbol 'init_task'
+(which corresponds to the "hidden" PID 0 / swapper task), with each
+entry linked to the next through a doubly-linked list in the 'tasks'
+member of this structure.
+
+We have the location of 'init_task', but struct task_struct varies
+greatly depending on: word size, kernel version, CONFIG_* settings,
+and vendor/additional patches.
+
+The problem is to work out the "shape" of task_struct, and we do this
+in two different ways:
+
+(Method 1) Precompiled task_struct.  We can easily and reliably
+determine the Linux kernel version (see virt-uname).  In theory we
+could compile a list of known kernel versions, check out their sources
+beforehand, and find the absolute layout of the task_struct (eg. using
+CIL).  This method would only work for known kernel versions, but has
+the advantage that all fields in the task_struct would be known.
+
+(Method 2) Fuzzy matched task_struct.  The task_struct has a certain
+internal structure which is stable even over many kernel revisions.
+For example, groups of pointers always occur together.  We search
+through init_task looking for these characteristic features, and where
+a pointer to another task_struct is found we search that one too
+(recursively), on the assumption that it contains the same features at
+the same offsets.  This works well for pointers, but not so well for
+finding other fields (eg. uids, process name, etc).  We can defray the
+cost of searches by caching the results between runs.
+
+Currently we use Method 1, deriving the database of known kernels from
+gdb debugging information generated by Linux distributions when they
+build their kernels, and processing that with 'pahole' (from the
+dwarves library).
+
+We have experimented with Method 2.  Currently work on it is postponed
+to a research project for a keen student at some point in the near
+future.  There are some early implementations of Method 2 if you look
+back over the version control history.
+
+The database of known kernels is stored in the kernels/ subdirectory.
+
+The functions to build the database by extracting debug information
+from Linux distributions are stored in the extract/ subdirectory.
\ No newline at end of file