README

virt-similarity: Find clusters of similar/cloned virtual machines
Copyright (C) 2013 Red Hat Inc.
======================================================================

Compiling from source
---------------------

If checking out from git, then:
  autoreconf -i

Build it:
  ./configure
  make

Optionally:
  sudo make install

Requirements
------------

- ocaml >= 3.12.0
- ocaml findlib
- libguestfs >= 1.14
- ocaml libguestfs bindings

Developers
----------

The upstream git repo is:

http://git.annexia.org/?p=virt-similarity.git;a=summary

Please send patches to the virt-tools mailing list:

http://www.redhat.com/mailman/listinfo/virt-tools-list

Notes on the technique used
---------------------------

(1) We use libguestfs to open each disk image.  This gives us access
to the raw data even when the disk image is stored in a format such as
qcow2 or vmdk.  The program could also be extended to understand
encrypted disks.

http://libguestfs.org/
http://libguestfs.org/guestfs-ocaml.3.html

(2) We split each disk into 64K blocks and hash each block.  We chose
64K because it is the normal cluster size for qcow2 and the block size
used by qemu-img etc.  Hashing lets us compare disk images for
similarity while holding the complete set of hashes in memory.  Each
64K block is replaced by a 16 byte (MD5) or 32 byte (SHA-256) digest,
a reduction by a factor of 4096 or 2048 respectively, so for example
a 10 GB disk image is reduced to a more manageable 2.5 MB or 5 MB of
hashes.

NB: For speed, the hashes are saved in a cache file called
'similarity-cache' in the local directory.  You can just delete this
file when done.
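
The block hashing in step (2) can be sketched as follows.  This is an
illustrative Python sketch, not the program's own OCaml code: the
names hash_blocks and reduction_factor are hypothetical, it reads from
any file-like object rather than through libguestfs, and zero-padding
a short final block is an assumption made here for simplicity.

```python
import hashlib
import io

BLOCK_SIZE = 64 * 1024  # 64K blocks, as described in step (2)


def hash_blocks(stream, algorithm="md5"):
    """Return a list with one digest per 64K block read from stream."""
    digests = []
    while True:
        block = stream.read(BLOCK_SIZE)
        if not block:
            break
        # Assumption: pad a short final block with zero bytes so that
        # every digest covers exactly BLOCK_SIZE bytes.
        block = block.ljust(BLOCK_SIZE, b"\0")
        digests.append(hashlib.new(algorithm, block).digest())
    return digests


def reduction_factor(algorithm="md5"):
    """Block size divided by digest size, e.g. 65536 / 16 for MD5."""
    return BLOCK_SIZE // hashlib.new(algorithm).digest_size


# Example: one full block plus a 100 byte tail is hashed as two blocks.
example = hash_blocks(io.BytesIO(b"A" * BLOCK_SIZE + b"B" * 100))
```

Here reduction_factor("md5") and reduction_factor("sha256") give the
4096 and 2048 figures quoted above.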

(3) We then compare the disk images pairwise, block by block, and
record the difference between each pair of images.
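
The pairwise comparison in step (3) might look like the sketch below.
It is written in Python for illustration only; block_distance and
distance_matrix are hypothetical names, and counting the blocks past
the end of the shorter image as differences is an assumption, since
the text above does not say how images of different sizes are handled.

```python
from itertools import combinations


def block_distance(hashes_a, hashes_b):
    """Count the block positions at which two images differ."""
    common = min(len(hashes_a), len(hashes_b))
    differing = sum(1 for i in range(common) if hashes_a[i] != hashes_b[i])
    # Assumption: blocks beyond the end of the shorter image count as
    # differences.
    return differing + abs(len(hashes_a) - len(hashes_b))


def distance_matrix(images):
    """Map each unordered pair of image names to its block distance.

    images is a dict from image name to its list of block hashes.
    """
    return {
        frozenset((a, b)): block_distance(images[a], images[b])
        for a, b in combinations(sorted(images), 2)
    }
```

Only blocks at the same offset are compared, which matches the kind of
simple block-level difference that this step records.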

Note that we DON'T do advanced Cluster Analysis on the disk images.
There is no point, since the rebasing operation used by qemu-img
can only handle simple differences at the block level; it cannot, for
example, move blocks around or do fuzzy matches.

http://en.wikipedia.org/wiki/Cluster_analysis

(4) We then output a tree (technically a 'Cladogram') showing a
hierarchy of guests, using a simple hierarchical clustering algorithm:
we group the two closest guests, then that group with the next
closest guest, and so forth.

http://en.wikipedia.org/wiki/Cladogram
http://en.wikipedia.org/wiki/Hierarchical_clustering
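
The grouping in step (4) can be sketched as a small agglomerative
clustering routine.  This is a Python illustration under stated
assumptions, not the program's actual OCaml code: build_tree and
leaves are hypothetical names, the tree is represented as nested
2-tuples of guest names, and the distance between two groups is taken
to be the smallest distance between any of their members (single
linkage), which the description above does not specify.

```python
from itertools import combinations


def leaves(tree):
    """Return the guest names contained in a (possibly nested) tree."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def build_tree(names, distance):
    """Repeatedly merge the two closest groups until one tree remains.

    names is a non-empty list of unique guest names (strings), and
    distance(a, b) returns the distance between two guests.
    """
    def group_distance(pair):
        # Assumption: single linkage, i.e. the closest pair of members.
        return min(distance(a, b)
                   for a in leaves(pair[0]) for b in leaves(pair[1]))

    groups = list(names)
    while len(groups) > 1:
        left, right = min(combinations(groups, 2), key=group_distance)
        groups.remove(left)
        groups.remove(right)
        groups.append((left, right))
    return groups[0]
```

The distance argument could be a lookup into the pairwise differences
recorded in step (3).  The innermost tuple of the result holds the two
closest guests, and each enclosing tuple adds the next closest guest
or group.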