2023-hardware/notes.txt

   1 ## Agenda
   2
   3 I've prepared some information about:
   4
   5  * current RISC-V hardware
   6  * future RISC-V hardware
   7  * performance numbers
   8  * notes on QEMU
   9
  10
  11 ## Koji
  12
  13 http://fedora.riscv.rocks/koji/
  14 https://koji.fedoraproject.org/koji/
  15
  16 Primary architecture or koji-shadow?
  17
  18
  19 ## Current RISC-V hardware
  20
  21 * Milk-V Pioneer (SG2042)
  22 * Sophgo 2U server (SG2042)
  23
  24   - Possible performance and quality issues
  25   - Expensive
  26   - Likely to end up as e-waste soon
  27
  28 * SiFive Unmatched
  29
  30   - Slow
  31   - Doesn't support any recent extensions, esp. V, H
  32   - No BMC, but could use a BMC-on-PCIe card
  33
  34 * Single board computers (SBCs)
  35
  36   - VisionFive2 (StarFive JH7xxx), Lichee Pi 4A, ...
  37   - Slow
  38   - Limited memory, cores
  39   - Reliability issues
  40   - Upfront & ongoing engineering headaches integrating into a 19" rack
  41
  42 * QEMU
  43
  44   - Performance is not great (numbers below)
  45   - Actually runs on x86-64 hardware
  46   - Hardware (x86-64) is well known and easy to manage
  47   - No "e-waste", servers can be repurposed
  48   - Supports all the latest extensions
  49   - Supports any amount of RAM, large numbers of vCPUs
  50   - We can add new extensions and fix bugs relatively easily
  51
  52
  53
  54 ## Future RISC-V hardware
  55
  56 * Sophgo SG2380
  57
  58   - 16 x SiFive P670
  59   - announced yesterday
  60
  61 * SiFive P870
  62
  63   - announced in August 2023
  64
  65 * StarFive JH8100
  66
  67   - TSMC 12nm
  68   - H extension
  69   - power and efficiency versions, but they are not fully compatible
  70   - 8 lanes of gen3 PCIe
  71   - 4 USB 3.2 gen2
  72   - miniITX development board
  73
  74 * Ventana
  75
  76 * Rivos
  77
  78
  79 ## Performance numbers
  80
  81
  82                binutils     openssl      python3.12   mingw-gcc
  83
  84 i686           1589         1577         3411         4292
  85
  86 x86-64         1419         1172         2462         1827
  87
  88 aarch64        1573          811 (?)     1845         2521
  89
  90 ppc64le        2165         1291         3073         4388
  91
  92 s390x          2553         1380         1984 (?)     6824
  93
  94
  95
  96 qemu-system-riscv64 16 vCPUs, 16 GB
  97 on AMD Ryzen 9 7950X server
  98
  99   (LTO)        4493         3052        14502        12428
 100                  +217%        +160%       +489%        +580%
 101
 102   (no LTO)     3267         1351         6353        (failed)
 103                  +130%        +15%        +158%
 104
 105 qemu-system-riscv64 32 vCPUs, 16 GB
 106 on AMD Genoa-X server
 107
 108   (LTO)        6841         3182
 109
 110   (no LTO)     5115         1882         (failed)
 111
 112
 113 VisionFive 2
 114   (LTO)        7202         8823                     (crashed in LTO step)
 115                  +408%        +653%
 116
 117   (no LTO)     3274         2059         11627
 118                  +131%        +76%
 119
 120 koji.riscv.rocks (mainly HiFive Unmatched)
 121
 122   (LTO)       12743        15098        26058       67947
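
The percentage rows under the first QEMU block appear to be the slowdown relative to the x86-64 row, rounded to the nearest whole percent. A minimal Python sketch of that arithmetic, using figures copied from the table. The choice of baseline and the rounding are inferred from the numbers, not stated in the notes, and the dictionary and function names are made up for illustration:

```python
# Figures copied from the table above (units as in the notes).
X86_64 = {"binutils": 1419, "openssl": 1172, "python3.12": 2462, "mingw-gcc": 1827}

# qemu-system-riscv64, 16 vCPUs, on the AMD Ryzen 9 7950X server.
QEMU_LTO = {"binutils": 4493, "openssl": 3052, "python3.12": 14502, "mingw-gcc": 12428}
QEMU_NO_LTO = {"binutils": 3267, "openssl": 1351, "python3.12": 6353}  # mingw-gcc failed


def slowdown_percent(value: int, baseline: int) -> int:
    """Increase of value over baseline, as a whole-number percentage."""
    return round((value - baseline) / baseline * 100)


def slowdowns(row: dict) -> dict:
    """Per-package slowdown of a row, assuming x86-64 is the baseline."""
    return {pkg: slowdown_percent(v, X86_64[pkg]) for pkg, v in row.items()}


for label, row in (("LTO", QEMU_LTO), ("no LTO", QEMU_NO_LTO)):
    cells = ", ".join(f"{pkg} +{pct}%" for pkg, pct in slowdowns(row).items())
    print(f"({label}) {cells}")
```

Under these assumptions the output matches the two percentage rows for the 7950X runs.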
 123
 124
 125
 126 ## Single thread performance
 127
 128 qemu-system-riscv64   912
 129 on AMD Genoa-X
 130
 131 HiFive Unmatched      616
 132
 133 qemu-system-riscv64   598
 134 on AMD 7950X
 135
 136 VisionFive 2          425
 137
 138 Koji/ppc64le          144
 139
 140 Koji/i686             105
 141
 142 Koji/x86-64           100
 143
 144 Koji/aarch64           89
 145
 146 Koji/s390x             65
 147
 148 AMD Genoa-X (x86-64)   36
 149
 150 AMD 7950X (x86-64)     35
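
One way to read the list above is to compare each QEMU figure with the native figure for the same host. A small sketch, assuming that both rows come from the same single-thread benchmark and that lower is better (the notes do not state the units); the names are illustrative only:

```python
# Single-thread figures copied from the list above.
NATIVE = {"AMD Genoa-X": 36, "AMD 7950X": 35}
QEMU_RISCV64 = {"AMD Genoa-X": 912, "AMD 7950X": 598}


def emulation_factor(host: str) -> float:
    """How many times larger the emulated figure is than the native one."""
    return QEMU_RISCV64[host] / NATIVE[host]


for host in NATIVE:
    print(f"{host}: emulated figure is about {emulation_factor(host):.1f}x the native figure")
```

Under that reading, TCG emulation costs roughly 25x on the Genoa-X host and roughly 17x on the 7950X host for this workload.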
 151
 152
 153
 154 ## QEMU observations
 155
 156 "TCG" is the name for QEMU's software emulation, e.g. a fully
 157 emulated RISC-V guest on an x86-64 host.
 158
 159 TCG translates basic blocks of guest code into Translation Blocks
 160 (TBs) of host code.
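
To make the idea concrete, here is a toy Python model of a translation cache: each guest block is translated once, keyed by its start address, and reused on later executions. This is only an illustration of the principle. The guest "program" format, the class and all names are invented here and bear no relation to QEMU's real data structures or code.

```python
from typing import Callable, Dict, List, Tuple

# Toy guest program: block start address -> (list of increments, next address).
# This is an invented format for illustration, not RISC-V.
GuestCode = Dict[int, Tuple[List[int], int]]
# A "translated block" maps the accumulator to (new accumulator, next pc).
TranslatedBlock = Callable[[int], Tuple[int, int]]

HALT = -1  # next address meaning "stop"


class ToyTranslator:
    def __init__(self, guest_code: GuestCode) -> None:
        self.guest_code = guest_code
        self.tb_cache: Dict[int, TranslatedBlock] = {}
        self.translations = 0  # number of times a block had to be translated

    def translate(self, pc: int) -> TranslatedBlock:
        increments, next_pc = self.guest_code[pc]
        total = sum(increments)  # work done once, at translation time
        self.translations += 1

        def tb(acc: int) -> Tuple[int, int]:
            return acc + total, next_pc

        return tb

    def run(self, pc: int, acc: int = 0) -> int:
        while pc != HALT:
            tb = self.tb_cache.get(pc)
            if tb is None:  # cache miss: translate and remember the block
                tb = self.translate(pc)
                self.tb_cache[pc] = tb
            acc, pc = tb(acc)
        return acc


program: GuestCode = {0: ([1, 2], 4), 4: ([3], HALT)}
toy = ToyTranslator(program)
first = toy.run(0)
second = toy.run(0)  # reuses both cached blocks, no retranslation
print(first, second, toy.translations)  # prints "6 6 2"
```

The second run does no translation at all, which is the effect a long-running VM benefits from.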
 161
 162 Well understood (by me), easy to fix simpler issues.
 163
 164  - I posted a patch yesterday which gives a ~6% performance gain
 165
 166 A few tips to make TCG run (a bit) fast(er):
 167
 168  - Compile QEMU with -march=native (+4%)
 169
 170  - Profile with perf
 171
 172  - Don't overprovision host CPUs
 173
 174     * However pinning vCPUs to pCPUs didn't really help
 175
 176  - Give it plenty of guest & host RAM
 177
 178     * Measured memory overhead on host is up to 40% after running
 179       for some time
 180
 181     * Host TBs track guest page cache; as long as a translated
 182       executable remains in the guest page cache, it will not be
 183       retranslated
 184
 185  - Don't restart the VM
 186
 187  - Software TLB
 188
 189  - Fast vs slow jumps
 190
 191  - Performance scales almost linearly with the number of vCPUs
 192
 193     * Going from a 4 vCPU to a 16 vCPU virtual machine sped up the
 194       parallelized build inside the guest by 3.3x (ideal: 4x), almost linear
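
The vCPU scaling result can be expressed as parallel efficiency: the measured speedup divided by the ideal speedup for the added vCPUs. A minimal sketch using the 4 vCPU and 16 vCPU figures from the last bullet; the function name is illustrative:

```python
def parallel_efficiency(speedup: float, vcpus_before: int, vcpus_after: int) -> float:
    """Measured speedup as a fraction of the ideal (linear) speedup."""
    ideal = vcpus_after / vcpus_before
    return speedup / ideal


eff = parallel_efficiency(3.3, vcpus_before=4, vcpus_after=16)
print(f"ideal 4.0x, measured 3.3x, efficiency {eff:.1%}")
```

An efficiency of roughly 82% is what "almost linear" amounts to here.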