Virtualization 
==============

Running virtual machines on compute nodes

  

Introduction
------------

There are situations when Anselm's environment is not suitable for user
needs:

-   Application requires a different operating system (e.g. Windows) or
    is not available for Linux
-   Application requires different versions of base system libraries and
    tools
-   Application requires a specific setup (installation, configuration)
    of a complex software stack
-   Application requires privileged access to the operating system
-   ... and combinations of the above cases

We offer a solution for these cases: **virtualization**. Anselm's
environment makes it possible to run virtual machines on compute
nodes. Users can create their own operating system images with a
specific software stack and run instances of these images as virtual
machines on compute nodes. Virtual machines are run through the
standard mechanism of [Resource Allocation and Job
Execution](../../resource-allocation-and-job-execution/introduction.html).

The solution is based on the QEMU-KVM software stack and provides
hardware-assisted x86 virtualization.

Limitations
-----------

Anselm's infrastructure was not designed for virtualization. Compute
nodes, storage and the rest of Anselm's infrastructure are intended and
optimized for running HPC jobs. This implies a suboptimal configuration
of virtualization and a number of limitations.

Anselm's virtualization does not provide the performance or all the
features of the native environment. There is a significant performance
hit in I/O (storage, network), so Anselm's virtualization is not
suitable for I/O-intensive (disk, network) workloads.

Virtualization also has drawbacks of its own: it is not easy to set up
an efficient solution.

The solution described in the
[HOWTO](virtualization.html#howto) chapter
is suitable for single-node tasks; it does not
introduce virtual machine clustering.

Please consider virtualization a last-resort solution for your needs.

Please consult IT4Innovations' support before using virtualization.

To run a Windows application (when neither the source code nor a native
Linux version is available), consider using Wine, the Windows
compatibility layer. Many Windows applications can be run using Wine
with less effort and better performance than when using virtualization.

Licensing
---------

IT4Innovations does not provide any licenses for operating systems and
software of virtual machines. Users are (in accordance with the
[Acceptable use policy
document](http://www.it4i.cz/acceptable-use-policy.pdf))
fully responsible for licensing all software running in virtual machines
on Anselm. Be aware of the complex conditions of licensing software in
virtual environments.

Users are responsible for licensing the OS (e.g. MS Windows) and all
software running in their virtual machines.

HOWTO
-----

### Virtual Machine Job Workflow

We propose this job workflow:

![Workflow](virtualization-job-workflow "Virtualization Job Workflow")

Our recommended solution is that the job script creates a distinct
shared job directory, which serves as a central point for data exchange
between Anselm's environment, the compute node (host) (e.g. HOME,
SCRATCH, local scratch and other local or cluster filesystems) and the
virtual machine (guest). The job script links or copies the input data
and the instructions for the virtual machine (the run script) to the job
directory. The virtual machine processes the input data according to the
instructions in the job directory and stores the output back to the job
directory. We recommend running the virtual machine in so-called
[snapshot mode](virtualization.html#snapshot-mode): the image is
immutable (it does not change), so one image can be used for many
concurrent jobs.

### Procedure

1.  Prepare image of your virtual machine
2.  Optimize image of your virtual machine for Anselm's virtualization
3.  Modify your image for running jobs
4.  Create job script for executing virtual machine
5.  Run jobs

### Prepare image of your virtual machine

You can either use your existing image or create new image from scratch.

QEMU currently supports these image types or formats:

-   raw 
-   cloop 
-   cow 
-   qcow 
-   qcow2 
-   vmdk - VMware 3 & 4, or 6 image format, for exchanging images with
    that product
-   vdi - VirtualBox 1.1 compatible image format, for exchanging images
    with VirtualBox.

You can convert your existing image using the qemu-img convert command.
Formats supported by this command are: blkdebug blkverify bochs cloop
cow dmg file ftp ftps host_cdrom host_device host_floppy http https
nbd parallels qcow qcow2 qed raw sheepdog tftp vdi vhdx vmdk vpc vvfat.

We recommend using qcow2, the advanced native QEMU image format.
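
As a minimal sketch of the conversion step, the helper below wraps
qemu-img convert to produce a qcow2 image. The function name
`convert_to_qcow2`, the `QEMU_IMG` override variable and the file names
are illustrative assumptions, not part of Anselm's setup; only the `-f`
(source format) and `-O` (output format) options of qemu-img convert are
used.

```shell
# Hypothetical helper: convert an existing image to qcow2.
# QEMU_IMG may be overridden (e.g. QEMU_IMG=echo for a dry run).
convert_to_qcow2() {
    src=$1          # source image, e.g. win.vmdk
    dst=$2          # target image, e.g. win.qcow2
    src_fmt=${3:-}  # optional source format (raw, vmdk, vdi, ...)

    if [ -n "$src_fmt" ]; then
        "${QEMU_IMG:-qemu-img}" convert -f "$src_fmt" -O qcow2 "$src" "$dst"
    else
        # Without -f, qemu-img probes the source format itself.
        "${QEMU_IMG:-qemu-img}" convert -O qcow2 "$src" "$dst"
    fi
}
```

After loading the qemu module, a call such as
`convert_to_qcow2 win.vmdk win.qcow2 vmdk` would write the converted
image next to the original, which is left untouched.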

[More about QEMU
Images](http://en.wikibooks.org/wiki/QEMU/Images)

### Optimize image of your virtual machine

Use virtio devices (for disk/drive and network adapter) and install
virtio drivers (paravirtualized drivers) into virtual machine. There is
significant performance gain when using virtio drivers. For more
information see [Virtio
Linux](http://www.linux-kvm.org/page/Virtio) and [Virtio
Windows](http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers).

Disable all unnecessary services and tasks. Restrict all unnecessary
operating system operations.

Remove all unnecessary software and files.

Remove all paging space, swap files, partitions, etc.

Shrink your image. (It is recommended to zero all free space and
reconvert image using qemu-img.)
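
For a Linux guest, zeroing the free space could be sketched as below.
The function name `zero_free_space`, the fill file name and the optional
size cap are assumptions made for illustration; run it inside the guest
before shutting it down and reconverting the image on the host.

```shell
# Hypothetical guest-side helper (Linux guest): overwrite free space with
# zeros so that a later qemu-img reconversion can leave those blocks out.
zero_free_space() {
    dir=${1:-/}        # a directory on the filesystem to clean
    limit_mb=${2:-}    # optional cap in MiB; empty means "until full"
    fill="$dir/zero.fill"

    if [ -n "$limit_mb" ]; then
        dd if=/dev/zero of="$fill" bs=1048576 count="$limit_mb" 2>/dev/null
    else
        # dd stops with "No space left on device"; that is expected here.
        dd if=/dev/zero of="$fill" bs=1048576 2>/dev/null || true
    fi
    sync
    rm -f "$fill"
}
```

On the host, the image can then be reconverted, for example with
`qemu-img convert -O qcow2 old.qcow2 shrunk.qcow2` (file names are
placeholders).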

### Modify your image for running jobs

Your image should run some kind of operating system startup script. The
startup script should run the application and, when the application
exits, shut down or quit the virtual machine.

We recommend that the startup script:

-   maps the Job Directory from the host (from the compute node)
-   runs a script (we call it the "run script") from the Job Directory
    and waits for the application's exit
    -   for management purposes, if the run script does not exist, waits
        for some time period (a few minutes)
-   shuts down/quits the OS

For Windows operating systems we suggest using a Local Group Policy
Startup script; for Linux operating systems rc.local, a runlevel init
script or a similar service.

Example startup script for Windows virtual machine:

    @echo off
    set LOG=c:\startup.log
    set MAPDRIVE=z:
    set SCRIPT=%MAPDRIVE%\run.bat
    set TIMEOUT=300

    rem The first message creates (truncates) the log, the rest append to it
    echo %DATE% %TIME% Running startup script>%LOG%

    rem Mount share
    echo %DATE% %TIME% Mounting shared drive>>%LOG%
    net use z: \\10.0.2.4\qemu >>%LOG% 2>&1
    dir z:\ >>%LOG% 2>&1
    echo. >>%LOG%

    if exist %MAPDRIVE% (
      echo %DATE% %TIME% The drive "%MAPDRIVE%" exists>>%LOG%

      if exist %SCRIPT% (
        echo %DATE% %TIME% The script file "%SCRIPT%" exists>>%LOG%
        echo %DATE% %TIME% Running script %SCRIPT%>>%LOG%
        rem No waiting is needed after a regular run
        set TIMEOUT=0
        call %SCRIPT%
      ) else (
        echo %DATE% %TIME% The script file "%SCRIPT%" does not exist>>%LOG%
      )

    ) else (
      echo %DATE% %TIME% The drive "%MAPDRIVE%" does not exist>>%LOG%
    )
    echo. >>%LOG%

    timeout /T %TIMEOUT%

    echo %DATE% %TIME% Shut down>>%LOG%
    shutdown /s /t 0

The example startup script maps the shared job directory as drive z: and
looks for a run script called run.bat. If the run script is found, it is
run; otherwise the startup script waits for 5 minutes. In both cases the
virtual machine is then shut down.
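
For a Linux guest, the same logic might be sketched as follows, for
instance called from rc.local. The mount point `/mnt/job`, the run
script name `run.sh` and the function names are assumptions made for
this example; the share `//10.0.2.4/qemu` is the one offered by the user
network backend, and the guest is assumed to have CIFS mount support
installed.

```shell
# Sketch of a Linux guest startup script (assumed names: /mnt/job, run.sh).
JOB_MNT=${JOB_MNT:-/mnt/job}
WAIT_SECONDS=${WAIT_SECONDS:-300}

# Run the run script if present; otherwise wait, so that the image can
# still be reached for management before the machine shuts down.
run_job_or_wait() {
    script="$1/run.sh"
    if [ -f "$script" ]; then
        echo "$(date) Running $script"
        sh "$script"
    else
        echo "$(date) $script not found, waiting $2 s"
        sleep "$2"
    fi
}

# Full sequence; call this from rc.local or a similar service (as root).
vm_startup() {
    mkdir -p "$JOB_MNT"
    # 10.0.2.4 is the SMB server of the QEMU user network backend.
    mount -t cifs //10.0.2.4/qemu "$JOB_MNT" -o guest
    run_job_or_wait "$JOB_MNT" "$WAIT_SECONDS"
    poweroff
}
```

The block only defines the functions; the startup service would end with
a call to `vm_startup`.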

### Create job script for executing virtual machine

Create the job script according to the recommended
[Virtual Machine Job
Workflow](virtualization.html#virtual-machine-job-workflow).

Example job for Windows virtual machine:

    #!/bin/sh

    JOB_DIR=/scratch/$USER/win/${PBS_JOBID}

    # Virtual machine settings
    VM_IMAGE=~/work/img/win.img
    VM_MEMORY=49152
    VM_SMP=16

    # Prepare job dir
    mkdir -p ${JOB_DIR} && cd ${JOB_DIR} || exit 1
    ln -s ~/work/win .
    ln -s /scratch/$USER/data .
    ln -s ~/work/win/script/run/run-appl.bat run.bat

    # Run virtual machine (snapshot file goes to local scratch)
    export TMPDIR=/lscratch/${PBS_JOBID}
    module add qemu
    qemu-system-x86_64 \
      -enable-kvm \
      -cpu host \
      -smp ${VM_SMP} \
      -m ${VM_MEMORY} \
      -vga std \
      -localtime \
      -usb -usbdevice tablet \
      -device virtio-net-pci,netdev=net0 \
      -netdev user,id=net0,smb=${JOB_DIR},hostfwd=tcp::3389-:3389 \
      -drive file=${VM_IMAGE},media=disk,if=virtio \
      -snapshot \
      -nographic

The job script links application data (win), input data (data) and the
run script (run.bat) into the job directory and runs the virtual machine.

Example run script (run.bat) for Windows virtual machine:

    z:
    cd \win\appl
    call application.bat z:\data z:\output

The run script runs the application from the shared job directory
(mapped as drive z:), processes the input data (z:\data) from the job
directory and stores the output to the job directory (z:\output).

### Run jobs

Run jobs as usual, see  [Resource Allocation and Job
Execution](../../resource-allocation-and-job-execution/introduction.html).
Use only full node allocation for virtualization jobs.

### Running Virtual Machines

Virtualization is enabled only on compute nodes; it does not
work on login nodes.
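
A job script can guard against being started on the wrong node with a
small check like the one below. It is only a sketch, assuming that KVM
is exposed through the usual `/dev/kvm` device node; the function name
`kvm_available` is hypothetical.

```shell
# Hypothetical check: KVM acceleration needs read/write access to /dev/kvm.
kvm_available() {
    dev=${1:-/dev/kvm}   # device path; parameterized only for testing
    [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]
}
```

A typical use would be `kvm_available || exit 1` before the
qemu-system-x86_64 call.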

Load QEMU environment module:

    $ module add qemu

Get help

    $ man qemu

Run virtual machine (simple)

    $ qemu-system-x86_64 -hda linux.img -enable-kvm -cpu host -smp 16 -m 32768 -vga std -vnc :0

    $ qemu-system-x86_64 -hda win.img   -enable-kvm -cpu host -smp 16 -m 32768 -vga std -localtime -usb -usbdevice tablet -vnc :0

You can access the virtual machine with a VNC viewer (option -vnc) by
connecting to the IP address of the compute node. For VNC you must use
the [VPN network](../../accessing-the-cluster/vpn-access.html).
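
The display number given to -vnc follows the usual VNC convention:
display N is served on TCP port 5900 + N. The helper below (its name is
hypothetical) just makes that arithmetic explicit for viewers that
expect a port rather than a display number.

```shell
# VNC display N conventionally listens on TCP port 5900 + N.
vnc_port() {
    echo $((5900 + $1))
}

vnc_port 0   # prints 5900, matching "-vnc :0"
```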

Install virtual machine from iso file

    $ qemu-system-x86_64 -hda linux.img -enable-kvm -cpu host -smp 16 -m 32768 -vga std -cdrom linux-install.iso -boot d -vnc :0

    $ qemu-system-x86_64 -hda win.img   -enable-kvm -cpu host -smp 16 -m 32768 -vga std -localtime -usb -usbdevice tablet -cdrom win-install.iso -boot d -vnc :0

Run virtual machine using optimized devices, user network backend with
sharing and port forwarding, in snapshot mode

    $ qemu-system-x86_64 -drive file=linux.img,media=disk,if=virtio -enable-kvm -cpu host -smp 16 -m 32768 -vga std -device virtio-net-pci,netdev=net0 -netdev user,id=net0,smb=/scratch/$USER/tmp,hostfwd=tcp::2222-:22 -vnc :0 -snapshot

    $ qemu-system-x86_64 -drive file=win.img,media=disk,if=virtio -enable-kvm -cpu host -smp 16 -m 32768 -vga std -localtime -usb -usbdevice tablet -device virtio-net-pci,netdev=net0 -netdev user,id=net0,smb=/scratch/$USER/tmp,hostfwd=tcp::3389-:3389 -vnc :0 -snapshot

Thanks to port forwarding you can access the virtual machine via SSH
(Linux) or RDP (Windows) by connecting to the IP address of the compute
node (and port 2222 for SSH). You must use the [VPN
network](../../accessing-the-cluster/vpn-access.html).

Keep in mind that if you use virtio devices, you must have virtio
drivers installed in your virtual machine.

### Networking and data sharing

For virtual machine networking we suggest using the (default) user
network backend (sometimes called slirp). This network backend NATs
virtual machines and provides useful services for them, such as DHCP,
DNS, SMB sharing and port forwarding.

In the default configuration the IP network 10.0.2.0/24 is used: the
host has IP address 10.0.2.2, the DNS server 10.0.2.3, the SMB server
10.0.2.4, and virtual machines obtain addresses from the range
10.0.2.15-10.0.2.31. Virtual machines have access to Anselm's network
via NAT on the compute node (host).

Simple network setup

    $ qemu-system-x86_64 ... -net nic -net user

(It is default when no -net options are given.)

Simple network setup with sharing and port forwarding (obsolete but
simpler syntax, lower performance)

    $ qemu-system-x86_64 ... -net nic -net user,smb=/scratch/$USER/tmp,hostfwd=tcp::3389-:3389

Optimized network setup with sharing and port forwarding

    $ qemu-system-x86_64 ... -device virtio-net-pci,netdev=net0 -netdev user,id=net0,smb=/scratch/$USER/tmp,hostfwd=tcp::2222-:22

### Advanced networking

**Internet access**

Sometimes your virtual machine needs access to the internet (to install
software, updates, software activation, etc.). We suggest a solution
using Virtual Distributed Ethernet (VDE) enabled QEMU with SLIRP running
on a login node and tunnelled to the compute node. Be aware that this
setup has the lowest performance of all the described solutions.

Load VDE enabled QEMU environment module (unload standard QEMU module
first if necessary).

    $ module add qemu/2.1.2-vde2

Create virtual network switch.

    $ vde_switch -sock /tmp/sw0 -mgmt /tmp/sw0.mgmt -daemon

Run SLIRP daemon over SSH tunnel on login node and connect it to virtual
network switch.

    $ dpipe vde_plug /tmp/sw0 = ssh login1 $VDE2_DIR/bin/slirpvde -s - --dhcp &

Run qemu using vde network backend, connect to created virtual switch.

Basic setup (obsolete syntax)

    $ qemu-system-x86_64 ... -net nic -net vde,sock=/tmp/sw0

Setup using virtio device (obsolete syntax)

    $ qemu-system-x86_64 ... -net nic,model=virtio -net vde,sock=/tmp/sw0

Optimized setup

    $ qemu-system-x86_64 ... -device virtio-net-pci,netdev=net0 -netdev vde,id=net0,sock=/tmp/sw0

**TAP interconnect**

Both the user and vde network backends have low performance. For a fast
interconnect (10Gbps and more) between the compute node (host) and the
virtual machine (guest) we suggest using a Linux kernel TAP device.

Cluster Anselm provides the TAP device tap0 for your job. The TAP
interconnect does not provide any services (like NAT, DHCP, DNS, SMB,
etc.), just raw networking, so you have to provide these services
yourself if you need them.

Run qemu with TAP network backend:

    $ qemu-system-x86_64 ... -device virtio-net-pci,netdev=net1 \
                           -netdev tap,id=net1,ifname=tap0,script=no,downscript=no

Interface tap0 has IP address 192.168.1.1 and network mask 255.255.255.0
(/24). In the virtual machine use an IP address from the range
192.168.1.2-192.168.1.254. For your convenience some ports on the tap0
interface are redirected to higher numbered ports, so that you, as a
non-privileged user, can provide services on these ports.

Redirected ports:

-   DNS udp/53->udp/3053, tcp/53->tcp/3053
-   DHCP udp/67->udp/3067
-   SMB tcp/139->tcp/3139, tcp/445->tcp/3445

You can configure the IP address of the virtual machine statically or
dynamically. For dynamic addressing, provide your own DHCP server on
port 3067 of the tap0 interface. You can also provide your own DNS
server on port 3053 of the tap0 interface, for example:

    $ dnsmasq --interface tap0 --bind-interfaces -p 3053 --dhcp-alternate-port=3067,68 --dhcp-range=192.168.1.15,192.168.1.32 --dhcp-leasefile=/tmp/dhcp.leasefile

You can also provide your SMB services (on ports 3139, 3445) to obtain
high performance data sharing.

Example smb.conf (not optimized)

    [global]
    socket address=192.168.1.1
    smb ports = 3445 3139

    private dir=/tmp/qemu-smb
    pid directory=/tmp/qemu-smb
    lock directory=/tmp/qemu-smb
    state directory=/tmp/qemu-smb
    ncalrpc dir=/tmp/qemu-smb/ncalrpc
    log file=/tmp/qemu-smb/log.smbd
    smb passwd file=/tmp/qemu-smb/smbpasswd
    security = user
    map to guest = Bad User
    unix extensions = no
    load printers = no
    printing = bsd
    printcap name = /dev/null
    disable spoolss = yes
    log level = 1
    guest account = USER
    [qemu]
    path=/scratch/USER/tmp
    read only=no
    guest ok=yes
    writable=yes
    follow symlinks=yes
    wide links=yes
    force user=USER

(Replace USER with your login name.)
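
The substitution can be scripted, for example with sed. This is only a
sketch: the function name `render_smb_conf` and the idea of keeping the
configuration above in a template file are assumptions of this example.
The match is case-sensitive, so "Bad User" and "security = user" stay
unchanged.

```shell
# Hypothetical helper: fill the USER placeholder of an smb.conf template.
render_smb_conf() {
    template=$1              # file containing the USER placeholder
    login=${2:-$(id -un)}    # defaults to the current login name
    # Assumes the login name contains no "/" (the sed delimiter).
    sed "s/USER/$login/g" "$template"
}
```

For instance, `render_smb_conf smb.conf.template > /tmp/qemu-smb/smb.conf`
(after creating `/tmp/qemu-smb`) would produce the file used in the next
step; the template file name is a placeholder.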

Run SMB services

    $ smbd -s /tmp/qemu-smb/smb.conf

 

A virtual machine can of course have more than one network interface
controller and can use more than one network backend. So, for example,
you can combine the user network backend and the TAP interconnect.

### Snapshot mode

In snapshot mode the image is not written to; changes are written to a
temporary file (and discarded after the virtual machine exits). **This
is the strongly recommended mode for running your jobs.** Set the TMPDIR
environment variable to a local scratch directory to control where the
temporary files are placed.

    $ export TMPDIR=/lscratch/${PBS_JOBID}
    $ qemu-system-x86_64 ... -snapshot
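
When the same script is also used outside a PBS job (for example while
testing an image), the choice of TMPDIR can be made defensive. The
sketch below is an assumption-laden example: the function name
`snapshot_tmpdir`, the `LSCRATCH_ROOT` override and the fallback to
`/tmp` are not part of Anselm's documentation.

```shell
# Pick a directory for QEMU's temporary snapshot file.
# Falling back to /tmp is an assumption of this sketch, not site policy.
snapshot_tmpdir() {
    candidate="${LSCRATCH_ROOT:-/lscratch}/${PBS_JOBID:-}"
    if [ -n "${PBS_JOBID:-}" ] && [ -d "$candidate" ] && [ -w "$candidate" ]; then
        echo "$candidate"
    else
        echo /tmp
    fi
}
```

It would be used as `export TMPDIR=$(snapshot_tmpdir)` before starting
qemu-system-x86_64 with -snapshot.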

### Windows guests

For Windows guests we recommend these options, life will be easier:

    $ qemu-system-x86_64 ... -localtime -usb -usbdevice tablet