
# Introduction
Welcome to the Barbora supercomputer cluster. The Barbora cluster consists of 201 compute nodes, totaling 7232 compute cores with 44544 GB RAM, giving over 848 TFLOP/s of theoretical peak performance.
Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network and are equipped with Intel Cascade Lake processors. A few nodes are also equipped with NVIDIA Tesla V100-SXM2 GPUs. Read more in [Hardware Overview][1].
The cluster runs an operating system compatible with the Red Hat [Linux family][a]. We have installed a wide range of software packages targeted at different scientific domains. These packages are accessible via the [modules environment][2].
Shared file systems for user data and for job data are available to users.
The [Slurm][b] workload manager provides [computing resources allocation and job execution][3].
Read more on how to [apply for resources][4], [obtain login credentials][5], and [access the cluster][6].
![](img/BullSequanaX.png)
[1]: hardware-overview.md
[2]: ../environment-and-modules.md
[3]: ../general/resources-allocation-policy.md
[4]: ../general/applying-for-resources.md
[5]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[6]: ../general/shell-and-data-access.md
[a]: http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg
[b]: https://slurm.schedmd.com/
# Network
All of the compute and login nodes of Barbora are interconnected through an [InfiniBand][a] HDR 200 Gbps network and a Gigabit Ethernet network.
Compute nodes and the service infrastructure are connected by the HDR100 technology,
which allows one 200 Gbps HDR port (an aggregation of 4x 50 Gbps) to be divided into two HDR100 ports with 100 Gbps (2x 50 Gbps) bandwidth each.
The cabling between the L1 and L2 layers is realized by HDR cabling,
while the end devices are connected by so-called Y (splitter) cables (1x HDR200 to 2x HDR100).
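As a quick sanity check of the lane arithmetic above (a toy calculation; the lane counts and speeds are the figures from the text):

```python
# An HDR200 port aggregates 4 lanes at 50 Gbps each; a Y-cable splits it
# into two HDR100 ports of 2 lanes (2x 50 Gbps) each.
LANE_GBPS = 50

hdr200_bw = 4 * LANE_GBPS   # 200 Gbps aggregate port
hdr100_bw = 2 * LANE_GBPS   # 100 Gbps per split port

print(f"HDR200 port: {hdr200_bw} Gbps")
print(f"split into 2x HDR100: {hdr100_bw} Gbps each")

# the split is lossless: two HDR100 ports consume the full HDR200 bandwidth
assert 2 * hdr100_bw == hdr200_bw
```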
![](img/hdr.jpg)
**The computing network implemented in this way meets the following parameters:**
* 100Gbps
* Latencies less than 10 microseconds (0.6 μs end-to-end, <90ns switch hop)
* Adaptive routing support
* MPI communication support
* IP protocol support (IPoIB)
* Support for SCRATCH Data Storage and NVMe over Fabric Data Storage.
## Mellanox QM8700 40-Ports Switch
**Performance**
* 40x HDR 200 Gb/s ports in a 1U switch
* 80x HDR100 100 Gb/s ports in a 1U switch
* 16 Tb/s aggregate switch throughput
* Up to 15.8 billion messages per second
* 90ns switch latency
**Optimized Design**
* 1+1 redundant & hot-swappable power
* 80 Plus Gold and Energy Star certified power supplies
* Dual-core x86 CPU
**Advanced Design**
* Adaptive routing
* Collective offloads (Mellanox SHARP technology)
* VL mapping (VL2VL)
![](img/QM8700.jpg)
## BullSequana XH2000 HDRx WH40 MODULE
* Mellanox QM8700 switch modified for direct liquid cooling (Atos Cold Plate), with form factor for installing the Bull Sequana XH2000 rack
![](img/XH2000.png)
[a]: http://en.wikipedia.org/wiki/InfiniBand
# Storage
There are three main shared file systems on the Barbora cluster: [HOME][1], [SCRATCH][2], and [PROJECT][5]. All login and compute nodes may access the same data on the shared file systems. Compute nodes are also equipped with local (non-shared) scratch, RAM disk, and tmp file systems.
## Archiving
Do not use the shared filesystems as a backup for large amounts of data or as a long-term archiving solution. Academic staff and students of research institutions in the Czech Republic can use the [CESNET storage service][3], which is available via SSHFS.
## Shared Filesystems
The Barbora cluster provides three main shared filesystems: the [HOME filesystem][1], the [SCRATCH filesystem][2], and the [PROJECT filesystem][5].
All filesystems are accessible via the InfiniBand network.
The HOME and PROJECT filesystems are realized as NFS filesystems.
The SCRATCH filesystem is realized as a parallel Lustre filesystem.
Extended ACLs are provided on the Lustre filesystem for sharing data with other users using fine-grained control.
### Understanding the Lustre Filesystems
A user file on the [Lustre filesystem][a] can be divided into multiple chunks (stripes) and stored across a subset of the object storage targets (OSTs) (disks). The stripes are distributed among the OSTs in a round-robin fashion to ensure load balancing.
When a client (a compute node from your job) needs to create or access a file, the client queries the metadata server (MDS) and the metadata target (MDT) for the layout and location of the [file's stripes][b]. Once the file is opened and the client obtains the striping information, the MDS is no longer involved in the file I/O process. The client interacts directly with the object storage servers (OSSes) and OSTs to perform I/O operations such as locking, disk allocation, storage, and retrieval.
If multiple clients try to read and write the same part of a file at the same time, the Lustre distributed lock manager enforces coherency, so that all clients see consistent results.
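The round-robin placement described above can be illustrated with a toy model. This is a sketch only, not Lustre's actual allocator; the function name and the OST list are hypothetical:

```python
def ost_for_offset(offset, stripe_size, stripe_count, ost_ids):
    """Map a byte offset within a file to the OST holding it, assuming
    stripes are laid out round-robin over stripe_count OSTs."""
    stripe_index = offset // stripe_size          # which stripe the byte falls in
    return ost_ids[stripe_index % stripe_count]   # round-robin over the chosen OSTs

stripe_size = 1 << 20            # 1 MB, the Barbora default
assert stripe_size % 65536 == 0  # stripe size must be a multiple of 65,536 bytes

osts = [0, 1, 2, 3, 4]           # e.g. the 5 SCRATCH OSTs
# the first five 1 MB stripes land on OSTs 0..4; the 6th wraps back to OST 0
print([ost_for_offset(i * stripe_size, stripe_size, 5, osts) for i in range(6)])
# [0, 1, 2, 3, 4, 0]
```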
There is a default stripe configuration for Barbora Lustre filesystems. However, users can set the following stripe parameters for their own directories or files to get optimum I/O performance:
1. `stripe_size` - the size of the chunk in bytes; specify with k, m, or g to use units of KB, MB, or GB, respectively; the size must be an even multiple of 65,536 bytes; the default is 1MB for all Barbora Lustre filesystems
1. `stripe_count` - the number of OSTs to stripe across; the default is 1 for all Barbora Lustre filesystems; one can specify -1 to use all OSTs in the filesystem
1. `stripe_offset` - the index of the OST where the first stripe is to be placed; the default is -1, which results in random selection; using a non-default value is NOT recommended
!!! note
    Setting stripe size and stripe count correctly for your needs may significantly affect the I/O performance.
Use the `lfs getstripe` command for getting the stripe parameters. Use `lfs setstripe` for setting the stripe parameters to get optimal I/O performance. The correct stripe setting depends on your needs and file access patterns.
```console
$ lfs getstripe dir|filename
$ lfs setstripe -s stripe_size -c stripe_count -o stripe_offset dir|filename
```
Example:
```console
$ lfs getstripe /scratch/projname
$ lfs setstripe -c -1 /scratch/projname
$ lfs getstripe /scratch/projname
```
In this example, we view the current stripe setting of the `/scratch/projname/` directory. The stripe count is changed to use all OSTs and verified. All files written to this directory will be striped over all 5 OSTs.
Use `lfs check osts` to see the number and status of active OSTs for each filesystem on Barbora. Learn more by reading the man page:
```console
$ lfs check osts
$ man lfs
```
### Hints on Lustre Striping
!!! note
    Increase the `stripe_count` for parallel I/O to the same file.
When multiple processes are writing blocks of data to the same file in parallel, the I/O performance for large files will improve when the `stripe_count` is set to a larger value. The stripe count sets the number of OSTs to which the file will be written. By default, the stripe count is set to 1. While this default setting provides for efficient access of metadata (for example to support the `ls -l` command), large files should use stripe counts of greater than 1. This will increase the aggregate I/O bandwidth by using multiple OSTs in parallel instead of just one. A rule of thumb is to use a stripe count approximately equal to the number of gigabytes in the file.
Another good practice is to make the stripe count be an integral factor of the number of processes performing the write in parallel, so that you achieve load balance among the OSTs. For example, set the stripe count to 16 instead of 15 when you have 64 processes performing the writes.
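The two rules of thumb above can be combined into a small helper. This is an illustrative sketch; the function name and rounding policy are our own, not an official recommendation:

```python
def suggest_stripe_count(file_size_gb, n_writers, max_osts=5):
    """Suggest a stripe count: roughly one stripe per GB of file, capped by
    the number of OSTs and writers, then rounded down to an integral factor
    of n_writers so the writers spread evenly over the OSTs."""
    count = max(1, min(int(file_size_gb), max_osts, n_writers))
    while n_writers % count != 0:   # walk down to the nearest integral factor
        count -= 1
    return count

# 64 processes writing a 20 GB file on a 5-OST filesystem:
print(suggest_stripe_count(20, 64))   # -> 4 (5 does not divide 64)
print(suggest_stripe_count(20, 60))   # -> 5
```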
!!! note
    Using a large stripe size can improve performance when accessing very large files.
Large stripe size allows each client to have exclusive access to its own part of a file. However, it can be counterproductive in some cases if it does not match your I/O pattern. The choice of stripe size has no effect on a single-stripe file.
Read more [here][c].
### Lustre on Barbora
The architecture of Lustre on Barbora is composed of two metadata servers (MDS) and two data/object storage servers (OSS).
Configuration of the SCRATCH storage:
* 2x Metadata server
* 2x Object storage server
* Lustre object storage
    * One disk array NetApp E2800
    * 54x 8TB 10kRPM 2.5" SAS HDDs
    * 5x RAID6 (8+2) OSTs (object storage targets)
    * 4 hot-spare disks
* Lustre metadata storage
    * One disk array NetApp E2600
    * 12x 300GB 15kRPM SAS disks
    * 2 groups of 5 disks in RAID5 (metadata targets)
    * 2 hot-spare disks
### HOME File System
The HOME filesystem is mounted in the /home directory. Users' home directories /home/username reside on this filesystem. Accessible capacity is 28TB, shared among all users. Individual users are restricted by filesystem usage quotas, set to 25GB per user. Should 25GB prove insufficient, contact [support][d]; the quota may be lifted upon request.
!!! note
    The HOME filesystem is intended for preparation, evaluation, processing and storage of data generated by active Projects.
    The HOME filesystem should not be used to archive data of past Projects or other unrelated data.
The files on HOME filesystem will not be deleted until the end of the [user's lifecycle][4].
The filesystem is backed up, so that it can be restored in case of a catastrophic failure resulting in significant data loss. However, this backup is not intended to restore old versions of user data or to restore (accidentally) deleted files.
| HOME filesystem | |
| -------------------- | --------------- |
| Mountpoint           | /home/username  |
| Capacity | 28TB |
| Throughput | 1GB/s |
| User space quota | 25GB |
| User inodes quota | 500K |
| Protocol | NFS |
### SCRATCH File System
The SCRATCH filesystem is realized as a Lustre parallel file system and is available from all login and compute nodes. There are 5 OSTs dedicated to the SCRATCH file system.
The SCRATCH filesystem is mounted in the `/scratch/project/PROJECT_ID` directory, created automatically with the `PROJECT_ID` project. Accessible capacity is 310TB, shared among all users. Individual users are restricted by filesystem usage quotas, set to 10TB per user. The purpose of this quota is to prevent runaway programs from filling the entire filesystem, denying service to other users. Should 10TB prove insufficient, contact [support][d]; the quota may be lifted upon request.
!!! note
    The SCRATCH filesystem is intended for temporary scratch data generated during the calculation as well as for high-performance access to input and output files. All I/O intensive jobs must use the SCRATCH filesystem as their working directory.
    Users are advised to save the necessary data from the SCRATCH filesystem to the HOME filesystem after the calculations and clean up the scratch files.
!!! warning
    Files on the SCRATCH filesystem that are **not accessed for more than 90 days** will be automatically **deleted**.
The default stripe size is 1MB and the default stripe count is 1.
!!! note
    Setting stripe size and stripe count correctly for your needs may significantly affect the I/O performance.
| SCRATCH filesystem | |
| -------------------- | --------- |
| Mountpoint | /scratch |
| Capacity | 310TB |
| Throughput | 5GB/s |
| Throughput [Burst] | 38GB/s |
| User space quota | 10TB |
| User inodes quota | 10M |
| Default stripe size | 1MB |
| Default stripe count | 1 |
| Number of OSTs | 5 |
### PROJECT File System
The PROJECT data storage is the central storage for project and user data at IT4Innovations, accessible from all clusters.
For more information, see the [PROJECT storage][6] section.
### Disk Usage and Quota Commands
Disk usage and user quotas can be checked and reviewed using the `it4ifsusage` command. You can see an example output [here][9].
To better understand where the space is used, you can use the following command:
```console
$ du -hs dir
```
Example for your HOME directory:
```console
$ cd /home
$ du -hs * .[a-zA-Z0-9]* | grep -E "[0-9]*G|[0-9]*M" | sort -hr
258M cuda-samples
15M .cache
13M .mozilla
5,5M .eclipse
2,7M .idb_13.0_linux_intel64_app
```
This will list all directories consuming megabytes or gigabytes of space in your current directory (HOME in this example). The list is sorted in descending order from largest to smallest files/directories.
### Extended ACLs
Extended ACLs provide an additional security mechanism besides the standard POSIX ACL, which is defined by three entries (for owner/group/others). Extended ACLs have more than the three basic entries. In addition, they also contain a mask entry and may contain any number of named user and named group entries.
ACLs on a Lustre file system work exactly like ACLs on any Linux file system. They are manipulated with the standard tools in the standard manner.
For more information, see the [Access Control List][7] section of the documentation.
## Local Filesystems
### TMP
Each node is equipped with a local /tmp RAMDISK directory. The /tmp directory should be used to work with temporary files. Old files in the /tmp directory are automatically purged.
### SCRATCH and RAMDISK
Each node is equipped with RAMDISK storage accessible at /tmp, /lscratch, and /ramdisk. The RAMDISK capacity is 180GB. Data placed on the RAMDISK occupies the node's RAM (192GB total). The RAMDISK directory should only be used to work with temporary files where very high throughput or I/O performance is required. Old files in the RAMDISK directory are automatically purged at the job's end.
#### Global RAM Disk
The Global RAM disk spans the local RAM disks of all the allocated nodes within a single job.
For more information, see the [Job Features][8] section.
## Summary
| Mountpoint | Usage | Protocol | Net Capacity | Throughput | Limitations | Access | Services |
| ---------- | ------------------------- | -------- | -------------- | ------------------------------ | ----------- | ----------------------- | ------------------------------- |
| /home | home directory | NFS | 28TB | 1GB/s | Quota 25GB | Compute and login nodes | backed up |
| /scratch   | scratch temporary         | Lustre   | 310TB          | 5GB/s, 30GB/s burst buffer     | Quota 10TB  | Compute and login nodes | files older than 90 days removed automatically |
| /lscratch | local scratch ramdisk | tmpfs | 180GB | 130GB/s | none | Node local | auto purged after job end |
[1]: #home-file-system
[2]: #scratch-file-system
[3]: ../storage/cesnet-storage.md
[4]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[5]: #project-file-system
[6]: ../storage/project-storage.md
[7]: ../storage/standard-file-acl.md
[8]: ../job-features.md#global-ram-disk
[9]: ../storage/project-storage.md#project-quotas
[a]: http://www.nas.nasa.gov
[b]: http://www.nas.nasa.gov/hecc/support/kb/Lustre_Basics_224.html#striping
[c]: http://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace
[d]: https://support.it4i.cz/rt
[e]: http://man7.org/linux/man-pages/man1/nfs4_setfacl.1.html
# Visualization Servers
Remote visualization with [VirtualGL][3] is available on two nodes.
* 2 nodes
* 32 cores in total
* 2x Intel Skylake Gold 6130, 16-core@2.1 GHz processors per node
* 192 GB DDR4 2667 MT/s of physical memory per node (12x 16 GB)
* BullSequana X450-E5 blade servers
* 2150.4 GFLOP/s per compute node
* 1x 1 Gb Ethernet and 2x 10 Gb Ethernet
* 1x HDR100 IB port
* 2x SSD 240 GB
![](img/bullsequanaX450-E5.png)
## NVIDIA Quadro P6000
* GPU Memory: 24 GB GDDR5X
* Memory Interface: 384-bit
* Memory Bandwidth: Up to 432 GB/s
* NVIDIA CUDA® Cores: 3840
* System Interface: PCI Express 3.0 x16
* Max Power Consumption: 250 W
* Thermal Solution: Active
* Form Factor: 4.4”H x 10.5” L, Dual Slot, Full Height
* Display Connectors: 4x DP 1.4 + DVI-D DL
* Max Simultaneous Displays: 4 direct, 4 DP1.4 Multi-Stream
* Max DP 1.4 Resolution: 7680 x 4320 @ 30 Hz
* Max DVI-D DL Resolution: 2560 x 1600 @ 60 Hz
* Graphics APIs: Shader Model 5.1, OpenGL 4.5, DirectX 12.0, Vulkan 1.0
* Compute APIs: CUDA, DirectCompute, OpenCL™
* Floating-Point Performance-Single Precision: 12.6 TFLOP/s, Peak
![](img/quadrop6000.jpg)
## Resource Allocation Policy
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
|-------|----------------|-------------------|-------|-----------|----------|---------------|----------|
| qviz Visualization queue | yes | none required | 2 | 4 | 150 | no | 1h/8h |
## References
* [Graphical User Interface][1]
* [VPN Access][2]
[1]: ../general/shell-and-data-access.md#graphical-user-interface
[2]: ../general/shell-and-data-access.md#vpn-access
[3]: ../software/viz/vgl.md
# e-INFRA CZ Cloud Ostrava
The Ostrava cloud consists of 22 nodes from the [Karolina][a] supercomputer.
The cloud site is built on top of OpenStack,
a free and open-standard cloud computing platform.
## Access
To access the cloud, you must:
* have an [e-Infra CZ account][3],
* be a member of an [active project][b].
The dashboard is available at [https://ostrava.openstack.cloud.e-infra.cz/][6].
You can specify resources/quotas for your project.
For more information, see the [Quota Limits][5] section.
## Creating First Instance
To create your first VM instance, follow the [e-INFRA CZ guide][4].
Note that the guide is similar for clouds in Brno and Ostrava,
so make sure that you follow steps for Ostrava cloud where applicable.
### Process Automation
You can automate the process using Terraform or the OpenStack client.
#### Terraform
Prerequisites:
* Linux/Mac/WSL terminal BASH shell
* installed Terraform and sshuttle
* downloaded [application credentials][9] from OpenStack Horizon dashboard and saved as a `project_openrc.sh.inc` text file
Follow the guide: [https://code.it4i.cz/terraform][8]
#### OpenStack
Prerequisites:
* Linux/Mac/WSL terminal BASH shell
* installed [OpenStack client][7]
Follow the guide: [https://code.it4i.cz/commandline][10]
Run commands:
```console
source project_openrc.sh.inc
```
```console
./cmdline-demo.sh basic-infrastructure-1
```
## Technical Reference
For the list of deployed OpenStack services, see the [list of components][1].
More information can be found on the [e-INFRA CZ website][2].
[1]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-site/openstack-components/
[2]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-site/
[3]: https://docs.e-infra.cz/account/
[4]: https://docs.e-infra.cz/compute/openstack/getting-started/creating-first-infrastructure/
[5]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-g2-site/quota-limits/
[6]: https://ostrava.openstack.cloud.e-infra.cz/
[7]: https://docs.fuga.cloud/how-to-use-the-openstack-cli-tools-on-linux
[8]: https://code.it4i.cz/dvo0012/infrastructure-by-script/-/tree/main/openstack-infrastructure-as-code-automation/clouds/g2/ostrava/general/terraform
[9]: https://docs.e-infra.cz/compute/openstack/how-to-guides/obtaining-api-key/
[10]: https://code.it4i.cz/dvo0012/infrastructure-by-script/-/tree/main/openstack-infrastructure-as-code-automation/clouds/g2/ostrava/general/commandline
[a]: ../karolina/introduction.md
[b]: ../general/access/project-access.md
# IT4I Cloud
The IT4I cloud consists of 14 nodes from the [Karolina][a] supercomputer.
The cloud site is built on top of OpenStack,
a free and open-standard cloud computing platform.
!!! Note
    The guide describes steps for personal projects.<br>
    Some steps may differ for large projects.<br>
    For large projects, apply for resources to the [Allocation Committee][11].
## Access
To access the cloud you must be a member of an active EUROHPC project,
or fall into the **Access Category B**, i.e. [Access For Thematic HPC Resource Utilisation][11].
A personal OpenStack project is required. Request one by contacting [IT4I Support][12].
The dashboard is available at [https://cloud.it4i.cz][6].
You can see quotas set for the IT4I Cloud in the [Quota Limits][f] section.
## Creating First Instance
To create your first VM instance, follow the steps below:
### Log In
Go to [https://cloud.it4i.cz][6], enter your LDAP username and password and choose the `IT4I_LDAP` domain. After you sign in, you will be redirected to the dashboard.
![](../img/login.png)
### Create Key Pair
An SSH key is required for remote access to your instance.
1. Go to **Project > Compute > Key Pairs** and click the **Create Key Pair** button.
![](../img/keypairs.png)
1. In the Create Key Pair window, name your key pair, select `SSH Key` for key type and confirm by clicking Create Key Pair.
![](../img/keypairs1.png)
1. Download and manage the private key according to your operating system.
### Update Security Group
To be able to remotely access your VM instance, you have to allow access in the security group.
1. Go to **Project > Network > Security Groups** and click on **Manage Rules** for the default security group.
![](../img/securityg.png)
1. Click on **Add Rule**, choose **SSH**, and leave the remaining fields unchanged.
![](../img/securityg1.png)
### Create VM Instance
1. In **Compute > Instances**, click **Launch Instance**.
![](../img/instance.png)
1. Choose Instance Name, Description, and number of instances. Click **Next**.
![](../img/instance1.png)
1. Choose an image from which to boot the instance. Choose to delete the volume after instance delete. Click **Next**.
![](../img/instance2.png)
1. Choose the hardware resources of the instance by selecting a flavor. Additional volumes for data can be attached later on. Click **Next**.
![](../img/instance3.png)
1. Select the network and continue to **Security Groups**.
![](../img/instance4.png)
1. Allocate the security group with SSH rule that you added in the [Update Security Group](it4i-cloud.md#update-security-group) step. Then click **Next** to go to the **Key Pair**.
![](../img/securityg2.png)
1. Select the key that you created in the [Create Key Pair][g] section and launch the instance.
![](../img/instance5.png)
### Associate Floating IP
1. Click on the **Associate** button next to the floating IP.
![](../img/floatingip.png)
1. Select Port to be associated with the instance, then click the **Associate** button.
Now you can join the VM using your preferred SSH client.
## Process Automation
You can automate the process using the OpenStack client.
### OpenStack
Prerequisites:
* Linux/Mac/WSL terminal BASH shell
* installed [OpenStack client][7]
Follow the guide: [https://code.it4i.cz/commandline][10]
Run commands:
```console
source project_openrc.sh.inc
```
```console
./cmdline-demo.sh basic-infrastructure-1
```
[1]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-site/openstack-components/
[2]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-site/
[3]: https://docs.e-infra.cz/account/
[4]: https://docs.e-infra.cz/compute/openstack/getting-started/creating-first-infrastructure/
[5]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-g2-site/quota-limits/
[6]: https://cloud.it4i.cz
[7]: https://docs.fuga.cloud/how-to-use-the-openstack-cli-tools-on-linux
[8]: https://code.it4i.cz/dvo0012/infrastructure-by-script/-/tree/main/openstack-infrastructure-as-code-automation/clouds/g2/ostrava/general/terraform
[9]: https://docs.e-infra.cz/compute/openstack/how-to-guides/obtaining-api-key/
[10]: https://code.it4i.cz/dvo0012/infrastructure-by-script/-/tree/main/openstack-infrastructure-as-code-automation/clouds/g2/ostrava/general/commandline
[11]: https://www.it4i.cz/en/for-users/computing-resources-allocation
[12]: mailto:support@it4i.cz
[a]: ../karolina/introduction.md
[b]: ../general/access/project-access.md
[c]: einfracz-cloud.md
[d]: ../general/accessing-the-clusters/vpn-access.md
[e]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[f]: it4i-quotas.md
[g]: it4i-cloud.md#create-key-pair
# IT4I Cloud Quotas
| Resource | Quota |
|---------------------------------------|-------|
| Instances | 10 |
| VCPUs | 20 |
| RAM | 32GB |
| Volumes | 20 |
| Volume Snapshots | 12 |
| Volume Storage | 500 |
| Floating-IPs | 1 |
| Security Groups | 10 |
| Security Group Rules | 100 |
| Networks | 1 |
| Ports | 10 |
| Routers | 1 |
| Backups | 12 |
| Groups | 10 |
| rbac_policies | 10 |
| Subnets | 1 |
| Subnet_pools | -1 |
| Fixed-ips | -1 |
| Injected-file-size | 10240 |
| Injected-path-size | 255 |
| Injected-files | 5 |
| Key-pairs | 100 |
| Properties | 128 |
| Server-groups | 10 |
| Server-group-members | 10 |
| Backup-gigabytes | 1002 |
| Per-volume-gigabytes | -1 |
```yaml
host: irods.it4i.cz
port: 1247
proxy_user: some_user
client_user: some_user
zone: IT4I
authscheme: "pam"
ssl_ca_cert_file: "~/.irods/chain_geant_ov_rsa_ca_4_full.pem"
ssl_encryption_key_size: 32
ssl_encryption_algorithm: "AES-256-CBC"
ssl_encryption_salt_size: 8
ssl_encryption_hash_rounds: 16
path_mappings:
  - irods_path: /IT4I/home/some_user
    mapping_path: /
    resource_type: dir
```
# Accessing Complementary Systems
Complementary systems can be accessed at `login.cs.it4i.cz`
by any user with an active account assigned to an active project.
**SSH is required** to access Complementary systems.
## Data Storage
### Home
The `/home` file system is shared across all Complementary systems. Note that this file system is **not** shared with the file system on IT4I clusters.
### Scratch
There are local `/lscratch` storages on individual nodes.
### PROJECT
Complementary systems are connected to the [PROJECT storage][1].
[1]: ../storage/project-storage.md
# Using AMD Partition
To test your application on the AMD partition,
prepare a job script for that partition or use an interactive job:
```console
salloc -N 1 -c 64 -A PROJECT-ID -p p03-amd --gres=gpu:4 --time=08:00:00
```
where:
- `-N 1` means allocating one server,
- `-c 64` means allocating 64 cores,
- `-A` is your project,
- `-p p03-amd` is the AMD partition,
- `--gres=gpu:4` means allocating all 4 GPUs of the node,
- `--time=08:00:00` means an allocation for 8 hours.
You also have the option to allocate only a subset of the resources,
by reducing `-c` and `--gres=gpu` to smaller values.
```console
salloc -N 1 -c 48 -A PROJECT-ID -p p03-amd --gres=gpu:3 --time=08:00:00
salloc -N 1 -c 32 -A PROJECT-ID -p p03-amd --gres=gpu:2 --time=08:00:00
salloc -N 1 -c 16 -A PROJECT-ID -p p03-amd --gres=gpu:1 --time=08:00:00
```
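The allocations above keep CPU cores proportional to GPUs (16 cores per GPU on a 64-core node with 4 GPUs). A minimal sketch making the ratio explicit (the helper name is ours; the node geometry is taken from the text):

```python
NODE_CORES = 64   # cores per p03-amd node
NODE_GPUS = 4     # GPUs per p03-amd node

def cores_for_gpus(n_gpus):
    """Cores to request so the CPU and GPU shares of the node stay proportional."""
    return n_gpus * (NODE_CORES // NODE_GPUS)

for g in range(1, NODE_GPUS + 1):
    print(f"salloc -N 1 -c {cores_for_gpus(g)} --gres=gpu:{g}")
```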
!!! Note
    The p03-amd01 server has hyperthreading **enabled**, therefore `htop` shows 128 cores.<br>
    The p03-amd02 server has hyperthreading **disabled**, therefore `htop` shows 64 cores.
## Using AMD MI100 GPUs
The AMD GPUs can be programmed using the [ROCm open-source platform](https://docs.amd.com/).
ROCm and related libraries are installed directly in the system.
You can find them here:
```console
/opt/rocm/
```
The installed version can be found here:
```console
[user@p03-amd02.cs]$ cat /opt/rocm/.info/version
5.5.1-74
```
## Basic HIP Code
The first way to program AMD GPUs is to use HIP.
The basic vector addition code in HIP looks as follows.
This is a complete code and you can copy and paste it into a file.
For this example, we use `vector_add.hip.cpp`.
```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

__global__ void add_vectors(float * x, float * y, float alpha, int count)
{
    long long idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx < count)
        y[idx] += alpha * x[idx];
}

int main()
{
    // number of elements in the vectors
    long long count = 10;

    // allocation and initialization of data on the host (CPU memory)
    float * h_x = new float[count];
    float * h_y = new float[count];
    for(long long i = 0; i < count; i++)
    {
        h_x[i] = i;
        h_y[i] = 10 * i;
    }

    // print the input data
    printf("X:");
    for(long long i = 0; i < count; i++)
        printf(" %7.2f", h_x[i]);
    printf("\n");
    printf("Y:");
    for(long long i = 0; i < count; i++)
        printf(" %7.2f", h_y[i]);
    printf("\n");

    // allocation of memory on the GPU device
    float * d_x;
    float * d_y;
    hipMalloc(&d_x, count * sizeof(float));
    hipMalloc(&d_y, count * sizeof(float));

    // copy the data from host memory to the device
    hipMemcpy(d_x, h_x, count * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y, count * sizeof(float), hipMemcpyHostToDevice);

    int tpb = 256;
    int bpg = (count - 1) / tpb + 1;

    // launch the kernel on the GPU
    add_vectors<<< bpg, tpb >>>(d_x, d_y, 100, count);
    // hipLaunchKernelGGL(add_vectors, bpg, tpb, 0, 0, d_x, d_y, 100, count);

    // copy the result back to CPU memory
    hipMemcpy(h_y, d_y, count * sizeof(float), hipMemcpyDeviceToHost);

    // print the results
    printf("Y:");
    for(long long i = 0; i < count; i++)
        printf(" %7.2f", h_y[i]);
    printf("\n");

    // free the allocated memory
    hipFree(d_x);
    hipFree(d_y);
    delete[] h_x;
    delete[] h_y;

    return 0;
}
```
To compile the code, we use the `hipcc` compiler.
For compiler information, use `hipcc --version`:
```console
[user@p03-amd02.cs ~]$ hipcc --version
HIP version: 5.5.30202-eaf00c0b
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.5.1 23194 69ef12a7c3cc5b0ccf820bc007bd87e8b3ac3037)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.5.1/llvm/bin
```
The code is compiled as follows:
```console
hipcc vector_add.hip.cpp -o vector_add.x
```
The correct output of the code is:
```console
[user@p03-amd02.cs ~]$ ./vector_add.x
X: 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00
Y: 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00
Y: 0.00 110.00 220.00 330.00 440.00 550.00 660.00 770.00 880.00 990.00
```
More details on HIP programming are available in the [HIP Programming Guide](https://docs.amd.com/bundle/HIP-Programming-Guide-v5.5/page/Introduction_to_HIP_Programming_Guide.html).
## HIP and ROCm Libraries
The list of official AMD libraries can be found [here](https://docs.amd.com/category/libraries).
The libraries are installed in the same directory as ROCm:
```console
/opt/rocm/
```
The following libraries are installed:
```console
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipblas
drwxr-xr-x 3 root root 17 Jun 7 14:09 hipblas-clients
drwxr-xr-x 3 root root 29 Jun 7 14:09 hipcub
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipfft
drwxr-xr-x 3 root root 25 Jun 7 14:09 hipfort
drwxr-xr-x 4 root root 32 Jun 7 14:09 hiprand
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipsolver
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipsparse
```
and
```console
drwxr-xr-x 4 root root 32 Jun 7 14:09 rocalution
drwxr-xr-x 4 root root 44 Jun 7 14:09 rocblas
drwxr-xr-x 4 root root 44 Jun 7 14:09 rocfft
drwxr-xr-x 4 root root 32 Jun 7 14:09 rocprim
drwxr-xr-x 4 root root 32 Jun 7 14:09 rocrand
drwxr-xr-x 4 root root 44 Jun 7 14:09 rocsolver
drwxr-xr-x 4 root root 44 Jun 7 14:09 rocsparse
drwxr-xr-x 3 root root 29 Jun 7 14:09 rocthrust
```
## Using hipBLAS Library
The basic code in HIP that uses hipBLAS looks as follows.
This is a complete code and you can copy and paste it into a file.
For this example, we use `hipblas.hip.cpp`.
```console
#include <cstdio>
#include <vector>
#include <cstdlib>
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>
int main()
{
srand(9600);
int width = 10;
int height = 7;
int elem_count = width * height;
// initialization of data in CPU memory
float * h_A;
hipHostMalloc(&h_A, elem_count * sizeof(*h_A));
for(int i = 0; i < elem_count; i++)
h_A[i] = (100.0f * rand()) / (float)RAND_MAX;
printf("Matrix A:\n");
for(int r = 0; r < height; r++)
{
for(int c = 0; c < width; c++)
printf("%6.3f ", h_A[r + height * c]);
printf("\n");
}
float * h_x;
hipHostMalloc(&h_x, width * sizeof(*h_x));
for(int i = 0; i < width; i++)
h_x[i] = (100.0f * rand()) / (float)RAND_MAX;
printf("vector x:\n");
for(int i = 0; i < width; i++)
printf("%6.3f ", h_x[i]);
printf("\n");
float * h_y;
hipHostMalloc(&h_y, height * sizeof(*h_y));
for(int i = 0; i < height; i++)
        h_y[i] = 100.0f + i;
printf("vector y:\n");
for(int i = 0; i < height; i++)
        printf("%6.3f ", h_y[i]);
printf("\n");
// initialization of data in GPU memory
float * d_A;
size_t pitch_A;
hipMallocPitch((void**)&d_A, &pitch_A, height * sizeof(*d_A), width);
hipMemcpy2D(d_A, pitch_A, h_A, height * sizeof(*d_A), height * sizeof(*d_A), width, hipMemcpyHostToDevice);
int lda = pitch_A / sizeof(float);
float * d_x;
hipMalloc(&d_x, width * sizeof(*d_x));
hipMemcpy(d_x, h_x, width * sizeof(*d_x), hipMemcpyHostToDevice);
float * d_y;
hipMalloc(&d_y, height * sizeof(*d_y));
hipMemcpy(d_y, h_y, height * sizeof(*d_y), hipMemcpyHostToDevice);
// basic calculation of the result on the CPU
float alpha=2.0f, beta=10.0f;
for(int i = 0; i < height; i++)
h_y[i] *= beta;
for(int r = 0; r < height; r++)
for(int c = 0; c < width; c++)
h_y[r] += alpha * h_x[c] * h_A[r + height * c];
printf("result y CPU:\n");
for(int i = 0; i < height; i++)
printf("%6.3f ", h_y[i]);
printf("\n");
// calculation of the result on the GPU using the hipBLAS library
hipblasHandle_t blas_handle;
hipblasCreate(&blas_handle);
hipblasSgemv(blas_handle, HIPBLAS_OP_N, height, width, &alpha, d_A, lda, d_x, 1, &beta, d_y, 1);
hipDeviceSynchronize();
hipblasDestroy(blas_handle);
// copy the GPU result to CPU memory and print it
hipMemcpy(h_y, d_y, height * sizeof(*d_y), hipMemcpyDeviceToHost);
printf("result y BLAS:\n");
for(int i = 0; i < height; i++)
printf("%6.3f ", h_y[i]);
printf("\n");
// free all the allocated memory
hipFree(d_A);
hipFree(d_x);
hipFree(d_y);
hipHostFree(h_A);
hipHostFree(h_x);
hipHostFree(h_y);
return 0;
}
```
The code compilation can be done as follows:
```console
hipcc hipblas.hip.cpp -o hipblas.x -lhipblas
```
## Using HipSolver Library
A basic HIP code that uses hipSOLVER looks like this.
It is a complete program that you can copy and paste into a file;
in this example, we use `hipsolver.hip.cpp`.
```cpp
#include <cstdio>
#include <vector>
#include <cstdlib>
#include <algorithm>
#include <hipsolver/hipsolver.h>
#include <hipblas/hipblas.h>
int main()
{
srand(63456);
int size = 10;
// allocation and initialization of data on host. this time we use std::vector
int h_A_ld = size;
int h_A_pitch = h_A_ld * sizeof(float);
std::vector<float> h_A(size * h_A_ld);
for(int r = 0; r < size; r++)
for(int c = 0; c < size; c++)
h_A[r * h_A_ld + c] = (10.0 * rand()) / RAND_MAX;
printf("System matrix A:\n");
for(int r = 0; r < size; r++)
{
for(int c = 0; c < size; c++)
printf("%6.3f ", h_A[r * h_A_ld + c]);
printf("\n");
}
std::vector<float> h_b(size);
for(int i = 0; i < size; i++)
h_b[i] = (10.0 * rand()) / RAND_MAX;
printf("RHS vector b:\n");
for(int i = 0; i < size; i++)
printf("%6.3f ", h_b[i]);
printf("\n");
std::vector<float> h_x(size);
// memory allocation on the device and initialization
float * d_A;
size_t d_A_pitch;
    hipMallocPitch((void**)&d_A, &d_A_pitch, size * sizeof(float), size);
int d_A_ld = d_A_pitch / sizeof(float);
float * d_b;
hipMalloc(&d_b, size * sizeof(float));
float * d_x;
hipMalloc(&d_x, size * sizeof(float));
int * d_piv;
hipMalloc(&d_piv, size * sizeof(int));
int * info;
hipMallocManaged(&info, sizeof(int));
hipMemcpy2D(d_A, d_A_pitch, h_A.data(), h_A_pitch, size * sizeof(float), size, hipMemcpyHostToDevice);
hipMemcpy(d_b, h_b.data(), size * sizeof(float), hipMemcpyHostToDevice);
// solving the system using hipSOLVER
hipsolverHandle_t solverHandle;
hipsolverCreate(&solverHandle);
int wss_trf, wss_trs; // wss = WorkSpace Size
hipsolverSgetrf_bufferSize(solverHandle, size, size, d_A, d_A_ld, &wss_trf);
hipsolverSgetrs_bufferSize(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, d_A_ld, d_piv, d_b, size, &wss_trs);
float * workspace;
int wss = std::max(wss_trf, wss_trs);
hipMalloc(&workspace, wss * sizeof(float));
hipsolverSgetrf(solverHandle, size, size, d_A, d_A_ld, workspace, wss, d_piv, info);
hipsolverSgetrs(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, d_A_ld, d_piv, d_b, size, workspace, wss, info);
hipMemcpy(d_x, d_b, size * sizeof(float), hipMemcpyDeviceToDevice);
hipMemcpy(h_x.data(), d_x, size * sizeof(float), hipMemcpyDeviceToHost);
printf("Solution vector x:\n");
for(int i = 0; i < size; i++)
printf("%6.3f ", h_x[i]);
printf("\n");
hipFree(workspace);
hipsolverDestroy(solverHandle);
// perform matrix-vector multiplication A*x using hipBLAS to check if the solution is correct
hipblasHandle_t blasHandle;
hipblasCreate(&blasHandle);
float alpha = 1;
float beta = 0;
hipMemcpy2D(d_A, d_A_pitch, h_A.data(), h_A_pitch, size * sizeof(float), size, hipMemcpyHostToDevice);
hipblasSgemv(blasHandle, HIPBLAS_OP_N, size, size, &alpha, d_A, d_A_ld, d_x, 1, &beta, d_b, 1);
hipDeviceSynchronize();
hipblasDestroy(blasHandle);
for(int i = 0; i < size; i++)
h_b[i] = 0;
hipMemcpy(h_b.data(), d_b, size * sizeof(float), hipMemcpyDeviceToHost);
printf("Check multiplication vector Ax:\n");
for(int i = 0; i < size; i++)
printf("%6.3f ", h_b[i]);
printf("\n");
// free all the allocated memory
hipFree(info);
hipFree(d_piv);
hipFree(d_x);
hipFree(d_b);
hipFree(d_A);
return 0;
}
```
The code compilation can be done as follows:
```console
hipcc hipsolver.hip.cpp -o hipsolver.x -lhipblas -lhipsolver
```
## Using OpenMP Offload to Program AMD GPUs
The ROCm™ installation includes an LLVM-based implementation that fully supports the OpenMP 4.5 standard
and a subset of the OpenMP 5.0 standard.
Fortran, C/C++ compilers, and corresponding runtime libraries are included.
The OpenMP toolchain is automatically installed as part of the standard ROCm installation
and is available under `/opt/rocm/llvm`. The sub-directories are:
- `bin` : Compilers (flang and clang) and other binaries.
- `examples` : The usage section below shows how to compile and run these programs.
- `include` : Header files.
- `lib` : Libraries including those required for target offload.
- `lib-debug` : Debug versions of the above libraries.
More information can be found in the [AMD OpenMP Support Guide](https://docs.amd.com/bundle/OpenMP-Support-Guide-v5.5/page/Introduction_to_OpenMP_Support_Guide.html).
## Compilation of OpenMP Code
A basic example that uses OpenMP offload follows.
Again, the code is complete and can be copied and pasted into a file.
Here we use `vadd.cpp`.
```cpp
#include <cstdio>
#include <cstdlib>
int main(int argc, char ** argv)
{
long long count = 1 << 20;
if(argc > 1)
count = atoll(argv[1]);
long long print_count = 16;
if(argc > 2)
print_count = atoll(argv[2]);
long long * a = new long long[count];
long long * b = new long long[count];
long long * c = new long long[count];
#pragma omp parallel for
for(long long i = 0; i < count; i++)
{
a[i] = i;
b[i] = 10 * i;
}
printf("A: ");
for(long long i = 0; i < print_count; i++)
printf("%3lld ", a[i]);
printf("\n");
printf("B: ");
for(long long i = 0; i < print_count; i++)
printf("%3lld ", b[i]);
printf("\n");
#pragma omp target map(to: a[0:count],b[0:count]) map(from: c[0:count])
#pragma omp teams distribute parallel for
for(long long i = 0; i < count; i++)
{
c[i] = a[i] + b[i];
}
printf("C: ");
for(long long i = 0; i < print_count; i++)
printf("%3lld ", c[i]);
printf("\n");
delete[] a;
delete[] b;
delete[] c;
return 0;
}
```
This code can be compiled like this:
```console
/opt/rocm/llvm/bin/clang++ -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 vadd.cpp -o vadd.x
```
These options are required for target offload from an OpenMP program:
- `-target x86_64-pc-linux-gnu`
- `-fopenmp`
- `-fopenmp-targets=amdgcn-amd-amdhsa`
- `-Xopenmp-target=amdgcn-amd-amdhsa`
The `-march` flag specifies the architecture of the targeted GPU.
You need to change it when moving, for instance, to LUMI with its MI250X GPUs.
The MI100 GPUs present in the complementary systems have the code `gfx908`:
- `-march=gfx908`

!!! note
    You also have to include one of the `-O0`, `-O1`, `-O2`, or `-O3` optimization flags.
    Without an optimization flag, the execution of the compiled code fails.
# Using ARM Partition
For testing your application on the ARM partition,
you need to prepare a job script for that partition or use the interactive job:
```
salloc -A PROJECT-ID -p p01-arm
```
On the partition, you should reload the list of modules:
```
ml architecture/aarch64
```
For compilation, the `GCC` compilers and `OpenMPI` are available.
Hence, the compilation process should be the same as on the `x64` architecture.
Let's have the following `hello world` example:
```cpp
#include <cstdio>
#include "mpi.h"
#include "omp.h"
int main(int argc, char **argv)
{
int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#pragma omp parallel
{
printf("Hello on rank %d, thread %d\n", rank, omp_get_thread_num());
}
MPI_Finalize();
}
```
You can compile and run the example:
```
ml OpenMPI/4.1.4-GCC-11.3.0
mpic++ -fopenmp hello.cpp -o hello
mpirun -n 4 ./hello
```
Please see [gcc options](https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html) for more advanced compilation settings.
No complications are expected as long as the application does not use any intrinsic for `x64` architecture.
If you want to use intrinsic,
[SVE](https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics) instruction set is available.
# Using NVIDIA Grace Partition
For testing your application on the NVIDIA Grace Partition,
you need to prepare a job script for that partition or use the interactive job:
```console
salloc -N 1 -c 144 -A PROJECT-ID -p p11-grace --time=08:00:00
```
where:
- `-N 1` allocates a single node,
- `-c 144` allocates 144 cores,
- `-p p11-grace` selects the NVIDIA Grace partition,
- `--time=08:00:00` allocates the resources for 8 hours.
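The job script alternative mentioned above can be sketched as follows (the account and application name are placeholders):

```bash
#!/bin/bash
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p11-grace
#SBATCH --nodes=1
#SBATCH --cpus-per-task=144
#SBATCH --time=08:00:00

ml NVHPC            # load one of the available toolchains
./my_application
```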
## Available Toolchains
The platform offers three toolchains:
- Standard GCC (as a module `ml GCC`)
- [NVHPC](https://developer.nvidia.com/hpc-sdk) (as a module `ml NVHPC`)
- [Clang for NVIDIA Grace](https://developer.nvidia.com/grace/clang) (installed in `/opt/nvidia/clang`)
!!! note
    The NVHPC toolchain showed strong results with a minimal amount of tuning in our initial evaluation.
### GCC Toolchain
The GCC compiler seems to struggle with the vectorization of short (constant-length) loops, which tend to get completely unrolled/eliminated instead of being vectorized. For example, a simple nested loop such as
```cpp
for(int i = 0; i < 1000000; ++i) {
// Iterations dependent in "i"
// ...
for(int j = 0; j < 8; ++j) {
// but independent in "j"
// ...
}
}
```
may emit scalar code for the inner loop leading to no vectorization being used at all.
### Clang (For Grace) Toolchain
Clang/LLVM tends to behave similarly, but it can be guided to properly vectorize the inner loop with either the flags `-O3 -ffast-math -march=native -fno-unroll-loops -mllvm -force-vector-width=8`, or pragmas such as `#pragma clang loop vectorize_width(8)` and `#pragma clang loop unroll(disable)`:
```cpp
for(int i = 0; i < 1000000; ++i) {
// Iterations dependent in "i"
// ...
#pragma clang loop unroll(disable) vectorize_width(8)
for(int j = 0; j < 8; ++j) {
// but independent in "j"
// ...
}
}
```
!!! note
    Our basic experiments show that fixed-width vectorization (NEON) tends to perform better than SVE in the case of short (register-length) loops. In cases like the one above, where the specified `vectorize_width` is larger than the available vector unit width, Clang will emit multiple NEON instructions (e.g., 4 instructions will be emitted to process 8 64-bit operations in the 128-bit units of Grace).
### NVHPC Toolchain
The NVHPC toolchain handled the aforementioned case without any additional tuning. A simple `-O3 -march=native -fast` should therefore be sufficient.
## Basic Math Libraries
The basic libraries (BLAS and LAPACK) are included in the NVHPC toolchain and can be linked simply with `-lblas` and `-llapack`, respectively (`lp64` and `ilp64` versions are also included).
!!! note
    The Grace platform doesn't include a CUDA-capable GPU, therefore `nvcc` will fail with an error. This means that `nvc`, `nvc++`, and `nvfortran` should be used instead.
### NVIDIA Performance Libraries
The [NVPL](https://developer.nvidia.com/nvpl) package includes a more extensive set of libraries in both sequential and multi-threaded versions:
- BLACS: `-lnvpl_blacs_{lp64,ilp64}_{mpich,openmpi3,openmpi4,openmpi5}`
- BLAS: `-lnvpl_blas_{lp64,ilp64}_{seq,gomp}`
- FFTW: `-lnvpl_fftw`
- LAPACK: `-lnvpl_lapack_{lp64,ilp64}_{seq,gomp}`
- ScaLAPACK: `-lnvpl_scalapack_{lp64,ilp64}`
- RAND: `-lnvpl_rand` or `-lnvpl_rand_mt`
- SPARSE: `-lnvpl_sparse`
This package should be compatible with all available toolchains and includes CMake module files for easy integration into CMake-based projects. For further documentation, see [NVPL](https://docs.nvidia.com/nvpl).
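Since the package ships CMake module files, a CMake-based project might integrate it along these lines (a sketch; the component and imported target names here are assumptions that mirror the library naming scheme above and should be checked against the NVPL documentation):

```cmake
cmake_minimum_required(VERSION 3.20)
project(myprog C)

# Hypothetical NVPL usage; "blas" component and "nvpl::blas_lp64_gomp"
# target names follow the nvpl_blas_{lp64,ilp64}_{seq,gomp} scheme above.
find_package(nvpl REQUIRED COMPONENTS blas)

add_executable(myprog myprog.c)
target_link_libraries(myprog PRIVATE nvpl::blas_lp64_gomp)
```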
### Recommended BLAS Library
We recommend using the multi-threaded BLAS library from the NVPL package.
!!! note
    It is important to pin the processes using **OMP_PROC_BIND=spread**.
Example:
```console
$ ml NVHPC
$ nvc -O3 -march=native myprog.c -o myprog -lnvpl_blas_lp64_gomp
$ OMP_PROC_BIND=spread ./myprog
```
## Basic Communication Libraries
The OpenMPI 4 implementation is included with the NVHPC toolchain and is exposed as a module (`ml OpenMPI`). The following example
```cpp
#include <cstdio>
#include <mpi.h>
#include <sched.h>
#include <omp.h>
int main(int argc, char **argv)
{
int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#pragma omp parallel
{
printf("Hello on rank %d, thread %d on CPU %d\n", rank, omp_get_thread_num(), sched_getcpu());
}
MPI_Finalize();
}
```
can be compiled and run as follows
```console
ml OpenMPI
mpic++ -fast -fopenmp hello.cpp -o hello
OMP_PROC_BIND=close OMP_NUM_THREADS=4 mpirun -np 4 --map-by slot:pe=36 ./hello
```
In this configuration, we run 4 ranks, each bound to one quarter of the cores (36 cores per rank) and running 4 OpenMP threads.
## Simple BLAS Application
The `hello world` example application (written in `C++` and `Fortran`) uses a simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS 3 routine).
Stationary probability vector estimation in `C++`:
```cpp
#include <iostream>
#include <vector>
#include <chrono>
#include "cblas.h"
const size_t ITERATIONS = 32;
const size_t MATRIX_SIZE = 1024;
int main(int argc, char *argv[])
{
const size_t matrixElements = MATRIX_SIZE*MATRIX_SIZE;
std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));
for(size_t i = 0; i < MATRIX_SIZE; ++i)
a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
a[0] = 0.5f;
std::vector<float> w1(matrixElements, 0.0f);
std::vector<float> w2(matrixElements, 0.0f);
std::copy(a.begin(), a.end(), w1.begin());
std::vector<float> *t1, *t2;
t1 = &w1;
t2 = &w2;
auto c1 = std::chrono::steady_clock::now();
for(size_t i = 0; i < ITERATIONS; ++i)
{
std::fill(t2->begin(), t2->end(), 0.0f);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
1.0f, t1->data(), MATRIX_SIZE,
a.data(), MATRIX_SIZE,
1.0f, t2->data(), MATRIX_SIZE);
std::swap(t1, t2);
}
auto c2 = std::chrono::steady_clock::now();
for(size_t i = 0; i < MATRIX_SIZE; ++i)
{
std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";
}
std::cout << std::endl;
std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;
return 0;
}
```
Stationary probability vector estimation in `Fortran`:
```fortran
program main
implicit none
integer :: matrix_size, iterations
integer :: i
real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
real, pointer :: out_data(:), out_diag(:)
integer :: cr, cm, c1, c2
iterations = 32
matrix_size = 1024
call system_clock(count_rate=cr)
call system_clock(count_max=cm)
allocate(a(matrix_size, matrix_size))
allocate(w1(matrix_size, matrix_size))
allocate(w2(matrix_size, matrix_size))
a(:,:) = 1.0 / real(matrix_size)
a(:,1) = 0.5 / real(matrix_size - 1)
a(1,1) = 0.5
w1 = a
w2(:,:) = 0.0
t1 => w1
t2 => w2
call system_clock(c1)
    do i = 1, iterations
t2(:,:) = 0.0
call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)
tmp => t1
t1 => t2
t2 => tmp
end do
call system_clock(c2)
out_data(1:size(t1)) => t1
out_diag => out_data(1::matrix_size+1)
print *, out_diag
print *, "Elapsed Time: ", (c2 - c1) / real(cr)
deallocate(a)
deallocate(w1)
deallocate(w2)
end program main
```
### Using NVHPC Toolchain
The C++ version of the example can be compiled with NVHPC and run as follows:
```console
ml NVHPC
nvc++ -O3 -march=native -fast -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lblas main.cpp -o main
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main
```
The Fortran version is just as simple:
```console
ml NVHPC
nvfortran -O3 -march=native -fast -lblas main.f90 -o main.x
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main.x
```
!!! note
    It may be advantageous to use the NVPL libraries instead of the NVHPC ones. For example, the DGEMM BLAS 3 routine from NVPL is almost 30% faster than the NVHPC one.
### Using Clang (For Grace) Toolchain
Similarly, the Clang for Grace toolchain with NVPL BLAS can be used to compile the C++ version of the example.
```console
ml NVHPC
/opt/nvidia/clang/17.23.11/bin/clang++ -O3 -march=native -ffast-math -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lnvpl_blas_lp64_gomp main.cpp -o main
```
!!! note
    The NVHPC module is used just for the `cblas.h` include in this case. This can be avoided by changing the code to use `nvpl_blas.h` instead.
## Additional Resources
- [https://www.nvidia.com/en-us/data-center/grace-cpu-superchip/][1]
- [https://developer.nvidia.com/hpc-sdk][2]
- [https://developer.nvidia.com/grace/clang][3]
- [https://docs.nvidia.com/nvpl][4]
[1]: https://www.nvidia.com/en-us/data-center/grace-cpu-superchip/
[2]: https://developer.nvidia.com/hpc-sdk
[3]: https://developer.nvidia.com/grace/clang
[4]: https://docs.nvidia.com/nvpl
# Heterogeneous Memory Management on Intel Platforms
Partition `p10-intel` offers heterogeneous memory directly exposed to the user, which allows manually picking the appropriate kind of memory at process or even single-allocation granularity. Both kinds of memory are exposed as memory-only NUMA nodes, enabling both coarse-grained (process-level) and fine-grained (allocation-level) control over the memory type used.
## Overview
At the process level, the `numactl` facilities can be utilized, while the Intel-provided `memkind` library allows for finer control. Both the `memkind` library and `numactl` can be accessed by loading the `memkind` module, or the `OpenMPI` module (`numactl` only).
```bash
ml memkind
```
### Process Level (NUMACTL)
`numactl` allows you to either restrict the memory pool of the process to a specific set of NUMA memory nodes
```bash
numactl --membind <node_ids_set>
```
or select a single preferred node
```bash
numactl --preferred <node_id>
```
where `<node_ids_set>` is a comma-separated list (e.g., `0,2,5,...`), possibly in combination with ranges (such as `0-5`). The `--membind` option kills the process if it requests more memory than can be satisfied from the specified nodes. The `--preferred` option instead reverts to using other nodes according to their NUMA distance in the same situation.
A convenient way to check the `numactl` configuration is
```bash
numactl -s
```
which prints the configuration of its execution environment, e.g.:
```bash
numactl --membind 8-15 numactl -s
policy: bind
preferred node: 0
physcpubind: 0 1 2 ... 189 190 191
cpubind: 0 1 2 3 4 5 6 7
nodebind: 0 1 2 3 4 5 6 7
membind: 8 9 10 11 12 13 14 15
```
The last row shows that memory allocations are restricted to NUMA nodes `8-15`.
### Allocation Level (MEMKIND)
The `memkind` library (in its simplest use case) offers a new variant of the `malloc/free` function pair, which allows specifying the kind of memory to be used for a given allocation. Moving a specific allocation from the default to the HBM memory pool can then be achieved by replacing:
```cpp
void *pData = malloc(<SIZE>);
/* ... */
free(pData);
```
with
```cpp
#include <memkind.h>
void *pData = memkind_malloc(MEMKIND_HBW, <SIZE>);
/* ... */
memkind_free(NULL, pData); // "kind" parameter is deduced from the address
```
Similarly, other memory kinds (such as `MEMKIND_DEFAULT`, `MEMKIND_REGULAR`, `MEMKIND_HBW`, or `MEMKIND_HBW_ALL`) can be chosen.
!!! note
    The allocation will return a `NULL` pointer when memory of the specified kind is not available.
## High Bandwidth Memory (HBM)
Intel Sapphire Rapids (partition `p10-intel`) consists of two sockets each with `128GB` of DDR and `64GB` on-package HBM memory. The machine is configured in FLAT mode and therefore exposes HBM memory as memory-only NUMA nodes (`16GB` per 12-core tile). The configuration can be verified by running
```bash
numactl -H
```
which should show 16 NUMA nodes (`0-7` should contain 12 cores and `32GB` of DDR DRAM, while `8-15` should have no cores and `16GB` of HBM each).
![](../../img/cs/guides/p10_numa_sc4_flat.png)
### Process Level
With this we can easily restrict application to DDR DRAM or HBM memory:
```bash
# Only DDR DRAM
numactl --membind 0-7 ./stream
# ...
Function Best Rate MB/s Avg time Min time Max time
Copy: 369745.8 0.043355 0.043273 0.043588
Scale: 366989.8 0.043869 0.043598 0.045355
Add: 378054.0 0.063652 0.063483 0.063899
Triad: 377852.5 0.063621 0.063517 0.063884
# Only HBM
numactl --membind 8-15 ./stream
# ...
Function Best Rate MB/s Avg time Min time Max time
Copy: 1128430.1 0.015214 0.014179 0.015615
Scale: 1045065.2 0.015814 0.015310 0.016309
Add: 1096992.2 0.022619 0.021878 0.024182
Triad: 1065152.4 0.023449 0.022532 0.024559
```
The DDR DRAM achieves a bandwidth of around 400 GB/s, while the HBM clears the 1 TB/s bar.
Some further improvements can be achieved by entirely isolating a process to a single tile. This can be useful for MPI jobs, where `$OMPI_COMM_WORLD_RANK` can be used to bind each process individually. A simple wrapper script to do this may look like
```bash
#!/bin/bash
numactl --membind $((8 + $OMPI_COMM_WORLD_RANK)) $@
```
and can be used as
```bash
mpirun -np 8 --map-by slot:pe=12 membind_wrapper.sh ./stream_mpi
```
(8 tiles with 12 cores each). However, this approach assumes that the `16GB` of HBM memory local to the tile is sufficient for each process (memory cannot spill between tiles). This approach may be significantly more useful in combination with `--preferred` instead of `--membind`, to force a preference for local HBM with spill-over to DDR DRAM. Otherwise
```bash
mpirun -n 8 --map-by slot:pe=12 numactl --membind 8-15 ./stream_mpi
```
is most likely preferable even for MPI workloads. Applying the above approach to the MPI stream benchmark with 8 ranks and 1-24 threads per rank, we can expect these results:
![](../../img/cs/guides/p10_stream_dram.png)
![](../../img/cs/guides/p10_stream_hbm.png)
### Allocation Level
Allocation-level memory kind selection using the `memkind` library can be illustrated with a modified stream benchmark. The benchmark uses three working arrays (A, B, and C), whose allocation can be changed to `memkind_malloc` as follows:
```cpp
#include <memkind.h>
// ...
STREAM_TYPE *a = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *b = (STREAM_TYPE *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *c = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
// ...
memkind_free(NULL, a);
memkind_free(NULL, b);
memkind_free(NULL, c);
```
Arrays A and C are allocated from HBM (`MEMKIND_HBW_ALL`), while DDR DRAM (`MEMKIND_REGULAR`) is used for B.
The code then has to be linked with the `memkind` library:
```bash
gcc -march=native -O3 -fopenmp -lmemkind memkind_stream.c -o memkind_stream
```
and can be run as
```bash
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
OMP_NUM_THREADS=$((N*12)) OMP_PROC_BIND=spread ./memkind_stream
```
While the `memkind` library should be able to detect HBM memory on its own (through `HMAT` and `hwloc`), this is not supported on `p10-intel`. This means that the NUMA nodes representing HBM have to be specified manually using the `MEMKIND_HBW_NODES` environment variable.
![](../../img/cs/guides/p10_stream_memkind.png)
With this setup, we can see that the simple copy operation (`C[i] = A[i]`) achieves bandwidth comparable to the application bound entirely to HBM memory. On the other hand, the scale operation (`B[i] = s*C[i]`) is mostly limited by the DDR DRAM bandwidth. It's also worth noting that operations combining all three arrays perform close to the HBM-only configuration.
## Simple Application
One application that can greatly benefit from the availability of a large slower memory and a smaller faster memory is the computation of a histogram with many bins over a large dataset.
```cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <cmath>
#include <cstring>
#include <omp.h>
#include <memkind.h>
const size_t N_DATA_SIZE = 2 * 1024 * 1024 * 1024ull;
const size_t N_BINS_COUNT = 1 * 1024 * 1024ull;
const size_t N_ITERS = 10;
#if defined(HBM)
#define DATA_MEMKIND MEMKIND_REGULAR
#define BINS_MEMKIND MEMKIND_HBW_ALL
#else
#define DATA_MEMKIND MEMKIND_REGULAR
#define BINS_MEMKIND MEMKIND_REGULAR
#endif
int main(int argc, char *argv[])
{
const double binWidth = 1.0 / double(N_BINS_COUNT + 1);
double *pData = (double *)memkind_malloc(DATA_MEMKIND, N_DATA_SIZE * sizeof(double));
    size_t *pBins = (size_t *)memkind_malloc(BINS_MEMKIND, N_BINS_COUNT * omp_get_max_threads() * sizeof(size_t));
#pragma omp parallel
{
drand48_data state;
srand48_r(omp_get_thread_num(), &state);
#pragma omp for
for(size_t i = 0; i < N_DATA_SIZE; ++i)
drand48_r(&state, &pData[i]);
}
auto c1 = std::chrono::steady_clock::now();
for(size_t it = 0; it < N_ITERS; ++it)
{
#pragma omp parallel
{
for(size_t i = 0; i < N_BINS_COUNT; ++i)
pBins[omp_get_thread_num()*N_BINS_COUNT + i] = size_t(0);
#pragma omp for
for(size_t i = 0; i < N_DATA_SIZE; ++i)
{
const size_t idx = size_t(pData[i] / binWidth) % N_BINS_COUNT;
pBins[omp_get_thread_num()*N_BINS_COUNT + idx]++;
}
}
}
auto c2 = std::chrono::steady_clock::now();
#pragma omp parallel for
for(size_t i = 0; i < N_BINS_COUNT; ++i)
{
for(size_t j = 1; j < omp_get_max_threads(); ++j)
pBins[i] += pBins[j*N_BINS_COUNT + i];
}
std::cout << "Elapsed Time [s]: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;
size_t total = 0;
#pragma omp parallel for reduction(+:total)
for(size_t i = 0; i < N_BINS_COUNT; ++i)
total += pBins[i];
std::cout << "Total Items: " << total << std::endl;
memkind_free(NULL, pData);
memkind_free(NULL, pBins);
return 0;
}
```
### Using HBM Memory (P10-Intel)
The following commands can be used to compile and run the example application above:
```bash
ml GCC memkind
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
g++ -O3 -fopenmp -lmemkind histogram.cpp -o histogram_dram
g++ -O3 -fopenmp -lmemkind -DHBM histogram.cpp -o histogram_hbm
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_dram
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_hbm
```
Moving the histogram bin data into HBM memory should speed up the algorithm more than twice. It should be noted that also moving the `pData` array into HBM memory worsens this result (presumably because, with the data split, the algorithm can saturate both memory interfaces).
## Additional Resources
- [https://linux.die.net/man/8/numactl][1]
- [http://memkind.github.io/memkind/man_pages/memkind.html][2]
- [https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory][3]
[1]: https://linux.die.net/man/8/numactl
[2]: http://memkind.github.io/memkind/man_pages/memkind.html
[3]: https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory
# Using VMware Horizon
VMware Horizon is a virtual desktop infrastructure (VDI) solution
that enables users to access virtual desktops and applications from any device and any location.
It provides a comprehensive end-to-end solution for managing and delivering virtual desktops and applications,
including features such as session management, user authentication, and virtual desktop provisioning.
![](../../img/horizon.png)
## How to Access VMware Horizon
!!! important
Access to VMware Horizon requires IT4I VPN.
1. Contact [IT4I support][a] with a request for access and VM allocation.
1. [Download][1] and install the VMware Horizon Client for Windows.
1. Add a new server `https://vdi-cs01.msad.it4i.cz/` in the Horizon client.
1. Connect to the server using your IT4I username and password.
Username is in the `domain\username` format and the domain is `msad.it4i.cz`.
For example: `msad.it4i.cz\user123`
## Example
Below is an example of how to mount a remote folder and check the connection on Windows OS:
### Prerequisites
3D applications
* [Blender][3]
SSHFS for remote access
* [sshfs-win][4]
* [winfsp][5]
* [sshfs-win-manager][6]
* ssh keys for access to clusters
### Steps
1. Start the VPN and connect to the server via VMware Horizon Client.
![](../../img/vmware.png)
1. Mount a remote folder.
* Run sshfs-win-manager.
![](../../img/sshfs.png)
* Add a new connection.
![](../../img/sshfs1.png)
* Click on **Connect**.
![](../../img/sshfs2.png)
1. Check that the folder is mounted.
![](../../img/mount.png)
1. Check the GPU resources.
![](../../img/gpu.png)
### Blender
Now if you run, for example, Blender, you can check the available GPU resources in Blender Preferences.
![](../../img/blender.png)
[a]: mailto:support@it4i.cz
[1]: https://vdi-cs01.msad.it4i.cz/
[2]: https://www.paraview.org/download/
[3]: https://www.blender.org/download/
[4]: https://github.com/winfsp/sshfs-win/releases
[5]: https://github.com/winfsp/winfsp/releases/
[6]: https://github.com/evsar3/sshfs-win-manager/releases
# Using IBM Power Partition
For testing your application on the IBM Power partition,
you need to prepare a job script for that partition or use the interactive job:
```console
salloc -N 1 -c 192 -A PROJECT-ID -p p07-power --time=08:00:00
```
where:
- `-N 1` allocates a single node,
- `-c 192` allocates 192 cores (threads),
- `-p p07-power` selects the IBM Power partition,
- `--time=08:00:00` allocates the resources for 8 hours.
On the partition, you should reload the list of modules:
```
ml architecture/ppc64le
```
The platform offers both GNU-based and proprietary IBM toolchains for building applications. IBM also provides an optimized BLAS routines library ([ESSL](https://www.ibm.com/docs/en/essl/6.1)), which can be used with both toolchains.
## Building Applications
Our sample application depends on `BLAS`, therefore we start by loading the following modules (regardless of which toolchain we want to use):
```
ml GCC OpenBLAS
```
### GCC Toolchain
In the case of the GCC toolchain, we can go ahead and compile the application using either `g++`
```
g++ -lopenblas hello.cpp -o hello
```
or `gfortran`
```
gfortran -lopenblas hello.f90 -o hello
```
as usual.
### IBM Toolchain
The IBM toolchain requires additional environment setup, as it is installed in `/opt/ibm` and is not exposed as a module:
```
IBM_ROOT=/opt/ibm
OPENXLC_ROOT=$IBM_ROOT/openxlC/17.1.1
OPENXLF_ROOT=$IBM_ROOT/openxlf/17.1.1
export PATH=$OPENXLC_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENXLC_ROOT/lib:$LD_LIBRARY_PATH
export PATH=$OPENXLF_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENXLF_ROOT/lib:$LD_LIBRARY_PATH
```
From there, we can use either `ibm-clang++`
```
ibm-clang++ -lopenblas hello.cpp -o hello
```
or `xlf`
```
xlf -lopenblas hello.f90 -o hello
```
to build the application as usual.
!!! note
    The combination of `xlf` and `openblas` seems to cause severe performance degradation. Therefore, the `ESSL` library should be preferred (see below).
### Using ESSL Library
The [ESSL](https://www.ibm.com/docs/en/essl/6.1) library is installed in `/opt/ibm/math/essl/7.1`, so we define additional environment variables:
```
IBM_ROOT=/opt/ibm
ESSL_ROOT=${IBM_ROOT}/math/essl/7.1
export LD_LIBRARY_PATH=$ESSL_ROOT/lib64:$LD_LIBRARY_PATH
```
The simplest way to utilize `ESSL` in an application that already uses `BLAS` or `CBLAS` routines is to link against the provided `libessl.so`. This can be done by replacing `-lopenblas` with `-lessl`, or with `-lessl -lopenblas` (in case `ESSL` does not provide all of the required `BLAS` routines).
In practice, this can look like:
```
g++ -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.cpp -o hello
```
or
```
gfortran -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.f90 -o hello
```
and similarly for the IBM compilers (`ibm-clang++` and `xlf`).
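For example, the same `ESSL` link line with the IBM compilers might look like this (a sketch based on the flags above):
```
ibm-clang++ -L${ESSL_ROOT}/lib64 -lessl hello.cpp -o hello
xlf -L${ESSL_ROOT}/lib64 -lessl hello.f90 -o hello
```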
## Hello World Applications
The `hello world` example application (written in `C++` and `Fortran`) uses a simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS level 3 routine).
Stationary probability vector estimation in `C++`:
```c++
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>

#include "cblas.h"

const size_t ITERATIONS  = 32;
const size_t MATRIX_SIZE = 1024;

int main(int argc, char *argv[])
{
    const size_t matrixElements = MATRIX_SIZE * MATRIX_SIZE;

    // Row-stochastic transition matrix: uniform rows, except the first row,
    // which keeps probability 0.5 on the first state.
    std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));
    for(size_t i = 0; i < MATRIX_SIZE; ++i)
        a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
    a[0] = 0.5f;

    std::vector<float> w1(matrixElements, 0.0f);
    std::vector<float> w2(matrixElements, 0.0f);
    std::copy(a.begin(), a.end(), w1.begin());

    std::vector<float> *t1, *t2;
    t1 = &w1;
    t2 = &w2;

    auto c1 = std::chrono::steady_clock::now();

    // Power iteration: t2 = t1 * a, computing successive powers of the matrix.
    for(size_t i = 0; i < ITERATIONS; ++i)
    {
        std::fill(t2->begin(), t2->end(), 0.0f);

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
                    1.0f, t1->data(), MATRIX_SIZE,
                    a.data(), MATRIX_SIZE,
                    1.0f, t2->data(), MATRIX_SIZE);

        std::swap(t1, t2);
    }

    auto c2 = std::chrono::steady_clock::now();

    // Print the diagonal of the resulting matrix and the elapsed time.
    for(size_t i = 0; i < MATRIX_SIZE; ++i)
    {
        std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";
    }
    std::cout << std::endl;

    std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    return 0;
}
```
Stationary probability vector estimation in `Fortran`:
```fortran
program main
    implicit none

    integer :: matrix_size, iterations
    integer :: i
    real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
    real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
    real, pointer :: out_data(:), out_diag(:)
    integer :: cr, cm, c1, c2

    iterations  = 32
    matrix_size = 1024

    call system_clock(count_rate=cr)
    call system_clock(count_max=cm)

    allocate(a(matrix_size, matrix_size))
    allocate(w1(matrix_size, matrix_size))
    allocate(w2(matrix_size, matrix_size))

    ! Column-stochastic transition matrix: uniform columns, except the first,
    ! which keeps probability 0.5 on the first state.
    a(:,:) = 1.0 / real(matrix_size)
    a(:,1) = 0.5 / real(matrix_size - 1)
    a(1,1) = 0.5

    w1 = a
    w2(:,:) = 0.0

    t1 => w1
    t2 => w2

    call system_clock(c1)

    ! Power iteration: t2 = t1 * a, computing successive powers of the matrix
    ! (1 to iterations, matching the ITERATIONS multiplications in the C++ version).
    do i = 1, iterations
        t2(:,:) = 0.0
        call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)

        tmp => t1
        t1  => t2
        t2  => tmp
    end do

    call system_clock(c2)

    ! Print the diagonal of the resulting matrix and the elapsed time.
    out_data(1:size(t1)) => t1
    out_diag => out_data(1::matrix_size+1)

    print *, out_diag
    print *, "Elapsed Time: ", (c2 - c1) / real(cr)

    deallocate(a)
    deallocate(w1)
    deallocate(w2)
end program main
```