
# Introduction
Welcome to the Barbora supercomputer cluster. The Barbora cluster consists of 201 compute nodes, totaling 7232 compute cores with 44544 GB RAM, giving over 848 TFLOP/s of theoretical peak performance.
Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network and are equipped with Intel Cascade Lake processors. A few nodes are also equipped with NVIDIA Tesla V100-SXM2 GPUs. Read more in [Hardware Overview][1].
The cluster runs an operating system compatible with the Red Hat [Linux family][a]. We have installed a wide range of software packages targeted at different scientific domains. These packages are accessible via the [modules environment][2].
Shared file systems for user data and for job data are available to users.
The [Slurm][b] workload manager provides [computing resources allocation and job execution][3].
Read more on how to [apply for resources][4], [obtain login credentials][5], and [access the cluster][6].
![](img/BullSequanaX.png)
[1]: hardware-overview.md
[2]: ../environment-and-modules.md
[3]: ../general/resources-allocation-policy.md
[4]: ../general/applying-for-resources.md
[5]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[6]: ../general/shell-and-data-access.md
[a]: http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg
[b]: https://slurm.schedmd.com/
# Network
All of the compute and login nodes of Barbora are interconnected through an [InfiniBand][a] HDR 200 Gbps network and a Gigabit Ethernet network.
Compute nodes and the service infrastructure are connected by the HDR100 technology,
which allows one 200 Gbps HDR port (an aggregation of 4x 50 Gbps) to be divided into two HDR100 ports with 100 Gbps (2x 50 Gbps) bandwidth each.
The cabling between the L1 and L2 layers is realized by HDR cabling,
while the end devices are connected by so-called Y (splitter) cables (1x HDR200 to 2x HDR100).
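As a quick sanity check of the lane arithmetic above (a toy calculation; the lane counts and speeds are the figures from the text):

```python
# An HDR200 port aggregates 4 lanes at 50 Gbps each; a Y-cable splits it
# into two HDR100 ports of 2 lanes (2x 50 Gbps) each.
LANE_GBPS = 50

hdr200_bw = 4 * LANE_GBPS   # 200 Gbps aggregate port
hdr100_bw = 2 * LANE_GBPS   # 100 Gbps per split port

print(f"HDR200 port: {hdr200_bw} Gbps")
print(f"split into 2x HDR100: {hdr100_bw} Gbps each")

# the split is lossless: two HDR100 ports consume the full HDR200 bandwidth
assert 2 * hdr100_bw == hdr200_bw
```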
![](img/hdr.jpg)
**The computing network implemented in this way meets the following parameters:**
* 100Gbps
* Latencies less than 10 microseconds (0.6 μs end-to-end, <90ns switch hop)
* Adaptive routing support
* MPI communication support
* IP protocol support (IPoIB)
* Support for SCRATCH Data Storage and NVMe over Fabric Data Storage.
## Mellanox QM8700 40-Ports Switch
**Performance**
* 40x HDR 200 Gb/s ports in a 1U switch
* 80x HDR100 100 Gb/s ports in a 1U switch
* 16 Tb/s aggregate switch throughput
* Up to 15.8 billion messages per second
* 90ns switch latency
**Optimized Design**
* 1+1 redundant & hot-swappable power
* 80 Plus Gold and Energy Star certified power supplies
* Dual-core x86 CPU
**Advanced Design**
* Adaptive routing
* Collective offloads (Mellanox SHARP technology)
* VL mapping (VL2VL)
![](img/QM8700.jpg)
## BullSequana XH2000 HDRx WH40 MODULE
* Mellanox QM8700 switch modified for direct liquid cooling (Atos Cold Plate), with form factor for installing the Bull Sequana XH2000 rack
![](img/XH2000.png)
[a]: http://en.wikipedia.org/wiki/InfiniBand
# Storage
There are three main shared file systems on the Barbora cluster: [HOME][1], [SCRATCH][2], and [PROJECT][5]. All login and compute nodes may access the same data on the shared file systems. Compute nodes are also equipped with local (non-shared) scratch, RAM disk, and tmp file systems.
## Archiving
Do not use the shared filesystems as a backup for large amounts of data or as a long-term archiving solution. Academic staff and students of research institutions in the Czech Republic can use the [CESNET storage service][3], which is available via SSHFS.
## Shared Filesystems
The Barbora cluster provides three main shared filesystems: the [HOME filesystem][1], the [SCRATCH filesystem][2], and the [PROJECT filesystem][5].
All filesystems are accessible via the InfiniBand network.
The HOME and PROJECT filesystems are realized as NFS filesystems.
The SCRATCH filesystem is realized as a parallel Lustre filesystem.
Extended ACLs are provided on the Lustre filesystem for sharing data with other users using fine-grained control.
### Understanding the Lustre Filesystems
A user file on the [Lustre filesystem][a] can be divided into multiple chunks (stripes) and stored across a subset of the object storage targets (OSTs) (disks). The stripes are distributed among the OSTs in a round-robin fashion to ensure load balancing.
When a client (a compute node from your job) needs to create or access a file, the client queries the metadata server (MDS) and the metadata target (MDT) for the layout and location of the [file's stripes][b]. Once the file is opened and the client obtains the striping information, the MDS is no longer involved in the file I/O process. The client interacts directly with the object storage servers (OSSes) and OSTs to perform I/O operations such as locking, disk allocation, storage, and retrieval.
If multiple clients try to read and write the same part of a file at the same time, the Lustre distributed lock manager enforces coherency, so that all clients see consistent results.
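The round-robin placement described above can be illustrated with a toy model. This is a sketch only, not Lustre's actual allocator; the function name and the OST list are hypothetical:

```python
def ost_for_offset(offset, stripe_size, stripe_count, ost_ids):
    """Map a byte offset within a file to the OST holding it, assuming
    stripes are laid out round-robin over stripe_count OSTs."""
    stripe_index = offset // stripe_size          # which stripe the byte falls in
    return ost_ids[stripe_index % stripe_count]   # round-robin over the chosen OSTs

stripe_size = 1 << 20            # 1 MB, the Barbora default
assert stripe_size % 65536 == 0  # stripe size must be a multiple of 65,536 bytes

osts = [0, 1, 2, 3, 4]           # e.g. the 5 SCRATCH OSTs
# the first five 1 MB stripes land on OSTs 0..4; the 6th wraps back to OST 0
print([ost_for_offset(i * stripe_size, stripe_size, 5, osts) for i in range(6)])
# [0, 1, 2, 3, 4, 0]
```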
There is a default stripe configuration for Barbora Lustre filesystems. However, users can set the following stripe parameters for their own directories or files to get optimum I/O performance:
1. `stripe_size` - the size of the chunk in bytes; specify with k, m, or g to use units of KB, MB, or GB, respectively; the size must be an even multiple of 65,536 bytes; the default is 1MB for all Barbora Lustre filesystems
1. `stripe_count` - the number of OSTs to stripe across; the default is 1 for all Barbora Lustre filesystems; one can specify -1 to use all OSTs in the filesystem
1. `stripe_offset` - the index of the OST where the first stripe is to be placed; the default is -1, which results in random selection; using a non-default value is NOT recommended
!!! note
    Setting stripe size and stripe count correctly for your needs may significantly affect the I/O performance.
Use the `lfs getstripe` command for getting the stripe parameters. Use `lfs setstripe` for setting the stripe parameters to get optimal I/O performance. The correct stripe setting depends on your needs and file access patterns.
```console
$ lfs getstripe dir|filename
$ lfs setstripe -s stripe_size -c stripe_count -o stripe_offset dir|filename
```
Example:
```console
$ lfs getstripe /scratch/projname
$ lfs setstripe -c -1 /scratch/projname
$ lfs getstripe /scratch/projname
```
In this example, we view the current stripe setting of the `/scratch/projname/` directory. The stripe count is changed to use all OSTs and verified. All files written to this directory will be striped over all 5 OSTs.
Use `lfs check osts` to see the number and status of active OSTs for each filesystem on Barbora. Learn more by reading the man page:
```console
$ lfs check osts
$ man lfs
```
### Hints on Lustre Striping
!!! note
    Increase the `stripe_count` for parallel I/O to the same file.
When multiple processes are writing blocks of data to the same file in parallel, the I/O performance for large files will improve when the `stripe_count` is set to a larger value. The stripe count sets the number of OSTs to which the file will be written. By default, the stripe count is set to 1. While this default setting provides for efficient access of metadata (for example to support the `ls -l` command), large files should use stripe counts of greater than 1. This will increase the aggregate I/O bandwidth by using multiple OSTs in parallel instead of just one. A rule of thumb is to use a stripe count approximately equal to the number of gigabytes in the file.
Another good practice is to make the stripe count be an integral factor of the number of processes performing the write in parallel, so that you achieve load balance among the OSTs. For example, set the stripe count to 16 instead of 15 when you have 64 processes performing the writes.
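The two rules of thumb above can be combined into a small helper. This is an illustrative sketch; the function name and rounding policy are our own, not an official recommendation:

```python
def suggest_stripe_count(file_size_gb, n_writers, max_osts=5):
    """Suggest a stripe count: roughly one stripe per GB of file, capped by
    the number of OSTs and writers, then rounded down to an integral factor
    of n_writers so the writers spread evenly over the OSTs."""
    count = max(1, min(int(file_size_gb), max_osts, n_writers))
    while n_writers % count != 0:   # walk down to the nearest integral factor
        count -= 1
    return count

# 64 processes writing a 20 GB file on a 5-OST filesystem:
print(suggest_stripe_count(20, 64))   # -> 4 (5 does not divide 64)
print(suggest_stripe_count(20, 60))   # -> 5
```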
!!! note
    Using a large stripe size can improve performance when accessing very large files.
Large stripe size allows each client to have exclusive access to its own part of a file. However, it can be counterproductive in some cases if it does not match your I/O pattern. The choice of stripe size has no effect on a single-stripe file.
Read more [here][c].
### Lustre on Barbora
The architecture of Lustre on Barbora is composed of two metadata servers (MDS) and two data/object storage servers (OSS).
Configuration of the SCRATCH storage:
* 2x Metadata server
* 2x Object storage server
* Lustre object storage
    * One disk array NetApp E2800
    * 54x 8TB 10kRPM 2.5" SAS HDDs
    * 5x RAID6 (8+2) OSTs (object storage targets)
    * 4 hot-spare disks
* Lustre metadata storage
    * One disk array NetApp E2600
    * 12x 300GB 15kRPM SAS disks
    * 2 groups of 5 disks in RAID5 (metadata targets)
    * 2 hot-spare disks
### HOME File System
The HOME filesystem is mounted in the /home directory. Users' home directories /home/username reside on this filesystem. Accessible capacity is 28TB, shared among all users. Individual users are restricted by filesystem usage quotas, set to 25GB per user. Should 25GB prove insufficient, contact [support][d]; the quota may be lifted upon request.
!!! note
    The HOME filesystem is intended for preparation, evaluation, processing and storage of data generated by active Projects.
    The HOME filesystem should not be used to archive data of past Projects or other unrelated data.
The files on HOME filesystem will not be deleted until the end of the [user's lifecycle][4].
The filesystem is backed up, so that it can be restored in case of a catastrophic failure resulting in significant data loss. However, this backup is not intended to restore old versions of user data or to restore (accidentally) deleted files.
| HOME filesystem | |
| -------------------- | --------------- |
| Mountpoint           | /home/username  |
| Capacity | 28TB |
| Throughput | 1GB/s |
| User space quota | 25GB |
| User inodes quota | 500K |
| Protocol | NFS |
### SCRATCH File System
The SCRATCH filesystem is realized as a Lustre parallel file system and is available from all login and compute nodes. There are 5 OSTs dedicated to the SCRATCH file system.
The SCRATCH filesystem is mounted in the `/scratch/project/PROJECT_ID` directory, created automatically with the `PROJECT_ID` project. Accessible capacity is 310TB, shared among all users. Individual users are restricted by filesystem usage quotas, set to 10TB per user. The purpose of this quota is to prevent runaway programs from filling the entire filesystem, denying service to other users. Should 10TB prove insufficient, contact [support][d]; the quota may be lifted upon request.
!!! note
    The SCRATCH filesystem is intended for temporary scratch data generated during the calculation as well as for high-performance access to input and output files. All I/O intensive jobs must use the SCRATCH filesystem as their working directory.
    Users are advised to save the necessary data from the SCRATCH filesystem to the HOME filesystem after the calculations and clean up the scratch files.
!!! warning
    Files on the SCRATCH filesystem that are **not accessed for more than 90 days** will be automatically **deleted**.
The default stripe size is 1MB and the default stripe count is 1.
!!! note
    Setting stripe size and stripe count correctly for your needs may significantly affect the I/O performance.
| SCRATCH filesystem | |
| -------------------- | --------- |
| Mountpoint | /scratch |
| Capacity | 310TB |
| Throughput | 5GB/s |
| Throughput [Burst] | 38GB/s |
| User space quota | 10TB |
| User inodes quota | 10M |
| Default stripe size | 1MB |
| Default stripe count | 1 |
| Number of OSTs | 5 |
### PROJECT File System
The PROJECT data storage is the central storage for project and user data at IT4Innovations, accessible from all clusters.
For more information, see the [PROJECT storage][6] section.
### Disk Usage and Quota Commands
Disk usage and user quotas can be checked and reviewed using the `it4ifsusage` command. You can see an example output [here][9].
To better understand where the space is used, you can use the following command:
```console
$ du -hs dir
```
Example for your HOME directory:
```console
$ cd /home
$ du -hs * .[a-zA-Z0-9]* | grep -E "[0-9]*G|[0-9]*M" | sort -hr
258M cuda-samples
15M .cache
13M .mozilla
5,5M .eclipse
2,7M .idb_13.0_linux_intel64_app
```
This will list all directories consuming megabytes or gigabytes of space in your current directory (HOME in this example). The list is sorted in descending order from largest to smallest files/directories.
### Extended ACLs
Extended ACLs provide an additional security mechanism besides the standard POSIX ACL, which is defined by three entries (for owner/group/others). Extended ACLs have more than the three basic entries. In addition, they also contain a mask entry and may contain any number of named user and named group entries.
ACLs on a Lustre file system work exactly like ACLs on any Linux file system. They are manipulated with the standard tools in the standard manner.
For more information, see the [Access Control List][7] section of the documentation.
## Local Filesystems
### TMP
Each node is equipped with a local /tmp RAMDISK directory. The /tmp directory should be used to work with temporary files. Old files in the /tmp directory are automatically purged.
### SCRATCH and RAMDISK
Each node is equipped with RAMDISK storage accessible at /tmp, /lscratch, and /ramdisk. The RAMDISK capacity is 180GB. Data placed on the RAMDISK occupies the node's RAM (192GB total). The RAMDISK directory should only be used to work with temporary files where very high throughput or I/O performance is required. Old files in the RAMDISK directory are automatically purged at the job's end.
#### Global RAM Disk
The Global RAM disk spans the local RAM disks of all the allocated nodes within a single job.
For more information, see the [Job Features][8] section.
## Summary
| Mountpoint | Usage | Protocol | Net Capacity | Throughput | Limitations | Access | Services |
| ---------- | ------------------------- | -------- | -------------- | ------------------------------ | ----------- | ----------------------- | ------------------------------- |
| /home | home directory | NFS | 28TB | 1GB/s | Quota 25GB | Compute and login nodes | backed up |
| /scratch   | scratch temporary         | Lustre   | 310TB          | 5GB/s, 30GB/s burst buffer     | Quota 10TB  | Compute and login nodes | files older than 90 days removed automatically |
| /lscratch | local scratch ramdisk | tmpfs | 180GB | 130GB/s | none | Node local | auto purged after job end |
[1]: #home-file-system
[2]: #scratch-file-system
[3]: ../storage/cesnet-storage.md
[4]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[5]: #project-file-system
[6]: ../storage/project-storage.md
[7]: ../storage/standard-file-acl.md
[8]: ../job-features.md#global-ram-disk
[9]: ../storage/project-storage.md#project-quotas
[a]: http://www.nas.nasa.gov
[b]: http://www.nas.nasa.gov/hecc/support/kb/Lustre_Basics_224.html#striping
[c]: http://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace
[d]: https://support.it4i.cz/rt
[e]: http://man7.org/linux/man-pages/man1/nfs4_setfacl.1.html
# Visualization Servers
Remote visualization with [VirtualGL][3] is available on two nodes.
* 2 nodes
* 32 cores in total
* 2x Intel Skylake Gold 6130, 16-core@2.1 GHz processors per node
* 192 GB DDR4 2667 MT/s of physical memory per node (12x 16 GB)
* BullSequana X450-E5 blade servers
* 2150.4 GFLOP/s per compute node
* 1x 1 Gb Ethernet and 2x 10 Gb Ethernet
* 1x HDR100 IB port
* 2x SSD 240 GB
![](img/bullsequanaX450-E5.png)
## NVIDIA Quadro P6000
* GPU Memory: 24 GB GDDR5X
* Memory Interface: 384-bit
* Memory Bandwidth: Up to 432 GB/s
* NVIDIA CUDA® Cores: 3840
* System Interface: PCI Express 3.0 x16
* Max Power Consumption: 250 W
* Thermal Solution: Active
* Form Factor: 4.4”H x 10.5” L, Dual Slot, Full Height
* Display Connectors: 4x DP 1.4 + DVI-D DL
* Max Simultaneous Displays: 4 direct, 4 DP1.4 Multi-Stream
* Max DP 1.4 Resolution: 7680 x 4320 @ 30 Hz
* Max DVI-D DL Resolution: 2560 x 1600 @ 60 Hz
* Graphics APIs: Shader Model 5.1, OpenGL 4.5, DirectX 12.0, Vulkan 1.0
* Compute APIs: CUDA, DirectCompute, OpenCL™
* Floating-Point Performance-Single Precision: 12.6 TFLOP/s, Peak
![](img/quadrop6000.jpg)
## Resource Allocation Policy
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
|-------|----------------|-------------------|-------|-----------|----------|---------------|----------|
| qviz Visualization queue | yes | none required | 2 | 4 | 150 | no | 1h/8h |
## References
* [Graphical User Interface][1]
* [VPN Access][2]
[1]: ../general/shell-and-data-access.md#graphical-user-interface
[2]: ../general/shell-and-data-access.md#vpn-access
[3]: ../software/viz/vgl.md
# e-INFRA CZ Cloud Ostrava
The Ostrava cloud consists of 22 nodes from the [Karolina][a] supercomputer.
The cloud site is built on top of OpenStack,
a free and open-standard cloud computing platform.
## Access
To access the cloud, you must:
* have an [e-Infra CZ account][3],
* be a member of an [active project][b].
The dashboard is available at [https://ostrava.openstack.cloud.e-infra.cz/][6].
You can specify resources/quotas for your project.
For more information, see the [Quota Limits][5] section.
## Creating First Instance
To create your first VM instance, follow the [e-INFRA CZ guide][4].
Note that the guide is similar for clouds in Brno and Ostrava,
so make sure that you follow steps for Ostrava cloud where applicable.
### Process Automation
You can automate the process using Terraform or the OpenStack client.
#### Terraform
Prerequisites:
* Linux/Mac/WSL terminal BASH shell
* installed Terraform and sshuttle
* downloaded [application credentials][9] from OpenStack Horizon dashboard and saved as a `project_openrc.sh.inc` text file
Follow the guide: [https://code.it4i.cz/terraform][8]
#### OpenStack
Prerequisites:
* Linux/Mac/WSL terminal BASH shell
* installed [OpenStack client][7]
Follow the guide: [https://code.it4i.cz/commandline][10]
Run commands:
```console
source project_openrc.sh.inc
```
```console
./cmdline-demo.sh basic-infrastructure-1
```
## Technical Reference
For the list of deployed OpenStack services, see the [list of components][1].
More information can be found on the [e-INFRA CZ website][2].
[1]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-site/openstack-components/
[2]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-site/
[3]: https://docs.e-infra.cz/account/
[4]: https://docs.e-infra.cz/compute/openstack/getting-started/creating-first-infrastructure/
[5]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-g2-site/quota-limits/
[6]: https://ostrava.openstack.cloud.e-infra.cz/
[7]: https://docs.fuga.cloud/how-to-use-the-openstack-cli-tools-on-linux
[8]: https://code.it4i.cz/dvo0012/infrastructure-by-script/-/tree/main/openstack-infrastructure-as-code-automation/clouds/g2/ostrava/general/terraform
[9]: https://docs.e-infra.cz/compute/openstack/how-to-guides/obtaining-api-key/
[10]: https://code.it4i.cz/dvo0012/infrastructure-by-script/-/tree/main/openstack-infrastructure-as-code-automation/clouds/g2/ostrava/general/commandline
[a]: ../karolina/introduction.md
[b]: ../general/access/project-access.md
# IT4I Cloud
The IT4I cloud consists of 14 nodes from the [Karolina][a] supercomputer.
The cloud site is built on top of OpenStack,
a free and open-standard cloud computing platform.
!!! Note
    The guide describes steps for personal projects.<br>
    Some steps may differ for large projects.<br>
    For large projects, apply for resources to the [Allocation Committee][11].
## Access
To access the cloud you must be a member of an active EUROHPC project,
or fall into the **Access Category B**, i.e. [Access For Thematic HPC Resource Utilisation][11].
A personal OpenStack project is required. Request one by contacting [IT4I Support][12].
The dashboard is available at [https://cloud.it4i.cz][6].
You can see quotas set for the IT4I Cloud in the [Quota Limits][f] section.
## Creating First Instance
To create your first VM instance, follow the steps below:
### Log In
Go to [https://cloud.it4i.cz][6], enter your LDAP username and password and choose the `IT4I_LDAP` domain. After you sign in, you will be redirected to the dashboard.
![](../img/login.png)
### Create Key Pair
An SSH key is required for remote access to your instance.
1. Go to **Project > Compute > Key Pairs** and click the **Create Key Pair** button.
![](../img/keypairs.png)
1. In the Create Key Pair window, name your key pair, select `SSH Key` for key type and confirm by clicking Create Key Pair.
![](../img/keypairs1.png)
1. Download and manage the private key according to your operating system.
### Update Security Group
To be able to remotely access your VM instance, you have to allow access in the security group.
1. Go to **Project > Network > Security Groups** and click on **Manage Rules** for the default security group.
![](../img/securityg.png)
1. Click on **Add Rule**, choose **SSH**, and leave the remaining fields unchanged.
![](../img/securityg1.png)
### Create VM Instance
1. In **Compute > Instances**, click **Launch Instance**.
![](../img/instance.png)
1. Choose Instance Name, Description, and number of instances. Click **Next**.
![](../img/instance1.png)
1. Choose an image from which to boot the instance. Choose to delete the volume after instance delete. Click **Next**.
![](../img/instance2.png)
1. Choose the hardware resources of the instance by selecting a flavor. Additional volumes for data can be attached later on. Click **Next**.
![](../img/instance3.png)
1. Select the network and continue to **Security Groups**.
![](../img/instance4.png)
1. Allocate the security group with SSH rule that you added in the [Update Security Group](it4i-cloud.md#update-security-group) step. Then click **Next** to go to the **Key Pair**.
![](../img/securityg2.png)
1. Select the key that you created in the [Create Key Pair][g] section and launch the instance.
![](../img/instance5.png)
### Associate Floating IP
1. Click on the **Associate** button next to the floating IP.
![](../img/floatingip.png)
1. Select Port to be associated with the instance, then click the **Associate** button.
Now you can join the VM using your preferred SSH client.
## Process Automation
You can automate the process using the OpenStack client.
### OpenStack
Prerequisites:
* Linux/Mac/WSL terminal BASH shell
* installed [OpenStack client][7]
Follow the guide: [https://code.it4i.cz/commandline][10]
Run commands:
```console
source project_openrc.sh.inc
```
```console
./cmdline-demo.sh basic-infrastructure-1
```
[1]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-site/openstack-components/
[2]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-site/
[3]: https://docs.e-infra.cz/account/
[4]: https://docs.e-infra.cz/compute/openstack/getting-started/creating-first-infrastructure/
[5]: https://docs.e-infra.cz/compute/openstack/technical-reference/ostrava-g2-site/quota-limits/
[6]: https://cloud.it4i.cz
[7]: https://docs.fuga.cloud/how-to-use-the-openstack-cli-tools-on-linux
[8]: https://code.it4i.cz/dvo0012/infrastructure-by-script/-/tree/main/openstack-infrastructure-as-code-automation/clouds/g2/ostrava/general/terraform
[9]: https://docs.e-infra.cz/compute/openstack/how-to-guides/obtaining-api-key/
[10]: https://code.it4i.cz/dvo0012/infrastructure-by-script/-/tree/main/openstack-infrastructure-as-code-automation/clouds/g2/ostrava/general/commandline
[11]: https://www.it4i.cz/en/for-users/computing-resources-allocation
[12]: mailto:support@it4i.cz
[a]: ../karolina/introduction.md
[b]: ../general/access/project-access.md
[c]: einfracz-cloud.md
[d]: ../general/accessing-the-clusters/vpn-access.md
[e]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[f]: it4i-quotas.md
[g]: it4i-cloud.md#create-key-pair
# IT4I Cloud Quotas
| Resource | Quota |
|---------------------------------------|-------|
| Instances | 10 |
| VCPUs | 20 |
| RAM | 32GB |
| Volumes | 20 |
| Volume Snapshots | 12 |
| Volume Storage | 500 |
| Floating-IPs | 1 |
| Security Groups | 10 |
| Security Group Rules | 100 |
| Networks | 1 |
| Ports | 10 |
| Routers | 1 |
| Backups | 12 |
| Groups | 10 |
| rbac_policies | 10 |
| Subnets | 1 |
| Subnet_pools | -1 |
| Fixed-ips | -1 |
| Injected-file-size | 10240 |
| Injected-path-size | 255 |
| Injected-files | 5 |
| Key-pairs | 100 |
| Properties | 128 |
| Server-groups | 10 |
| Server-group-members | 10 |
| Backup-gigabytes | 1002 |
| Per-volume-gigabytes | -1 |
```yaml
host: irods.it4i.cz
port: 1247
proxy_user: some_user
client_user: some_user
zone: IT4I
authscheme: "pam"
ssl_ca_cert_file: "~/.irods/chain_geant_ov_rsa_ca_4_full.pem"
ssl_encryption_key_size: 32
ssl_encryption_algorithm: "AES-256-CBC"
ssl_encryption_salt_size: 8
ssl_encryption_hash_rounds: 16
path_mappings:
  - irods_path: /IT4I/home/some_user
    mapping_path: /
    resource_type: dir
```
# Accessing Complementary Systems
Complementary systems can be accessed at `login.cs.it4i.cz`
by any user with an active account assigned to an active project.
**SSH is required** to access Complementary systems.
## Data Storage
### Home
The `/home` file system is shared across all Complementary systems. Note that this file system is **not** shared with the file system on IT4I clusters.
### Scratch
There are local `/lscratch` storages on individual nodes.
### PROJECT
Complementary systems are connected to the [PROJECT storage][1].
[1]: ../storage/project-storage.md
# Using AMD Partition
To test your application on the AMD partition,
prepare a job script for that partition or use an interactive job:
```console
salloc -N 1 -c 64 -A PROJECT-ID -p p03-amd --gres=gpu:4 --time=08:00:00
```
where:
- `-N 1` means allocating one server,
- `-c 64` means allocating 64 cores,
- `-A` is your project,
- `-p p03-amd` is the AMD partition,
- `--gres=gpu:4` means allocating all 4 GPUs of the node,
- `--time=08:00:00` means an allocation for 8 hours.
You also have the option to allocate only a subset of the resources,
by reducing `-c` and `--gres=gpu` to smaller values.
```console
salloc -N 1 -c 48 -A PROJECT-ID -p p03-amd --gres=gpu:3 --time=08:00:00
salloc -N 1 -c 32 -A PROJECT-ID -p p03-amd --gres=gpu:2 --time=08:00:00
salloc -N 1 -c 16 -A PROJECT-ID -p p03-amd --gres=gpu:1 --time=08:00:00
```
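The allocations above keep CPU cores proportional to GPUs (16 cores per GPU on a 64-core node with 4 GPUs). A minimal sketch making the ratio explicit (the helper name is ours; the node geometry is taken from the text):

```python
NODE_CORES = 64   # cores per p03-amd node
NODE_GPUS = 4     # GPUs per p03-amd node

def cores_for_gpus(n_gpus):
    """Cores to request so the CPU and GPU shares of the node stay proportional."""
    return n_gpus * (NODE_CORES // NODE_GPUS)

for g in range(1, NODE_GPUS + 1):
    print(f"salloc -N 1 -c {cores_for_gpus(g)} --gres=gpu:{g}")
```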
!!! Note
    The p03-amd01 server has hyperthreading **enabled**, therefore `htop` shows 128 cores.<br>
    The p03-amd02 server has hyperthreading **disabled**, therefore `htop` shows 64 cores.
## Using AMD MI100 GPUs
The AMD GPUs can be programmed using the [ROCm open-source platform](https://docs.amd.com/).
ROCm and related libraries are installed directly in the system.
You can find them here:
```console
/opt/rocm/
```
The installed version can be found here:
```console
[user@p03-amd02.cs]$ cat /opt/rocm/.info/version
5.5.1-74
```
## Basic HIP Code
The first way to program AMD GPUs is to use HIP.
The basic vector addition code in HIP looks as follows.
This is a complete code and you can copy and paste it into a file.
For this example, we use `vector_add.hip.cpp`.
```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

__global__ void add_vectors(float * x, float * y, float alpha, int count)
{
    long long idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx < count)
        y[idx] += alpha * x[idx];
}

int main()
{
    // number of elements in the vectors
    long long count = 10;

    // allocation and initialization of data on the host (CPU memory)
    float * h_x = new float[count];
    float * h_y = new float[count];
    for(long long i = 0; i < count; i++)
    {
        h_x[i] = i;
        h_y[i] = 10 * i;
    }

    // print the input data
    printf("X:");
    for(long long i = 0; i < count; i++)
        printf(" %7.2f", h_x[i]);
    printf("\n");
    printf("Y:");
    for(long long i = 0; i < count; i++)
        printf(" %7.2f", h_y[i]);
    printf("\n");

    // allocation of memory on the GPU device
    float * d_x;
    float * d_y;
    hipMalloc(&d_x, count * sizeof(float));
    hipMalloc(&d_y, count * sizeof(float));

    // copy the data from host memory to the device
    hipMemcpy(d_x, h_x, count * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y, count * sizeof(float), hipMemcpyHostToDevice);

    int tpb = 256;
    int bpg = (count - 1) / tpb + 1;

    // launch the kernel on the GPU
    add_vectors<<< bpg, tpb >>>(d_x, d_y, 100, count);
    // hipLaunchKernelGGL(add_vectors, bpg, tpb, 0, 0, d_x, d_y, 100, count);

    // copy the result back to CPU memory
    hipMemcpy(h_y, d_y, count * sizeof(float), hipMemcpyDeviceToHost);

    // print the results
    printf("Y:");
    for(long long i = 0; i < count; i++)
        printf(" %7.2f", h_y[i]);
    printf("\n");

    // free the allocated memory
    hipFree(d_x);
    hipFree(d_y);
    delete[] h_x;
    delete[] h_y;

    return 0;
}
```
To compile the code, we use the `hipcc` compiler.
For compiler information, use `hipcc --version`:
```console
[user@p03-amd02.cs ~]$ hipcc --version
HIP version: 5.5.30202-eaf00c0b
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.5.1 23194 69ef12a7c3cc5b0ccf820bc007bd87e8b3ac3037)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.5.1/llvm/bin
```
The code is compiled as follows:
```console
hipcc vector_add.hip.cpp -o vector_add.x
```
The correct output of the code is:
```console
[user@p03-amd02.cs ~]$ ./vector_add.x
X: 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00
Y: 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00
Y: 0.00 110.00 220.00 330.00 440.00 550.00 660.00 770.00 880.00 990.00
```
More details on HIP programming are available in the [HIP Programming Guide](https://docs.amd.com/bundle/HIP-Programming-Guide-v5.5/page/Introduction_to_HIP_Programming_Guide.html).
## HIP and ROCm Libraries
The list of official AMD libraries can be found [here](https://docs.amd.com/category/libraries).
The libraries are installed in the same directory as ROCm:
```console
/opt/rocm/
```
The following libraries are installed:
```console
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipblas
drwxr-xr-x 3 root root 17 Jun 7 14:09 hipblas-clients
drwxr-xr-x 3 root root 29 Jun 7 14:09 hipcub
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipfft
drwxr-xr-x 3 root root 25 Jun 7 14:09 hipfort
drwxr-xr-x 4 root root 32 Jun 7 14:09 hiprand
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipsolver
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipsparse
```
and
```console
drwxr-xr-x 4 root root 32 Jun 7 14:09 rocalution
drwxr-xr-x 4 root root 44 Jun 7 14:09 rocblas
drwxr-xr-x 4 root root 44 Jun 7 14:09 rocfft
drwxr-xr-x 4 root root 32 Jun 7 14:09 rocprim
drwxr-xr-x 4 root root 32 Jun 7 14:09 rocrand
drwxr-xr-x 4 root root 44 Jun 7 14:09 rocsolver
drwxr-xr-x 4 root root 44 Jun 7 14:09 rocsparse
drwxr-xr-x 3 root root 29 Jun 7 14:09 rocthrust
```
## Using hipBLAS Library
The basic code in HIP that uses hipBLAS looks as follows.
This is a complete code and you can copy and paste it into a file.
For this example, we use `hipblas.hip.cpp`.
```console
#include <cstdio>
#include <vector>
#include <cstdlib>
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>
int main()
{
srand(9600);
int width = 10;
int height = 7;
int elem_count = width * height;
// initialization of data in CPU memory
float * h_A;
hipHostMalloc(&h_A, elem_count * sizeof(*h_A));
for(int i = 0; i < elem_count; i++)
h_A[i] = (100.0f * rand()) / (float)RAND_MAX;
printf("Matrix A:\n");
for(int r = 0; r < height; r++)
{
for(int c = 0; c < width; c++)
printf("%6.3f ", h_A[r + height * c]);
printf("\n");
}
float * h_x;
hipHostMalloc(&h_x, width * sizeof(*h_x));
for(int i = 0; i < width; i++)
h_x[i] = (100.0f * rand()) / (float)RAND_MAX;
printf("vector x:\n");
for(int i = 0; i < width; i++)
printf("%6.3f ", h_x[i]);
printf("\n");
float * h_y;
hipHostMalloc(&h_y, height * sizeof(*h_y));
for(int i = 0; i < height; i++)
        h_y[i] = 100.0f + i;
printf("vector y:\n");
for(int i = 0; i < height; i++)
        printf("%6.3f ", h_y[i]);
printf("\n");
// initialization of data in GPU memory
float * d_A;
size_t pitch_A;
hipMallocPitch((void**)&d_A, &pitch_A, height * sizeof(*d_A), width);
hipMemcpy2D(d_A, pitch_A, h_A, height * sizeof(*d_A), height * sizeof(*d_A), width, hipMemcpyHostToDevice);
int lda = pitch_A / sizeof(float);
float * d_x;
hipMalloc(&d_x, width * sizeof(*d_x));
hipMemcpy(d_x, h_x, width * sizeof(*d_x), hipMemcpyHostToDevice);
float * d_y;
hipMalloc(&d_y, height * sizeof(*d_y));
hipMemcpy(d_y, h_y, height * sizeof(*d_y), hipMemcpyHostToDevice);
// basic calculation of the result on the CPU
float alpha=2.0f, beta=10.0f;
for(int i = 0; i < height; i++)
h_y[i] *= beta;
for(int r = 0; r < height; r++)
for(int c = 0; c < width; c++)
h_y[r] += alpha * h_x[c] * h_A[r + height * c];
printf("result y CPU:\n");
for(int i = 0; i < height; i++)
printf("%6.3f ", h_y[i]);
printf("\n");
// calculation of the result on the GPU using the hipBLAS library
hipblasHandle_t blas_handle;
hipblasCreate(&blas_handle);
hipblasSgemv(blas_handle, HIPBLAS_OP_N, height, width, &alpha, d_A, lda, d_x, 1, &beta, d_y, 1);
hipDeviceSynchronize();
hipblasDestroy(blas_handle);
// copy the GPU result to CPU memory and print it
hipMemcpy(h_y, d_y, height * sizeof(*d_y), hipMemcpyDeviceToHost);
printf("result y BLAS:\n");
for(int i = 0; i < height; i++)
printf("%6.3f ", h_y[i]);
printf("\n");
// free all the allocated memory
hipFree(d_A);
hipFree(d_x);
hipFree(d_y);
hipHostFree(h_A);
hipHostFree(h_x);
hipHostFree(h_y);
return 0;
}
```
The code compilation can be done as follows:
```console
hipcc hipblas.hip.cpp -o hipblas.x -lhipblas
```
## Using HipSolver Library
A basic HIP code that uses hipSOLVER looks like this.
It is a complete program that you can copy and paste into a file;
in this example, we use `hipsolver.hip.cpp`.
```cpp
#include <cstdio>
#include <vector>
#include <cstdlib>
#include <algorithm>
#include <hipsolver/hipsolver.h>
#include <hipblas/hipblas.h>
int main()
{
srand(63456);
int size = 10;
// allocation and initialization of data on host. this time we use std::vector
int h_A_ld = size;
int h_A_pitch = h_A_ld * sizeof(float);
std::vector<float> h_A(size * h_A_ld);
for(int r = 0; r < size; r++)
for(int c = 0; c < size; c++)
h_A[r * h_A_ld + c] = (10.0 * rand()) / RAND_MAX;
printf("System matrix A:\n");
for(int r = 0; r < size; r++)
{
for(int c = 0; c < size; c++)
printf("%6.3f ", h_A[r * h_A_ld + c]);
printf("\n");
}
std::vector<float> h_b(size);
for(int i = 0; i < size; i++)
h_b[i] = (10.0 * rand()) / RAND_MAX;
printf("RHS vector b:\n");
for(int i = 0; i < size; i++)
printf("%6.3f ", h_b[i]);
printf("\n");
std::vector<float> h_x(size);
// memory allocation on the device and initialization
float * d_A;
size_t d_A_pitch;
    hipMallocPitch((void**)&d_A, &d_A_pitch, size * sizeof(float), size);
int d_A_ld = d_A_pitch / sizeof(float);
float * d_b;
hipMalloc(&d_b, size * sizeof(float));
float * d_x;
hipMalloc(&d_x, size * sizeof(float));
int * d_piv;
hipMalloc(&d_piv, size * sizeof(int));
int * info;
hipMallocManaged(&info, sizeof(int));
hipMemcpy2D(d_A, d_A_pitch, h_A.data(), h_A_pitch, size * sizeof(float), size, hipMemcpyHostToDevice);
hipMemcpy(d_b, h_b.data(), size * sizeof(float), hipMemcpyHostToDevice);
// solving the system using hipSOLVER
hipsolverHandle_t solverHandle;
hipsolverCreate(&solverHandle);
int wss_trf, wss_trs; // wss = WorkSpace Size
hipsolverSgetrf_bufferSize(solverHandle, size, size, d_A, d_A_ld, &wss_trf);
hipsolverSgetrs_bufferSize(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, d_A_ld, d_piv, d_b, size, &wss_trs);
float * workspace;
int wss = std::max(wss_trf, wss_trs);
hipMalloc(&workspace, wss * sizeof(float));
hipsolverSgetrf(solverHandle, size, size, d_A, d_A_ld, workspace, wss, d_piv, info);
hipsolverSgetrs(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, d_A_ld, d_piv, d_b, size, workspace, wss, info);
hipMemcpy(d_x, d_b, size * sizeof(float), hipMemcpyDeviceToDevice);
hipMemcpy(h_x.data(), d_x, size * sizeof(float), hipMemcpyDeviceToHost);
printf("Solution vector x:\n");
for(int i = 0; i < size; i++)
printf("%6.3f ", h_x[i]);
printf("\n");
hipFree(workspace);
hipsolverDestroy(solverHandle);
// perform matrix-vector multiplication A*x using hipBLAS to check if the solution is correct
hipblasHandle_t blasHandle;
hipblasCreate(&blasHandle);
float alpha = 1;
float beta = 0;
hipMemcpy2D(d_A, d_A_pitch, h_A.data(), h_A_pitch, size * sizeof(float), size, hipMemcpyHostToDevice);
hipblasSgemv(blasHandle, HIPBLAS_OP_N, size, size, &alpha, d_A, d_A_ld, d_x, 1, &beta, d_b, 1);
hipDeviceSynchronize();
hipblasDestroy(blasHandle);
for(int i = 0; i < size; i++)
h_b[i] = 0;
hipMemcpy(h_b.data(), d_b, size * sizeof(float), hipMemcpyDeviceToHost);
printf("Check multiplication vector Ax:\n");
for(int i = 0; i < size; i++)
printf("%6.3f ", h_b[i]);
printf("\n");
// free all the allocated memory
hipFree(info);
hipFree(d_piv);
hipFree(d_x);
hipFree(d_b);
hipFree(d_A);
return 0;
}
```
The code compilation can be done as follows:
```console
hipcc hipsolver.hip.cpp -o hipsolver.x -lhipblas -lhipsolver
```
## Using OpenMP Offload to Program AMD GPUs
The ROCm™ installation includes an LLVM-based implementation that fully supports the OpenMP 4.5 standard
and a subset of the OpenMP 5.0 standard.
Fortran, C/C++ compilers, and corresponding runtime libraries are included.
The OpenMP toolchain is automatically installed as part of the standard ROCm installation
and is available under `/opt/rocm/llvm`. The sub-directories are:
- `bin` : Compilers (flang and clang) and other binaries.
- `examples` : The usage section below shows how to compile and run these programs.
- `include` : Header files.
- `lib` : Libraries including those required for target offload.
- `lib-debug` : Debug versions of the above libraries.
More information can be found in the [AMD OpenMP Support Guide](https://docs.amd.com/bundle/OpenMP-Support-Guide-v5.5/page/Introduction_to_OpenMP_Support_Guide.html).
## Compilation of OpenMP Code
A basic example that uses OpenMP offload follows.
Again, the code is complete and can be copied and pasted into a file.
Here we use `vadd.cpp`.
```cpp
#include <cstdio>
#include <cstdlib>
int main(int argc, char ** argv)
{
long long count = 1 << 20;
if(argc > 1)
count = atoll(argv[1]);
long long print_count = 16;
if(argc > 2)
print_count = atoll(argv[2]);
long long * a = new long long[count];
long long * b = new long long[count];
long long * c = new long long[count];
#pragma omp parallel for
for(long long i = 0; i < count; i++)
{
a[i] = i;
b[i] = 10 * i;
}
printf("A: ");
for(long long i = 0; i < print_count; i++)
printf("%3lld ", a[i]);
printf("\n");
printf("B: ");
for(long long i = 0; i < print_count; i++)
printf("%3lld ", b[i]);
printf("\n");
#pragma omp target map(to: a[0:count],b[0:count]) map(from: c[0:count])
#pragma omp teams distribute parallel for
for(long long i = 0; i < count; i++)
{
c[i] = a[i] + b[i];
}
printf("C: ");
for(long long i = 0; i < print_count; i++)
printf("%3lld ", c[i]);
printf("\n");
delete[] a;
delete[] b;
delete[] c;
return 0;
}
```
This code can be compiled like this:
```console
/opt/rocm/llvm/bin/clang++ -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 vadd.cpp -o vadd.x
```
These options are required for target offload from an OpenMP program:
- `-target x86_64-pc-linux-gnu`
- `-fopenmp`
- `-fopenmp-targets=amdgcn-amd-amdhsa`
- `-Xopenmp-target=amdgcn-amd-amdhsa`
The `-march` flag specifies the architecture of the targeted GPU.
You need to change it when moving, for instance, to LUMI with its MI250X GPUs.
The MI100 GPUs present in the complementary systems have the code `gfx908`:
- `-march=gfx908`

!!! note
    You also have to include one of the `-O0`, `-O1`, `-O2`, or `-O3` optimization flags.
    Without an optimization flag, the execution of the compiled code fails.
# Using ARM Partition
For testing your application on the ARM partition,
you need to prepare a job script for that partition or use the interactive job:
```
salloc -A PROJECT-ID -p p01-arm
```
On the partition, you should reload the list of modules:
```
ml architecture/aarch64
```
For compilation, the `GCC` compilers and `OpenMPI` are available.
Hence, the compilation process should be the same as on the `x64` architecture.
Let's have the following `hello world` example:
```cpp
#include <cstdio>
#include "mpi.h"
#include "omp.h"
int main(int argc, char **argv)
{
int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#pragma omp parallel
{
printf("Hello on rank %d, thread %d\n", rank, omp_get_thread_num());
}
MPI_Finalize();
}
```
You can compile and run the example:
```
ml OpenMPI/4.1.4-GCC-11.3.0
mpic++ -fopenmp hello.cpp -o hello
mpirun -n 4 ./hello
```
Please see [gcc options](https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html) for more advanced compilation settings.
No complications are expected as long as the application does not use any intrinsic for `x64` architecture.
If you want to use intrinsic,
[SVE](https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics) instruction set is available.
# Using NVIDIA Grace Partition
For testing your application on the NVIDIA Grace Partition,
you need to prepare a job script for that partition or use the interactive job:
```console
salloc -N 1 -c 144 -A PROJECT-ID -p p11-grace --time=08:00:00
```
where:
- `-N 1` allocates a single node,
- `-c 144` allocates 144 cores,
- `-p p11-grace` selects the NVIDIA Grace partition,
- `--time=08:00:00` allocates the resources for 8 hours.
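The job script alternative mentioned above can be sketched as follows (the account and application name are placeholders):

```bash
#!/bin/bash
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p11-grace
#SBATCH --nodes=1
#SBATCH --cpus-per-task=144
#SBATCH --time=08:00:00

ml NVHPC            # load one of the available toolchains
./my_application
```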
## Available Toolchains
The platform offers three toolchains:
- Standard GCC (as a module `ml GCC`)
- [NVHPC](https://developer.nvidia.com/hpc-sdk) (as a module `ml NVHPC`)
- [Clang for NVIDIA Grace](https://developer.nvidia.com/grace/clang) (installed in `/opt/nvidia/clang`)
!!! note
    The NVHPC toolchain showed strong results with a minimal amount of tuning in our initial evaluation.
### GCC Toolchain
The GCC compiler seems to struggle with the vectorization of short (constant-length) loops, which tend to get completely unrolled/eliminated instead of being vectorized. For example, a simple nested loop such as
```cpp
for(int i = 0; i < 1000000; ++i) {
// Iterations dependent in "i"
// ...
for(int j = 0; j < 8; ++j) {
// but independent in "j"
// ...
}
}
```
may emit scalar code for the inner loop leading to no vectorization being used at all.
### Clang (For Grace) Toolchain
Clang/LLVM tends to behave similarly, but it can be guided to properly vectorize the inner loop with either the flags `-O3 -ffast-math -march=native -fno-unroll-loops -mllvm -force-vector-width=8`, or pragmas such as `#pragma clang loop vectorize_width(8)` and `#pragma clang loop unroll(disable)`:
```cpp
for(int i = 0; i < 1000000; ++i) {
// Iterations dependent in "i"
// ...
#pragma clang loop unroll(disable) vectorize_width(8)
for(int j = 0; j < 8; ++j) {
// but independent in "j"
// ...
}
}
```
!!! note
    Our basic experiments show that fixed-width vectorization (NEON) tends to perform better than SVE in the case of short (register-length) loops. In cases like the one above, where the specified `vectorize_width` is larger than the available vector unit width, Clang will emit multiple NEON instructions (e.g., 4 instructions will be emitted to process 8 64-bit operations in the 128-bit units of Grace).
### NVHPC Toolchain
The NVHPC toolchain handled the aforementioned case without any additional tuning. A simple `-O3 -march=native -fast` should therefore be sufficient.
## Basic Math Libraries
The basic libraries (BLAS and LAPACK) are included in the NVHPC toolchain and can be linked simply with `-lblas` and `-llapack`, respectively (`lp64` and `ilp64` versions are also included).
!!! note
    The Grace platform doesn't include a CUDA-capable GPU, therefore `nvcc` will fail with an error. This means that `nvc`, `nvc++`, and `nvfortran` should be used instead.
### NVIDIA Performance Libraries
The [NVPL](https://developer.nvidia.com/nvpl) package includes a more extensive set of libraries in both sequential and multi-threaded versions:
- BLACS: `-lnvpl_blacs_{lp64,ilp64}_{mpich,openmpi3,openmpi4,openmpi5}`
- BLAS: `-lnvpl_blas_{lp64,ilp64}_{seq,gomp}`
- FFTW: `-lnvpl_fftw`
- LAPACK: `-lnvpl_lapack_{lp64,ilp64}_{seq,gomp}`
- ScaLAPACK: `-lnvpl_scalapack_{lp64,ilp64}`
- RAND: `-lnvpl_rand` or `-lnvpl_rand_mt`
- SPARSE: `-lnvpl_sparse`
This package should be compatible with all available toolchains and includes CMake module files for easy integration into CMake-based projects. For further documentation, see [NVPL](https://docs.nvidia.com/nvpl).
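Since the package ships CMake module files, a CMake-based project might integrate it along these lines (a sketch; the component and imported target names here are assumptions that mirror the library naming scheme above and should be checked against the NVPL documentation):

```cmake
cmake_minimum_required(VERSION 3.20)
project(myprog C)

# Hypothetical NVPL usage; "blas" component and "nvpl::blas_lp64_gomp"
# target names follow the nvpl_blas_{lp64,ilp64}_{seq,gomp} scheme above.
find_package(nvpl REQUIRED COMPONENTS blas)

add_executable(myprog myprog.c)
target_link_libraries(myprog PRIVATE nvpl::blas_lp64_gomp)
```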
### Recommended BLAS Library
We recommend using the multi-threaded BLAS library from the NVPL package.
!!! note
    It is important to pin the processes using **OMP_PROC_BIND=spread**.
Example:
```console
$ ml NVHPC
$ nvc -O3 -march=native myprog.c -o myprog -lnvpl_blas_lp64_gomp
$ OMP_PROC_BIND=spread ./myprog
```
## Basic Communication Libraries
The OpenMPI 4 implementation is included with the NVHPC toolchain and is exposed as a module (`ml OpenMPI`). The following example
```cpp
#include <cstdio>
#include <mpi.h>
#include <sched.h>
#include <omp.h>
int main(int argc, char **argv)
{
int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#pragma omp parallel
{
printf("Hello on rank %d, thread %d on CPU %d\n", rank, omp_get_thread_num(), sched_getcpu());
}
MPI_Finalize();
}
```
can be compiled and run as follows
```console
ml OpenMPI
mpic++ -fast -fopenmp hello.cpp -o hello
OMP_PROC_BIND=close OMP_NUM_THREADS=4 mpirun -np 4 --map-by slot:pe=36 ./hello
```
In this configuration, we run 4 ranks, each bound to one quarter of the cores (36 cores per rank) and running 4 OpenMP threads.
## Simple BLAS Application
The `hello world` example application (written in `C++` and `Fortran`) uses a simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS 3 routine).
Stationary probability vector estimation in `C++`:
```cpp
#include <iostream>
#include <vector>
#include <chrono>
#include "cblas.h"
const size_t ITERATIONS = 32;
const size_t MATRIX_SIZE = 1024;
int main(int argc, char *argv[])
{
const size_t matrixElements = MATRIX_SIZE*MATRIX_SIZE;
std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));
for(size_t i = 0; i < MATRIX_SIZE; ++i)
a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
a[0] = 0.5f;
std::vector<float> w1(matrixElements, 0.0f);
std::vector<float> w2(matrixElements, 0.0f);
std::copy(a.begin(), a.end(), w1.begin());
std::vector<float> *t1, *t2;
t1 = &w1;
t2 = &w2;
auto c1 = std::chrono::steady_clock::now();
for(size_t i = 0; i < ITERATIONS; ++i)
{
std::fill(t2->begin(), t2->end(), 0.0f);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
1.0f, t1->data(), MATRIX_SIZE,
a.data(), MATRIX_SIZE,
1.0f, t2->data(), MATRIX_SIZE);
std::swap(t1, t2);
}
auto c2 = std::chrono::steady_clock::now();
for(size_t i = 0; i < MATRIX_SIZE; ++i)
{
std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";
}
std::cout << std::endl;
std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;
return 0;
}
```
Stationary probability vector estimation in `Fortran`:
```fortran
program main
implicit none
integer :: matrix_size, iterations
integer :: i
real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
real, pointer :: out_data(:), out_diag(:)
integer :: cr, cm, c1, c2
iterations = 32
matrix_size = 1024
call system_clock(count_rate=cr)
call system_clock(count_max=cm)
allocate(a(matrix_size, matrix_size))
allocate(w1(matrix_size, matrix_size))
allocate(w2(matrix_size, matrix_size))
a(:,:) = 1.0 / real(matrix_size)
a(:,1) = 0.5 / real(matrix_size - 1)
a(1,1) = 0.5
w1 = a
w2(:,:) = 0.0
t1 => w1
t2 => w2
call system_clock(c1)
    do i = 1, iterations
t2(:,:) = 0.0
call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)
tmp => t1
t1 => t2
t2 => tmp
end do
call system_clock(c2)
out_data(1:size(t1)) => t1
out_diag => out_data(1::matrix_size+1)
print *, out_diag
print *, "Elapsed Time: ", (c2 - c1) / real(cr)
deallocate(a)
deallocate(w1)
deallocate(w2)
end program main
```
### Using NVHPC Toolchain
The C++ version of the example can be compiled with NVHPC and run as follows:
```console
ml NVHPC
nvc++ -O3 -march=native -fast -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lblas main.cpp -o main
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main
```
The Fortran version is just as simple:
```console
ml NVHPC
nvfortran -O3 -march=native -fast -lblas main.f90 -o main.x
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./main.x
```
!!! note
    It may be advantageous to use the NVPL libraries instead of the NVHPC ones. For example, the DGEMM BLAS 3 routine from NVPL is almost 30% faster than the NVHPC one.
### Using Clang (For Grace) Toolchain
Similarly, the Clang for Grace toolchain with NVPL BLAS can be used to compile the C++ version of the example.
```console
ml NVHPC
/opt/nvidia/clang/17.23.11/bin/clang++ -O3 -march=native -ffast-math -I$NVHPC/Linux_aarch64/$EBVERSIONNVHPC/compilers/include/lp64 -lnvpl_blas_lp64_gomp main.cpp -o main
```
!!! note
    The NVHPC module is used just for the `cblas.h` include in this case. This can be avoided by changing the code to use `nvpl_blas.h` instead.
## Additional Resources
- [https://www.nvidia.com/en-us/data-center/grace-cpu-superchip/][1]
- [https://developer.nvidia.com/hpc-sdk][2]
- [https://developer.nvidia.com/grace/clang][3]
- [https://docs.nvidia.com/nvpl][4]
[1]: https://www.nvidia.com/en-us/data-center/grace-cpu-superchip/
[2]: https://developer.nvidia.com/hpc-sdk
[3]: https://developer.nvidia.com/grace/clang
[4]: https://docs.nvidia.com/nvpl
# Heterogeneous Memory Management on Intel Platforms
Partition `p10-intel` offers heterogeneous memory directly exposed to the user, which allows manually picking the appropriate kind of memory at process or even single-allocation granularity. Both kinds of memory are exposed as memory-only NUMA nodes, enabling both coarse-grained (process-level) and fine-grained (allocation-level) control over the memory type used.
## Overview
At the process level, the `numactl` facilities can be utilized, while the Intel-provided `memkind` library allows for finer control. Both the `memkind` library and `numactl` can be accessed by loading the `memkind` module, or the `OpenMPI` module (`numactl` only).
```bash
ml memkind
```
### Process Level (NUMACTL)
`numactl` allows you to either restrict the memory pool of the process to a specific set of NUMA memory nodes
```bash
numactl --membind <node_ids_set>
```
or select a single preferred node
```bash
numactl --preferred <node_id>
```
where `<node_ids_set>` is a comma-separated list (e.g., `0,2,5,...`), possibly in combination with ranges (such as `0-5`). The `--membind` option kills the process if it requests more memory than can be satisfied from the specified nodes. The `--preferred` option instead reverts to using other nodes according to their NUMA distance in the same situation.
A convenient way to check the `numactl` configuration is
```bash
numactl -s
```
which prints the configuration of its execution environment, e.g.:
```bash
numactl --membind 8-15 numactl -s
policy: bind
preferred node: 0
physcpubind: 0 1 2 ... 189 190 191
cpubind: 0 1 2 3 4 5 6 7
nodebind: 0 1 2 3 4 5 6 7
membind: 8 9 10 11 12 13 14 15
```
The last row shows that memory allocations are restricted to NUMA nodes `8-15`.
### Allocation Level (MEMKIND)
The `memkind` library (in its simplest use case) offers a new variant of the `malloc/free` function pair, which allows specifying the kind of memory to be used for a given allocation. Moving a specific allocation from the default to the HBM memory pool can then be achieved by replacing:
```cpp
void *pData = malloc(<SIZE>);
/* ... */
free(pData);
```
with
```cpp
#include <memkind.h>
void *pData = memkind_malloc(MEMKIND_HBW, <SIZE>);
/* ... */
memkind_free(NULL, pData); // "kind" parameter is deduced from the address
```
Similarly, other memory kinds (such as `MEMKIND_DEFAULT`, `MEMKIND_REGULAR`, `MEMKIND_HBW`, or `MEMKIND_HBW_ALL`) can be chosen.
!!! note
    The allocation will return a `NULL` pointer when memory of the specified kind is not available.
## High Bandwidth Memory (HBM)
Intel Sapphire Rapids (partition `p10-intel`) consists of two sockets each with `128GB` of DDR and `64GB` on-package HBM memory. The machine is configured in FLAT mode and therefore exposes HBM memory as memory-only NUMA nodes (`16GB` per 12-core tile). The configuration can be verified by running
```bash
numactl -H
```
which should show 16 NUMA nodes (`0-7` should contain 12 cores and `32GB` of DDR DRAM, while `8-15` should have no cores and `16GB` of HBM each).
![](../../img/cs/guides/p10_numa_sc4_flat.png)
### Process Level
With this we can easily restrict application to DDR DRAM or HBM memory:
```bash
# Only DDR DRAM
numactl --membind 0-7 ./stream
# ...
Function Best Rate MB/s Avg time Min time Max time
Copy: 369745.8 0.043355 0.043273 0.043588
Scale: 366989.8 0.043869 0.043598 0.045355
Add: 378054.0 0.063652 0.063483 0.063899
Triad: 377852.5 0.063621 0.063517 0.063884
# Only HBM
numactl --membind 8-15 ./stream
# ...
Function Best Rate MB/s Avg time Min time Max time
Copy: 1128430.1 0.015214 0.014179 0.015615
Scale: 1045065.2 0.015814 0.015310 0.016309
Add: 1096992.2 0.022619 0.021878 0.024182
Triad: 1065152.4 0.023449 0.022532 0.024559
```
The DDR DRAM achieves a bandwidth of around 400 GB/s, while the HBM clears the 1 TB/s bar.
Some further improvements can be achieved by entirely isolating a process to a single tile. This can be useful for MPI jobs, where `$OMPI_COMM_WORLD_RANK` can be used to bind each process individually. A simple wrapper script to do this may look like
```bash
#!/bin/bash
numactl --membind $((8 + $OMPI_COMM_WORLD_RANK)) $@
```
and can be used as
```bash
mpirun -np 8 --map-by slot:pe=12 membind_wrapper.sh ./stream_mpi
```
(8 tiles with 12 cores each). However, this approach assumes that the `16GB` of HBM memory local to the tile is sufficient for each process (memory cannot spill between tiles). This approach may be significantly more useful in combination with `--preferred` instead of `--membind`, to force a preference for local HBM with spill-over to DDR DRAM. Otherwise
```bash
mpirun -n 8 --map-by slot:pe=12 numactl --membind 8-15 ./stream_mpi
```
is most likely preferable even for MPI workloads. Applying the above approach to the MPI stream benchmark with 8 ranks and 1-24 threads per rank, we can expect these results:
![](../../img/cs/guides/p10_stream_dram.png)
![](../../img/cs/guides/p10_stream_hbm.png)
### Allocation Level
Allocation-level memory kind selection using the `memkind` library can be illustrated with a modified stream benchmark. The benchmark uses three working arrays (A, B, and C), whose allocation can be changed to `memkind_malloc` as follows:
```cpp
#include <memkind.h>
// ...
STREAM_TYPE *a = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *b = (STREAM_TYPE *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *c = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
// ...
memkind_free(NULL, a);
memkind_free(NULL, b);
memkind_free(NULL, c);
```
Arrays A and C are allocated from HBM (`MEMKIND_HBW_ALL`), while DDR DRAM (`MEMKIND_REGULAR`) is used for B.
The code then has to be linked with the `memkind` library:
```bash
gcc -march=native -O3 -fopenmp -lmemkind memkind_stream.c -o memkind_stream
```
and can be run as
```bash
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
OMP_NUM_THREADS=$((N*12)) OMP_PROC_BIND=spread ./memkind_stream
```
While the `memkind` library should be able to detect HBM memory on its own (through `HMAT` and `hwloc`), this is not supported on `p10-intel`. This means that the NUMA nodes representing HBM have to be specified manually using the `MEMKIND_HBW_NODES` environment variable.
![](../../img/cs/guides/p10_stream_memkind.png)
With this setup, we can see that the simple copy operation (`C[i] = A[i]`) achieves bandwidth comparable to the application bound entirely to HBM memory. On the other hand, the scale operation (`B[i] = s*C[i]`) is mostly limited by the DDR DRAM bandwidth. It's also worth noting that operations combining all three arrays perform close to the HBM-only configuration.
## Simple Application
One application that can greatly benefit from the availability of a large slower memory and a smaller faster memory is the computation of a histogram with many bins over a large dataset.
```cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <cmath>
#include <cstring>
#include <omp.h>
#include <memkind.h>
const size_t N_DATA_SIZE = 2 * 1024 * 1024 * 1024ull;
const size_t N_BINS_COUNT = 1 * 1024 * 1024ull;
const size_t N_ITERS = 10;
#if defined(HBM)
#define DATA_MEMKIND MEMKIND_REGULAR
#define BINS_MEMKIND MEMKIND_HBW_ALL
#else
#define DATA_MEMKIND MEMKIND_REGULAR
#define BINS_MEMKIND MEMKIND_REGULAR
#endif
int main(int argc, char *argv[])
{
const double binWidth = 1.0 / double(N_BINS_COUNT + 1);
double *pData = (double *)memkind_malloc(DATA_MEMKIND, N_DATA_SIZE * sizeof(double));
    size_t *pBins = (size_t *)memkind_malloc(BINS_MEMKIND, N_BINS_COUNT * omp_get_max_threads() * sizeof(size_t));
#pragma omp parallel
{
drand48_data state;
srand48_r(omp_get_thread_num(), &state);
#pragma omp for
for(size_t i = 0; i < N_DATA_SIZE; ++i)
drand48_r(&state, &pData[i]);
}
auto c1 = std::chrono::steady_clock::now();
for(size_t it = 0; it < N_ITERS; ++it)
{
#pragma omp parallel
{
for(size_t i = 0; i < N_BINS_COUNT; ++i)
pBins[omp_get_thread_num()*N_BINS_COUNT + i] = size_t(0);
#pragma omp for
for(size_t i = 0; i < N_DATA_SIZE; ++i)
{
const size_t idx = size_t(pData[i] / binWidth) % N_BINS_COUNT;
pBins[omp_get_thread_num()*N_BINS_COUNT + idx]++;
}
}
}
auto c2 = std::chrono::steady_clock::now();
#pragma omp parallel for
for(size_t i = 0; i < N_BINS_COUNT; ++i)
{
for(size_t j = 1; j < omp_get_max_threads(); ++j)
pBins[i] += pBins[j*N_BINS_COUNT + i];
}
std::cout << "Elapsed Time [s]: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;
size_t total = 0;
#pragma omp parallel for reduction(+:total)
for(size_t i = 0; i < N_BINS_COUNT; ++i)
total += pBins[i];
std::cout << "Total Items: " << total << std::endl;
memkind_free(NULL, pData);
memkind_free(NULL, pBins);
return 0;
}
```
### Using HBM Memory (P10-Intel)
The following commands can be used to compile and run the example application above:
```bash
ml GCC memkind
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
g++ -O3 -fopenmp -lmemkind histogram.cpp -o histogram_dram
g++ -O3 -fopenmp -lmemkind -DHBM histogram.cpp -o histogram_hbm
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_dram
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_hbm
```
Moving the histogram bin data into HBM memory should speed up the algorithm more than twice. It should be noted that also moving the `pData` array into HBM memory worsens this result (presumably because, with the data split, the algorithm can saturate both memory interfaces).
## Additional Resources
- [https://linux.die.net/man/8/numactl][1]
- [http://memkind.github.io/memkind/man_pages/memkind.html][2]
- [https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory][3]
[1]: https://linux.die.net/man/8/numactl
[2]: http://memkind.github.io/memkind/man_pages/memkind.html
[3]: https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory
# Using VMware Horizon
VMware Horizon is a virtual desktop infrastructure (VDI) solution
that enables users to access virtual desktops and applications from any device and any location.
It provides a comprehensive end-to-end solution for managing and delivering virtual desktops and applications,
including features such as session management, user authentication, and virtual desktop provisioning.
![](../../img/horizon.png)
## How to Access VMware Horizon
!!! important
Access to VMware Horizon requires IT4I VPN.
1. Contact [IT4I support][a] with a request for access and VM allocation.
1. [Download][1] and install the VMware Horizon Client for Windows.
1. Add a new server `https://vdi-cs01.msad.it4i.cz/` in the Horizon client.
1. Connect to the server using your IT4I username and password.
Username is in the `domain\username` format and the domain is `msad.it4i.cz`.
For example: `msad.it4i.cz\user123`
## Example
Below is an example of how to mount a remote folder and check the connection on Windows OS:
### Prerequisites
3D applications
* [Blender][3]
SSHFS for remote access
* [sshfs-win][4]
* [winfsp][5]
* [sshfs-win-manager][6]
* ssh keys for access to clusters
### Steps
1. Start the VPN and connect to the server via VMware Horizon Client.
![](../../img/vmware.png)
1. Mount a remote folder.
* Run sshfs-win-manager.
![](../../img/sshfs.png)
* Add a new connection.
![](../../img/sshfs1.png)
* Click on **Connect**.
![](../../img/sshfs2.png)
1. Check that the folder is mounted.
![](../../img/mount.png)
1. Check the GPU resources.
![](../../img/gpu.png)
### Blender
Now if you run, for example, Blender, you can check the available GPU resources in Blender Preferences.
![](../../img/blender.png)
[a]: mailto:support@it4i.cz
[1]: https://vdi-cs01.msad.it4i.cz/
[2]: https://www.paraview.org/download/
[3]: https://www.blender.org/download/
[4]: https://github.com/winfsp/sshfs-win/releases
[5]: https://github.com/winfsp/winfsp/releases/
[6]: https://github.com/evsar3/sshfs-win-manager/releases
# Using IBM Power Partition
For testing your application on the IBM Power partition,
you need to prepare a job script for that partition or use the interactive job:
```console
salloc -N 1 -c 192 -A PROJECT-ID -p p07-power --time=08:00:00
```
where:
- `-N 1` allocates a single node,
- `-c 192` allocates 192 cores (threads),
- `-p p07-power` selects the IBM Power partition,
- `--time=08:00:00` allocates the resources for 8 hours.
On the partition, you should reload the list of modules:
```
ml architecture/ppc64le
```
The platform offers both GNU-based and proprietary IBM toolchains for building applications. IBM also provides an optimized BLAS routines library ([ESSL](https://www.ibm.com/docs/en/essl/6.1)), which can be used with both toolchains.
## Building Applications
Our sample application depends on `BLAS`, therefore we start by loading the following modules (regardless of which toolchain we want to use):
```
ml GCC OpenBLAS
```
### GCC Toolchain
In the case of the GCC toolchain, we can go ahead and compile the application using either `g++`
```
g++ -lopenblas hello.cpp -o hello
```
or `gfortran`
```
gfortran -lopenblas hello.f90 -o hello
```
as usual.
### IBM Toolchain
The IBM toolchain requires additional environment setup, as it is installed in `/opt/ibm` and is not exposed as a module:
```
IBM_ROOT=/opt/ibm
OPENXLC_ROOT=$IBM_ROOT/openxlC/17.1.1
OPENXLF_ROOT=$IBM_ROOT/openxlf/17.1.1
export PATH=$OPENXLC_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENXLC_ROOT/lib:$LD_LIBRARY_PATH
export PATH=$OPENXLF_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENXLF_ROOT/lib:$LD_LIBRARY_PATH
```
From there, we can use either `ibm-clang++`
```
ibm-clang++ -lopenblas hello.cpp -o hello
```
or `xlf`
```
xlf -lopenblas hello.f90 -o hello
```
to build the application as usual.
!!! note
    The combination of `xlf` and `openblas` seems to cause severe performance degradation. Therefore, the `ESSL` library should be preferred (see below).
### Using ESSL Library
The [ESSL](https://www.ibm.com/docs/en/essl/6.1) library is installed in `/opt/ibm/math/essl/7.1`, so we define additional environment variables:
```
IBM_ROOT=/opt/ibm
ESSL_ROOT=${IBM_ROOT}/math/essl/7.1
export LD_LIBRARY_PATH=$ESSL_ROOT/lib64:$LD_LIBRARY_PATH
```
The simplest way to utilize `ESSL` in an application that already uses `BLAS` or `CBLAS` routines is to link against the provided `libessl.so`. This can be done by replacing `-lopenblas` with `-lessl`, or with `-lessl -lopenblas` (in case `ESSL` does not provide all of the required `BLAS` routines).
In practice, this can look like:
```
g++ -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.cpp -o hello
```
or
```
gfortran -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.f90 -o hello
```
and similarly for the IBM compilers (`ibm-clang++` and `xlf`).
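For example, the same `ESSL` link line with the IBM compilers might look like this (a sketch based on the flags above):
```
ibm-clang++ -L${ESSL_ROOT}/lib64 -lessl hello.cpp -o hello
xlf -L${ESSL_ROOT}/lib64 -lessl hello.f90 -o hello
```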
## Hello World Applications
The `hello world` example application (written in `C++` and `Fortran`) uses a simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS level 3 routine).
Stationary probability vector estimation in `C++`:
```c++
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>

#include "cblas.h"

const size_t ITERATIONS  = 32;
const size_t MATRIX_SIZE = 1024;

int main(int argc, char *argv[])
{
    const size_t matrixElements = MATRIX_SIZE * MATRIX_SIZE;

    // Row-stochastic transition matrix: uniform rows, except the first row,
    // which keeps probability 0.5 on the first state.
    std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));
    for(size_t i = 0; i < MATRIX_SIZE; ++i)
        a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
    a[0] = 0.5f;

    std::vector<float> w1(matrixElements, 0.0f);
    std::vector<float> w2(matrixElements, 0.0f);
    std::copy(a.begin(), a.end(), w1.begin());

    std::vector<float> *t1, *t2;
    t1 = &w1;
    t2 = &w2;

    auto c1 = std::chrono::steady_clock::now();

    // Power iteration: t2 = t1 * a, computing successive powers of the matrix.
    for(size_t i = 0; i < ITERATIONS; ++i)
    {
        std::fill(t2->begin(), t2->end(), 0.0f);

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
                    1.0f, t1->data(), MATRIX_SIZE,
                    a.data(), MATRIX_SIZE,
                    1.0f, t2->data(), MATRIX_SIZE);

        std::swap(t1, t2);
    }

    auto c2 = std::chrono::steady_clock::now();

    // Print the diagonal of the resulting matrix and the elapsed time.
    for(size_t i = 0; i < MATRIX_SIZE; ++i)
    {
        std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";
    }
    std::cout << std::endl;

    std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    return 0;
}
```
Stationary probability vector estimation in `Fortran`:
```fortran
program main
    implicit none

    integer :: matrix_size, iterations
    integer :: i
    real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
    real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
    real, pointer :: out_data(:), out_diag(:)
    integer :: cr, cm, c1, c2

    iterations  = 32
    matrix_size = 1024

    call system_clock(count_rate=cr)
    call system_clock(count_max=cm)

    allocate(a(matrix_size, matrix_size))
    allocate(w1(matrix_size, matrix_size))
    allocate(w2(matrix_size, matrix_size))

    ! Column-stochastic transition matrix: uniform columns, except the first,
    ! which keeps probability 0.5 on the first state.
    a(:,:) = 1.0 / real(matrix_size)
    a(:,1) = 0.5 / real(matrix_size - 1)
    a(1,1) = 0.5

    w1 = a
    w2(:,:) = 0.0

    t1 => w1
    t2 => w2

    call system_clock(c1)

    ! Power iteration: t2 = t1 * a, computing successive powers of the matrix
    ! (1 to iterations, matching the ITERATIONS multiplications in the C++ version).
    do i = 1, iterations
        t2(:,:) = 0.0
        call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)

        tmp => t1
        t1  => t2
        t2  => tmp
    end do

    call system_clock(c2)

    ! Print the diagonal of the resulting matrix and the elapsed time.
    out_data(1:size(t1)) => t1
    out_diag => out_data(1::matrix_size+1)

    print *, out_diag
    print *, "Elapsed Time: ", (c2 - c1) / real(cr)

    deallocate(a)
    deallocate(w1)
    deallocate(w2)
end program main
```