From 62568cb5a2395589673008d3e350c3f4e6bdc217 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jakub=20Ber=C3=A1nek?= <berykubik@gmail.com>
Date: Wed, 5 Oct 2022 08:56:43 +0200
Subject: [PATCH 1/2] Add section about using TensorFlow with (multiple) GPUs

---
 .../software/machine-learning/introduction.md |   2 +-
 .../software/machine-learning/tensorflow.md   | 138 +++++++++++++++---
 2 files changed, 119 insertions(+), 21 deletions(-)

diff --git a/docs.it4i/software/machine-learning/introduction.md b/docs.it4i/software/machine-learning/introduction.md
index 3bb01ad05..df26dcf86 100644
--- a/docs.it4i/software/machine-learning/introduction.md
+++ b/docs.it4i/software/machine-learning/introduction.md
@@ -21,7 +21,7 @@ For more information, see the [official website][d] or [GitHub][e].
 
 TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications. For more information, see the [official website][a].
 
-For the list of available versions, see the [TensorFlow][1] section:
+For more information, see the [TensorFlow][1] section.
 
 ## Theano
 
diff --git a/docs.it4i/software/machine-learning/tensorflow.md b/docs.it4i/software/machine-learning/tensorflow.md
index 2059b8306..9fc9a1ab0 100644
--- a/docs.it4i/software/machine-learning/tensorflow.md
+++ b/docs.it4i/software/machine-learning/tensorflow.md
@@ -1,46 +1,144 @@
 # TensorFlow
 
-TensorFlow is an open-source software library for machine intelligence.
-For searching available modules type:
+TensorFlow (TF) is an open-source software library which can compile tensor operations to execute
+very quickly on both CPUs and GPUs. It is often used as a backend for machine learning libraries
+and models.
 
+We heavily recommend the usage of `Tensorflow 2.x`. TensorFlow 1 has been long deprecated and it
+will probably be difficult to make it run on GPUs on our clusters.
+
+## Installation
+For TensorFlow to work with GPUs, you have to use several libraries (CUDA, cuDNN, NCCL etc.)
+with versions that are compatible together.
+
+You can load the correct modules with the following command:
 ```console
-$ ml av Tensorflow
+$ ml TensorFlow
 ```
 
-<!---
+If you want to upgrade the TensorFlow version used in this package or install additional Python
+modules, you can simply create a virtual environment and install a different TensorFlow version
+inside it:
+```console
+$ python3 -m venv venv
+$ source venv/bin/activate
+(venv) $ python3 -m pip install -U setuptools wheel pip
+(venv) $ python3 -m pip install tensorflow
+```
+
+However, if you use a newer TensorFlow version than the one included in the `TensorFlow` module,
+you should make sure that it is still compatible with the CUDA version provided by the module.
+You can find the required `CUDA`/`cuDNN` versions for the latest TF
+[here](https://www.tensorflow.org/install/pip).
+
+## TensorFlow Example
+
+After loading TensorFlow, you can check its functionality by running the following Python script.
 
-## Salomon Modules
+```python
+import tensorflow as tf
 
-Salomon provides (besides other) these TensorFlow modules:
+a = tf.constant([1, 2, 3])
+b = tf.constant([2, 4, 6])
+c = a + b
+print(c.numpy())
+```
 
-**Tensorflow/1.1.0** (not recommended), module built with:
+## Using TensorFlow With GPUs
 
-* GCC/4.9.3
-* Python/3.6.1
+With TensorFlow, you can use either a single GPU or multiple GPUs in a single process, for example
+to train neural networks much faster.
 
-**Tensorflow/1.2.0-GCC-7.1.0-2.28** (default, recommended), module built with:
+Loading the `TensorFlow` module should ensure that the required GPU libraries (CUDA, cuDNN, NCCL) are loaded.
 
-* TensorFlow 1.2 with SIMD support. TensorFlow build taking advantage of the Salomon CPU architecture.
-* GCC/7.1.0-2.28
-* Python/3.6.1
-* protobuf/3.2.0-GCC-7.1.0-2.28-Python-3.6.1
+### Selecting GPUs
 
--->
+You can select how many and which (Nvidia) GPUs will be used by TensorFlow with the
+`CUDA_VISIBLE_DEVICES` environment variable.
+
+```console
+# Do not use any GPUs
+$ CUDA_VISIBLE_DEVICES=-1 python3 my_script.py
+# Use a single GPU with ID 0
+$ CUDA_VISIBLE_DEVICES=0 python3 my_script.py
+# Use multiple GPUs
+$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 my_script.py
+```
+
+By default, if you do not specify the environment variable, all available GPUs will be used by
+TensorFlow.
 
-## TensorFlow Application Example
+### Multi-GPU TensorFlow Example
 
-After loading one of the available TensorFlow modules, you can check the functionality by running the following Python script.
+This script uses `keras` and `TensorFlow` to train a simple neural network on the
+[MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset. It assumes that you have the
+`tensorflow` (2.x), `keras`, and `tensorflow_datasets` Python packages installed. The training
+is performed on multiple GPUs.
 
 ```python
+import tensorflow_datasets as tfds
 import tensorflow as tf
 
-c = tf.constant('Hello World!')
-sess = tf.Session()
-print(sess.run(c))
+datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
+
+mnist_train, mnist_test = datasets['train'], datasets['test']
+
+# Use NCCL reduction if NCCL is available; it should be the most efficient strategy
+strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce())
+
+# Different reduction strategy, use if NCCL causes errors
+# strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice())
+print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
+
+num_train_examples = info.splits['train'].num_examples
+num_test_examples = info.splits['test'].num_examples
+
+BUFFER_SIZE = 10000
+
+BATCH_SIZE_PER_REPLICA = 64
+BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
+
+def scale(image, label):
+  image = tf.cast(image, tf.float32)
+  image /= 255
+
+  return image, label
+
+
+train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
+eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
+
+# The following line makes sure that the model will run on multiple GPUs (if they are available)
+# Without `strategy.scope()`, the model would only be trained on a single GPU
+with strategy.scope():
+  model = tf.keras.Sequential([
+      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
+      tf.keras.layers.MaxPooling2D(),
+      tf.keras.layers.Flatten(),
+      tf.keras.layers.Dense(64, activation='relu'),
+      tf.keras.layers.Dense(10)
+  ])
+
+  model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+                optimizer=tf.keras.optimizers.Adam(),
+                metrics=['accuracy'])
+
+model.fit(train_dataset, epochs=100)
 ```
 
+!!! note
+    If using the `NCCL` strategy causes runtime errors, try to run your application with the
+    environment variable `TF_FORCE_GPU_ALLOW_GROWTH` set to `true`.
+
+!!! tip
+    For real-world multi-GPU training, it might be better to use a dedicated multi-GPU framework such
+    as [Horovod](https://github.com/horovod/horovod).
+
 <!---
 
+2022-10-14
+Add multi-GPU example script.
+
 2021-04-08
 It's necessary to load the correct NumPy module along with the Tensorflow one.
 
-- 
GitLab


From 6140f612939b9abc25bd169baed7c1f85a7b77d6 Mon Sep 17 00:00:00 2001
From: Jan Siwiec <jan.siwiec@vsb.cz>
Date: Tue, 18 Oct 2022 09:36:06 +0200
Subject: [PATCH 2/2] Update tensorflow.md

---
 docs.it4i/software/machine-learning/tensorflow.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs.it4i/software/machine-learning/tensorflow.md b/docs.it4i/software/machine-learning/tensorflow.md
index 9fc9a1ab0..8e7386269 100644
--- a/docs.it4i/software/machine-learning/tensorflow.md
+++ b/docs.it4i/software/machine-learning/tensorflow.md
@@ -4,14 +4,16 @@ TensorFlow (TF) is an open-source software library which can compile tensor oper
 very quickly on both CPUs and GPUs. It is often used as a backend for machine learning libraries
 and models.
 
-We heavily recommend the usage of `Tensorflow 2.x`. TensorFlow 1 has been long deprecated and it
+We heavily recommend the usage of `TensorFlow 2.x`. TensorFlow 1 has been long deprecated and it
 will probably be difficult to make it run on GPUs on our clusters.
 
 ## Installation
+
 For TensorFlow to work with GPUs, you have to use several libraries (CUDA, cuDNN, NCCL etc.)
 with versions that are compatible together.
 
 You can load the correct modules with the following command:
+
 ```console
 $ ml TensorFlow
 ```
@@ -19,6 +21,7 @@ $ ml TensorFlow
 If you want to upgrade the TensorFlow version used in this package or install additional Python
 modules, you can simply create a virtual environment and install a different TensorFlow version
 inside it:
+
 ```console
 $ python3 -m venv venv
 $ source venv/bin/activate
@@ -53,7 +56,7 @@ Using the available `TensorFlow` module should make sure that these modules will
 
 ### Selecting GPUs
 
-You can select how many and which (Nvidia) GPUs will be used by TensorFlow with the
+You can select how many and which (NVIDIA) GPUs will be used by TensorFlow with the
 `CUDA_VISIBLE_DEVICES` environment variable.
 
 ```console
-- 
GitLab