Merge branch 'tf-gpu' into 'master'

Add section about using TensorFlow with GPUs

See merge request !411

Commit 8ba253f3 authored 2 years ago by Jan Siwiec
--- a/docs.it4i/software/machine-learning/introduction.md
+++ b/docs.it4i/software/machine-learning/introduction.md
@@ -21,7 +21,7 @@ For more information, see the [official website][d] or [GitHub][e].
 TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications. For more information, see the [official website][a].
-For the list of available versions, see the [TensorFlow][1] section:
+For more information, see the [TensorFlow][1] section.
 ## Theano

--- a/docs.it4i/software/machine-learning/tensorflow.md
+++ b/docs.it4i/software/machine-learning/tensorflow.md
 # TensorFlow
-TensorFlow is an open-source software library for machine intelligence.
+TensorFlow (TF) is an open-source software library which can compile tensor operations to execute
-For searching available modules type:
+very quickly on both CPUs and GPUs. It is often used as a backend for machine learning libraries
+and models.
+We strongly recommend using `TensorFlow 2.x`. TensorFlow 1 has long been deprecated, and it
+will probably be difficult to run on GPUs on our clusters.
+## Installation
+For TensorFlow to work with GPUs, you have to use several libraries (CUDA, cuDNN, NCCL, etc.)
+in mutually compatible versions.
+You can load the correct modules with the following command:
 ```console
-$ ml av Tensorflow
+$ ml TensorFlow
 ```
-<!---
+If you want to upgrade the TensorFlow version used in this package or install additional Python
+modules, you can simply create a virtual environment and install a different TensorFlow version
+inside it:
+```console
+$ python3 -m venv venv
+$ source venv/bin/activate
+(venv) $ python3 -m pip install -U setuptools wheel pip
+(venv) $ python3 -m pip install tensorflow
+```
+However, if you use a newer TensorFlow version than the one included in the `TensorFlow` module,
+you should make sure that it is still compatible with the CUDA version provided by the module.
+You can find the required `CUDA`/`cuDNN` versions for the latest TF
+in the [TensorFlow pip installation guide](https://www.tensorflow.org/install/pip).
-## Salomon Modules
+## TensorFlow Example
-Salomon provides (besides other) these TensorFlow modules:
+After loading TensorFlow, you can check its functionality by running the following Python script.
-**Tensorflow/1.1.0** (not recommended), module built with:
+```python
+import tensorflow as tf
-* GCC/4.9.3
+a = tf.constant([1, 2, 3])
-* Python/3.6.1
+b = tf.constant([2, 4, 6])
+c = a + b
+print(c.numpy())
+```
-**Tensorflow/1.2.0-GCC-7.1.0-2.28** (default, recommended), module built with:
+## Using TensorFlow With GPUs
-* TensorFlow 1.2 with SIMD support. TensorFlow build taking advantage of the Salomon CPU architecture.
+With TensorFlow, you can use either a single GPU or multiple GPUs in a single process, for example to
-* GCC/7.1.0-2.28
+train neural networks much faster.
-* Python/3.6.1
-* protobuf/3.2.0-GCC-7.1.0-2.28-Python-3.6.1
-->
+Loading the available `TensorFlow` module should ensure that the required GPU libraries (CUDA, cuDNN, NCCL, etc.) are loaded correctly.
+### Selecting GPUs
+You can select how many and which (NVIDIA) GPUs will be used by TensorFlow with the
+`CUDA_VISIBLE_DEVICES` environment variable.
+```console
+# Do not use any GPUs
+$ CUDA_VISIBLE_DEVICES=-1 python3 my_script.py
+# Use a single GPU with ID 0
+$ CUDA_VISIBLE_DEVICES=0 python3 my_script.py
+# Use multiple GPUs
+$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 my_script.py
+```
-## TensorFlow Application Example
+By default, if you do not specify the environment variable, all available GPUs will be used by
+TensorFlow.
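
The selection rules above can be illustrated with a small pure-Python sketch. It is only a simplified model of how integer IDs in `CUDA_VISIBLE_DEVICES` are interpreted, not the actual CUDA implementation: the helper name `visible_gpu_ids` and the node with 8 GPUs are hypothetical, and GPU UUIDs or duplicate IDs are not handled.

```python
from typing import List, Optional


def visible_gpu_ids(env_value: Optional[str], total_gpus: int) -> List[int]:
    """Simplified model of how CUDA_VISIBLE_DEVICES restricts integer GPU IDs."""
    if env_value is None:
        # Variable not set: every GPU on the node is visible.
        return list(range(total_gpus))
    visible = []
    for token in env_value.split(","):
        token = token.strip()
        # Stop at the first entry that is not a valid ID (for example -1);
        # entries after an invalid one are ignored as well.
        if not token.isdigit() or int(token) >= total_gpus:
            break
        visible.append(int(token))
    return visible


# Hypothetical node with 8 GPUs
print(visible_gpu_ids(None, 8))       # variable not set
print(visible_gpu_ids("-1", 8))       # no GPUs
print(visible_gpu_ids("0", 8))        # a single GPU
print(visible_gpu_ids("0,1,2,3", 8))  # four GPUs
```

Inside the process, the visible GPUs are renumbered from zero, so a script started with `CUDA_VISIBLE_DEVICES=2,3` sees its two devices as GPU 0 and GPU 1.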
-After loading one of the available TensorFlow modules, you can check the functionality by running the following Python script.
+### Multi-GPU TensorFlow Example
+This script uses `keras` and `TensorFlow` to train a simple neural network on the
+[MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset. It assumes that you have the
+`tensorflow` (2.x), `keras`, and `tensorflow_datasets` Python packages installed. The training
+is performed on multiple GPUs.
 ```python
+import tensorflow_datasets as tfds
 import tensorflow as tf
-c = tf.constant('Hello World!')
+datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
-sess = tf.Session()
-print(sess.run(c))
+mnist_train, mnist_test = datasets['train'], datasets['test']
+# Use NCCL reduction if NCCL is available; it should be the most efficient strategy
+strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce())
+# Different reduction strategy, use if NCCL causes errors
+# strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice())
+print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
+num_train_examples = info.splits['train'].num_examples
+num_test_examples = info.splits['test'].num_examples
+BUFFER_SIZE = 10000
+BATCH_SIZE_PER_REPLICA = 64
+BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
+def scale(image, label):
+  image = tf.cast(image, tf.float32)
+  image /= 255
+  return image, label
+train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
+eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
+# The following line makes sure that the model will run on multiple GPUs (if they are available)
+# Without `strategy.scope()`, the model would only be trained on a single GPU
+with strategy.scope():
+  model = tf.keras.Sequential([
+      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
+      tf.keras.layers.MaxPooling2D(),
+      tf.keras.layers.Flatten(),
+      tf.keras.layers.Dense(64, activation='relu'),
+      tf.keras.layers.Dense(10)
+  ])
+  model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+                optimizer=tf.keras.optimizers.Adam(),
+                metrics=['accuracy'])
+model.fit(train_dataset, epochs=100)
 ```
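
To make the batch-size arithmetic in the script above concrete, here is a minimal sketch that needs no GPU and no TensorFlow. It assumes the MNIST training split of 60,000 examples and reuses `BATCH_SIZE_PER_REPLICA = 64` from the script. The helper names `global_batch_size` and `steps_per_epoch` are hypothetical and only serve the illustration.

```python
import math

BATCH_SIZE_PER_REPLICA = 64  # same value as in the script above
NUM_TRAIN_EXAMPLES = 60000   # size of the MNIST training split


def global_batch_size(num_replicas: int) -> int:
    """Each replica (GPU) processes BATCH_SIZE_PER_REPLICA examples per step."""
    return BATCH_SIZE_PER_REPLICA * num_replicas


def steps_per_epoch(num_replicas: int) -> int:
    """Number of batches per epoch; the last batch may be smaller."""
    return math.ceil(NUM_TRAIN_EXAMPLES / global_batch_size(num_replicas))


for replicas in (1, 2, 4, 8):
    print(replicas, global_batch_size(replicas), steps_per_epoch(replicas))
```

Because the global batch grows with the number of GPUs, each epoch takes fewer optimizer steps. Depending on the model, the learning rate may need adjusting when the number of GPUs changes.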
+!!! note
+    If using the `NCCL` strategy causes runtime errors, try to run your application with the
+    environment variable `TF_FORCE_GPU_ALLOW_GROWTH` set to `true`.
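
As an alternative to exporting the variable in the shell, it can be set at the top of the script. This is a minimal sketch which assumes that the assignment runs before TensorFlow is imported, so that the value is in place when the GPUs are initialized.

```python
import os

# Set the variable before TensorFlow is imported (assumption: TensorFlow
# reads it when it initializes the GPUs).
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import TensorFlow only after the variable is set

print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```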
+!!! tip
+    For real-world multi-GPU training, it might be better to use a dedicated multi-GPU framework such
+    as [Horovod](https://github.com/horovod/horovod).
 <!---
+2022-10-14
+Add multi-GPU example script.
 2021-04-08
 It's necessary to load the correct NumPy module along with the Tensorflow one.