diff --git a/docs.it4i/software/machine-learning/introduction.md b/docs.it4i/software/machine-learning/introduction.md index 3bb01ad057cdc4e7fc296e6a48fc083762b60b99..df26dcf860241e43cb7237321a1ca3690a2d50ef 100644 --- a/docs.it4i/software/machine-learning/introduction.md +++ b/docs.it4i/software/machine-learning/introduction.md @@ -21,7 +21,7 @@ For more information, see the [official website][d] or [GitHub][e]. TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications. For more information, see the [official website][a]. -For the list of available versions, see the [TensorFlow][1] section: +For more information, see the [TensorFlow][1] section. ## Theano diff --git a/docs.it4i/software/machine-learning/tensorflow.md b/docs.it4i/software/machine-learning/tensorflow.md index 2059b8306a33d4bba6c4c7a056ffc54e92c61973..8e7386269a8c778331b87eb40d1d45da22adc678 100644 --- a/docs.it4i/software/machine-learning/tensorflow.md +++ b/docs.it4i/software/machine-learning/tensorflow.md @@ -1,46 +1,147 @@ # TensorFlow -TensorFlow is an open-source software library for machine intelligence. -For searching available modules type: +TensorFlow (TF) is an open-source software library which can compile tensor operations to execute +very quickly on both CPUs and GPUs. It is often used as a backend for machine learning libraries +and models. + +We strongly recommend using `TensorFlow 2.x`. TensorFlow 1 has long been deprecated, and it +will probably be difficult to make it run on GPUs on our clusters. + +## Installation + +For TensorFlow to work with GPUs, you have to use several libraries (CUDA, cuDNN, NCCL, etc.) +with mutually compatible versions.
+ +You can load the correct modules with the following command: ```console -$ ml av Tensorflow +$ ml TensorFlow ``` -<!--- +If you want to upgrade the TensorFlow version used in this package or install additional Python +modules, you can simply create a virtual environment and install a different TensorFlow version +inside it: + +```console +$ python3 -m venv venv +$ source venv/bin/activate +(venv) $ python3 -m pip install -U setuptools wheel pip +(venv) $ python3 -m pip install tensorflow +``` + +However, if you use a newer TensorFlow version than the one included in the `TensorFlow` module, +you should make sure that it is still compatible with the CUDA version provided by the module. +You can find the required `CUDA`/`cuDNN` versions for the latest TF +[here](https://www.tensorflow.org/install/pip). -## Salomon Modules +## TensorFlow Example -Salomon provides (besides other) these TensorFlow modules: +After loading TensorFlow, you can check its functionality by running the following Python script. -**Tensorflow/1.1.0** (not recommended), module built with: +```python +import tensorflow as tf -* GCC/4.9.3 -* Python/3.6.1 +a = tf.constant([1, 2, 3]) +b = tf.constant([2, 4, 6]) +c = a + b +print(c.numpy()) +``` -**Tensorflow/1.2.0-GCC-7.1.0-2.28** (default, recommended), module built with: +## Using TensorFlow With GPUs -* TensorFlow 1.2 with SIMD support. TensorFlow build taking advantage of the Salomon CPU architecture. -* GCC/7.1.0-2.28 -* Python/3.6.1 -* protobuf/3.2.0-GCC-7.1.0-2.28-Python-3.6.1 +With TensorFlow, you can leverage either a single GPU or multiple GPUs in a single process, e.g., to +train neural networks much faster. ---> +Using the available `TensorFlow` module should ensure that the required libraries are loaded correctly. + +### Selecting GPUs + +You can select how many and which (NVIDIA) GPUs will be used by TensorFlow with the +`CUDA_VISIBLE_DEVICES` environment variable.
+ +```console +# Do not use any GPUs +$ CUDA_VISIBLE_DEVICES=-1 python3 my_script.py +# Use a single GPU with ID 0 +$ CUDA_VISIBLE_DEVICES=0 python3 my_script.py +# Use multiple GPUs +$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 my_script.py +``` -## TensorFlow Application Example +By default, if you do not specify the environment variable, all available GPUs will be used by +TensorFlow. -After loading one of the available TensorFlow modules, you can check the functionality by running the following Python script. +### Multi-GPU TensorFlow Example + +This script uses `keras` and `TensorFlow` to train a simple neural network on the +[MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset. It assumes that you have +`tensorflow` (2.x), `keras` and `tensorflow_datasets` Python packages installed. The training +is performed on multiple GPUs. ```python +import tensorflow_datasets as tfds import tensorflow as tf -c = tf.constant('Hello World!') -sess = tf.Session() -print(sess.run(c)) +datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True) + +mnist_train, mnist_test = datasets['train'], datasets['test'] + +# Use NCCL reduction if NCCL is available, it should be the most efficient strategy +strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce()) + +# Different reduction strategy, use if NCCL causes errors +# strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice()) +print('Number of devices: {}'.format(strategy.num_replicas_in_sync)) + +num_train_examples = info.splits['train'].num_examples +num_test_examples = info.splits['test'].num_examples + +BUFFER_SIZE = 10000 + +BATCH_SIZE_PER_REPLICA = 64 +BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync + +def scale(image, label): + image = tf.cast(image, tf.float32) + image /= 255 + + return image, label + + +train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE) +eval_dataset = 
mnist_test.map(scale).batch(BATCH_SIZE) + +# The following line makes sure that the model will run on multiple GPUs (if they are available) +# Without `strategy.scope()`, the model would only be trained on a single GPU +with strategy.scope(): + model = tf.keras.Sequential([ + tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)), + tf.keras.layers.MaxPooling2D(), + tf.keras.layers.Flatten(), + tf.keras.layers.Dense(64, activation='relu'), + tf.keras.layers.Dense(10) + ]) + + model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), + optimizer=tf.keras.optimizers.Adam(), + metrics=['accuracy']) + +model.fit(train_dataset, epochs=100) ``` +!!! note + If using the `NCCL` strategy causes runtime errors, try to run your application with the + environment variable `TF_FORCE_GPU_ALLOW_GROWTH` set to `true`. + +!!! tip + For real-world multi-GPU training, it might be better to use a dedicated multi-GPU framework such + as [Horovod](https://github.com/horovod/horovod). + <!--- +2022-10-14 +Add multi-GPU example script. + 2021-04-08 It's necessary to load the correct NumPy module along with the Tensorflow one.