    # TensorFlow
    
    
    TensorFlow (TF) is an open-source software library which can compile tensor operations to execute
    very quickly on both CPUs and GPUs. It is often used as a backend for machine learning libraries
    and models.
    
    We strongly recommend using `TensorFlow 2.x`. TensorFlow 1 has long been deprecated, and it
    will probably be difficult to run on the GPUs of our clusters.
    
    ## Installation
    
    For TensorFlow to work with GPUs, it needs several libraries (CUDA, cuDNN, NCCL, etc.) in
    mutually compatible versions.
    
    You can load the correct modules with the following command:
    
    ```console
    $ module load TensorFlow
    ```
    
    
    If you want to upgrade the TensorFlow version used in this package or install additional Python
    modules, you can simply create a virtual environment and install a different TensorFlow version
    inside it:
    
    ```console
    $ python3 -m venv venv
    $ source venv/bin/activate
    (venv) $ python3 -m pip install -U setuptools wheel pip
    (venv) $ python3 -m pip install tensorflow
    ```
    
    However, if you use a newer TensorFlow version than the one included in the `TensorFlow` module,
    you should make sure that it is still compatible with the CUDA version provided by the module.
    The `CUDA`/`cuDNN` versions required by the latest TensorFlow release are listed in the
    [TensorFlow pip installation guide](https://www.tensorflow.org/install/pip).
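
To see which CUDA and cuDNN versions an installed TensorFlow build was compiled against, you can query its build information. The following is a minimal sketch, assuming TensorFlow 2.3 or newer (where `tf.sysconfig.get_build_info()` is available). The helper name `tensorflow_build_info` is hypothetical, and the CUDA-related entries are only present in GPU-enabled builds.

```python
def tensorflow_build_info():
    """Return the TensorFlow version and its CUDA/cuDNN build versions.

    Returns None if TensorFlow cannot be imported in the current environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not importable: load the module or activate the venv first
        return None

    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        # These keys are missing in CPU-only builds, hence .get()
        "cuda": build.get("cuda_version"),
        "cudnn": build.get("cudnn_version"),
    }


info = tensorflow_build_info()
if info is None:
    print("TensorFlow is not available in this environment")
else:
    for name, version in info.items():
        print(f"{name}: {version}")
```

Comparing the reported CUDA version with the CUDA module loaded on the cluster is a quick way to spot a mismatch before submitting a job.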
    
    ## TensorFlow Example
    
    After loading TensorFlow, you can check its functionality by running the following Python script.
    
    ```python
    import tensorflow as tf
    
    a = tf.constant([1, 2, 3])
    b = tf.constant([2, 4, 6])
    c = a + b
    print(c.numpy())
    ```
    
    ## Using TensorFlow With GPUs
    
    TensorFlow can use a single GPU or multiple GPUs within one process, for example to train
    neural networks much faster.
    
    Loading the provided `TensorFlow` module should ensure that the required GPU libraries (CUDA,
    cuDNN, NCCL, etc.) are loaded in compatible versions.
    
    You can select how many and which (NVIDIA) GPUs TensorFlow uses with the
    `CUDA_VISIBLE_DEVICES` environment variable.
    
    ```console
    # Do not use any GPUs
    $ CUDA_VISIBLE_DEVICES=-1 python3 my_script.py
    # Use a single GPU with ID 0
    $ CUDA_VISIBLE_DEVICES=0 python3 my_script.py
    # Use multiple GPUs
    $ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 my_script.py
    ```
    
    By default, if you do not specify the environment variable, all available GPUs will be used by
    TensorFlow.
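
The same selection can also be made from inside a Python script, provided the variable is set before TensorFlow initializes the GPUs (in practice, before `import tensorflow`). The following is a minimal sketch; `select_gpus` is a hypothetical helper, not part of TensorFlow.

```python
import os


def select_gpus(gpu_ids):
    """Restrict this process to the given GPU IDs; an empty list hides all GPUs."""
    value = ",".join(str(i) for i in gpu_ids) if gpu_ids else "-1"
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this before importing TensorFlow, otherwise the setting may have no effect
select_gpus([0, 1])
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints "0,1"
```

After this call, `import tensorflow as tf` followed by `tf.config.list_physical_devices("GPU")` should report only the selected devices.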
    
    ### Multi-GPU TensorFlow Example
    
    This script uses `keras` and `TensorFlow` to train a simple neural network on the
    [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset. It assumes that you have
    `tensorflow` (2.x), `keras` and `tensorflow_datasets` Python packages installed. The training
    is performed on multiple GPUs.
    
    ```python
    import tensorflow as tf
    import tensorflow_datasets as tfds
    
    
    datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
    
    mnist_train, mnist_test = datasets['train'], datasets['test']
    
    # Use NCCL reduction if NCCL is available, it should be the most efficient strategy
    strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce())
    
    # Different reduction strategy, use if NCCL causes errors
    # strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice())
    print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
    
    num_train_examples = info.splits['train'].num_examples
    num_test_examples = info.splits['test'].num_examples
    
    BUFFER_SIZE = 10000
    
    BATCH_SIZE_PER_REPLICA = 64
    BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
    
    def scale(image, label):
      image = tf.cast(image, tf.float32)
      image /= 255
    
      return image, label
    
    
    train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
    eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
    
    # The following line makes sure that the model will run on multiple GPUs (if they are available)
    # Without `strategy.scope()`, the model would only be trained on a single GPU
    with strategy.scope():
      model = tf.keras.Sequential([
          tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(64, activation='relu'),
          tf.keras.layers.Dense(10)
      ])
    
      model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                    optimizer=tf.keras.optimizers.Adam(),
                    metrics=['accuracy'])
    
    model.fit(train_dataset, epochs=100)
    ```
    
    !!! note
        If using the `NCCL` strategy causes runtime errors, try to run your application with the
        environment variable `TF_FORCE_GPU_ALLOW_GROWTH` set to `true`.
    
    !!! tip
        For real-world multi-GPU training, it might be better to use a dedicated multi-GPU framework such
        as [Horovod](https://github.com/horovod/horovod).
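
The batch-size scaling used in the script above (`BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync`) can be illustrated with plain arithmetic. The following is a minimal sketch that does not need TensorFlow. The helper names are hypothetical, and it assumes the standard MNIST training split of 60,000 examples.

```python
import math

MNIST_TRAIN_EXAMPLES = 60_000  # size of the standard MNIST training split
BATCH_SIZE_PER_REPLICA = 64    # same value as in the script above


def global_batch_size(per_replica, num_replicas):
    """Batch size of the whole job: each replica (GPU) processes `per_replica` examples."""
    return per_replica * num_replicas


def steps_per_epoch(num_examples, batch_size):
    """Number of batches per epoch; the last batch may be smaller than the rest."""
    return math.ceil(num_examples / batch_size)


for num_gpus in (1, 2, 4, 8):
    batch = global_batch_size(BATCH_SIZE_PER_REPLICA, num_gpus)
    steps = steps_per_epoch(MNIST_TRAIN_EXAMPLES, batch)
    print(f"{num_gpus} GPU(s): global batch {batch}, {steps} steps per epoch")
```

Keeping the per-replica batch size fixed means that each added GPU increases the global batch size and reduces the number of steps per epoch, which is where the speed-up comes from. Larger global batches may require retuning the learning rate.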
    
    
    <!---

    2022-10-14
    Add multi-GPU example script.

    2021-04-08
    It's necessary to load the correct NumPy module along with the Tensorflow one.

    2021-03-31
    ## Notes
    As of 2021-03-23, TensorFlow is made available only on the Salomon cluster

    Tensorflow-tensorboard/1.5.1-Py-3.6 has not been tested.

    -->