Power8 GPU Nodes


IBM Power 8 GPU Cluster

Hardware:

5 S822LC IBM Power 8 servers, each with:
 4 Nvidia P100 16GB GPUs
 2 sockets x 8 cores, 3 threads per core
 512 GB memory


Software:

CUDA/10.1
PGI/19.1
Anaconda3/2019.3


Lmod

The cluster runs a local Lmod system that is not connected to the Compute Canada Lmod stack.
To find a software module, use 'module spider <software-name>'.

e.g.
hpc1006@cac155: module spider cuda

---------------------------------------------------------------------------------------
  cuda: cuda/10.1
---------------------------------------------------------------------------------------
    Description:
      CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and
      implemented by the graphics processing units (GPUs) that they produce. CUDA gives developers access to the virtual instruction set
      and memory of the parallel computational elements in CUDA GPUs.

hpc1006@cac155: module load cuda/10.1
hpc1006@cac155: module list
Currently Loaded Modules:
  1) cuda/10.1
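
Once the CUDA module is loaded, a quick sanity check is to confirm that the compiler and the GPUs are visible. These are the standard NVIDIA tools, not anything specific to this cluster:

 nvcc --version          # should report the CUDA 10.1 toolkit
 nvidia-smi              # should list the node's P100 GPUs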

Anaconda

There is a global install of Anaconda3 2019.3 (Python 3.7.3). To use it, load it via Lmod:

 i.e.
 hpc1006@cac155 [108]: module spider anaconda

  ----------------------------------------------
  anaconda3: anaconda3/2019.3
  -----------------------------------------------
    Description:
      Anaconda Python distribution - version 3 - 2019-03


    This module can be loaded directly: module load anaconda3/2019.3

 hpc1006@cac155 [109]: module load anaconda3/2019.3
 hpc1006@cac155 [110]: module list

 Currently Loaded Modules:
  1) anaconda3/2019.3
 


IBM Watson Machine Learning Community Edition

IBM PowerAI (now Watson Machine Learning Community Edition) is a collection of deep learning frameworks such as TensorFlow, Caffe, and PyTorch. They are installed in the global Anaconda3 install and can be used by loading the anaconda3/2019.3 module.

e.g.

 module load anaconda3/2019.3
 hpc1006@cac155 [107]: python
 Python 3.7.3 (default, Mar 27 2019, 22:31:02) 
 [GCC 7.3.0] :: Anaconda, Inc. on linux
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import tensorflow
 2019-11-21 15:57:02.005669: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1 
 >>>
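
To confirm that TensorFlow can actually use the GPUs, a quick check along these lines should work (a sketch assuming the installed TensorFlow is a 1.x build, as the libcudart 10.1 message above suggests; run it on a GPU node):

 module load anaconda3/2019.3
 python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"   # prints True if a GPU is usable
 nvidia-smi                                                               # lists the P100s on the node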



Miniconda:

Users are encouraged to install their own copy of Miniconda in their home directory.

e.g.  Choose a version here: https://repo.continuum.io/miniconda/
Note:  it must be a ppc64le version
>wget https://repo.continuum.io/miniconda/Miniconda3-4.7.12.1-Linux-ppc64le.sh
>bash Miniconda3-4.7.12.1-Linux-ppc64le.sh
- Answer 'yes' to accept the license agreement
- Choose an install directory; the default is $HOME/miniconda3, but you may want to install into
a custom directory for the Power PC build to avoid conflicting with any x86 installs,
e.g. mkdir $HOME/ppc and use $HOME/ppc/miniconda3 as your install directory
- Answer 'yes' to running conda init if you want the miniconda/bin directory in your $PATH
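
Once Miniconda is installed, a typical next step is to create and activate a private environment for your packages. A minimal sketch, assuming the $HOME/ppc/miniconda3 install location from above and a hypothetical environment name 'myenv':

 source $HOME/ppc/miniconda3/etc/profile.d/conda.sh   # makes 'conda' available if you skipped 'conda init'
 conda create -n myenv python=3.7 numpy               # environment name and packages are only examples
 conda activate myenv
 python --version                                     # should report the Python from the new environment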


Submitting Jobs

The Power 8 GPU nodes use the Slurm resource manager. See the CAC wiki's Slurm documentation for details.

Here is a sample Slurm sbatch script (how to submit it is shown after the script):

#!/bin/bash
#SBATCH --export=none                    # clear your env vars
#SBATCH --partition=power-gpu            # specify the Power GPU nodes
#SBATCH --account=<your-account-here>
#SBATCH --qos=gpu                        # Specify the Quality of Service needed for using GPUs at CAC
#SBATCH --gres=gpu:1                     # Number of GPUs (per node)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=500M                       # memory (per node)
#SBATCH --time=0-01:00                   # time limit (DD-HH:MM)
export LD_LIBRARY_PATH=/usr/lib64/nvidia # set Nvidia library path
hostname                                 # show node it ran on
source /etc/profile.d/z-20-lmod.sh       # load Lmod environment
module spider cuda                       # search Lmod for CUDA modules
module load cuda/10.1                    # load CUDA module
module list
cd /global/home/hpc1006/nvcctest          # sample job
rm /global/home/hpc1006/nvcctest/hello
nvcc hello.cu -o hello
/global/home/hpc1006/nvcctest/hello
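
Assuming the script above is saved as, say, power-gpu-test.sh (a hypothetical filename), it can be submitted and monitored with the usual Slurm commands:

 sbatch power-gpu-test.sh        # submit the job; Slurm prints the job ID
 squeue -u $USER                 # check your queued and running jobs
 sacct -j <jobid>                # accounting details once the job finishes
 cat slurm-<jobid>.out           # job output, written in the submission directory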