= Power8 GPU Nodes =
=== NOTE: THIS CLUSTER IS CURRENTLY OFFLINE AND WILL BE REDEPLOYED AS A SPECIAL PROJECT CLUSTER ===
== IBM Power 8 GPU Cluster ==
The Power 8 GPU cluster currently consists of 5 IBM S822LC Power 8 servers, each with 4 Nvidia P100 GPUs with 16 GB of memory.

This cluster is currently in beta, so expect changes. Potential users will need to contact CAC and make arrangements to access these machines.
== Logging In: ==
Once users have an activated CAC account, they will need to log into login.cac.queensu.ca.

Users must log into the main CAC login node and then jump to the Power 8 login node (cac151), as you need to be on a node with the correct architecture to test jobs.
 e.g. ssh <username>@login.cac.queensu.ca
Once there, they can jump to the Power 8 GPU login node:
 e.g. ssh <username>@cac151
This node is for downloading and installing any packages that you may require. Compiling must also be done on this node and not on the regular CAC login node, as the architecture is different.
Test jobs may also be run on this node. Note: test jobs must be short and for testing purposes only. Production jobs must be submitted via [[Power8_GPU_Nodes#Submitting Jobs|Slurm]] to the Power GPU compute nodes.
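If you log in regularly, the two hops can be combined into a single command. A minimal sketch, assuming OpenSSH 7.3 or newer on your workstation; the ''p8login'' alias and <username> are placeholders, not official CAC names:
<pre>
# Jump through the main CAC login node directly to the Power 8 login node
ssh -J <username>@login.cac.queensu.ca <username>@cac151

# Optional: add an alias to ~/.ssh/config so that "ssh p8login" does the same jump
cat >> ~/.ssh/config <<'EOF'
Host p8login
    HostName cac151
    User <username>
    ProxyJump <username>@login.cac.queensu.ca
EOF
</pre>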
== Hardware: ==
 5 - S822LC IBM Power 8 servers, each with:
  4 - Nvidia P100 16GB GPUs
  2 - 8 core sockets, 3 threads per core
  512 GB memory
== Software: ==
 CUDA/10.1
 PGI/19.1
 Anaconda3/2019.3
== Lmod ==
The cluster runs a local Lmod system that is not connected to the Compute Canada Lmod.
To find a software module, use '''module spider <software-name>'''
<pre>
e.g.
hpc1006@cac155: module spider cuda

---------------------------------------------------------------------------------------
  cuda: cuda/10.1
---------------------------------------------------------------------------------------
    Description:
      CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform and
      programming model created by NVIDIA and implemented by the graphics processing units (GPUs)
      that they produce. CUDA gives developers access to the virtual instruction set and memory
      of the parallel computational elements in CUDA GPUs.

hpc1006@cac155: module load cuda/10.1
hpc1006@cac155: module list

Currently Loaded Modules:
  1) cuda/10.1
</pre>
=== Anaconda ===
There is a global install of Anaconda3 2019.3 (Python 3.7.3). To use it, load it via Lmod:
<pre>
i.e.
hpc1006@cac155 [108]: module spider anaconda

-----------------------------------------------
  anaconda3: anaconda3/2019.3
-----------------------------------------------
    Description:
      Anaconda Python distribution - version 3 - 2019-03

    This module can be loaded directly: module load anaconda3/2019.3

hpc1006@cac155 [109]: module load anaconda3/2019.3
hpc1006@cac155 [110]: module list

Currently Loaded Modules:
  1) anaconda3/2019.3
</pre>
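If you need packages that are not in the global install, one option is to create a personal conda environment with it. A minimal sketch, assuming the global Anaconda is read-only for users; the $HOME/ppc/envs/mytools path and the package list are placeholders:
<pre>
module load anaconda3/2019.3

# Create an environment under your own home directory (the global install is not writable)
conda create --prefix $HOME/ppc/envs/mytools python=3.7 numpy

# Activate it by path; if activation complains, run "conda init bash" once and log in again
conda activate $HOME/ppc/envs/mytools
python -c "import numpy; print(numpy.__version__)"
</pre>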
=== IBM Watson Machine Learning Community Edition ===
The IBM PowerAI/Deep Learning framework is a collection of packages such as TensorFlow, Caffe and PyTorch. They are installed in the global Anaconda3 install and can be used by loading the anaconda3/2019.3 module.

'''e.g.'''
<pre>
module load anaconda3/2019.3
hpc1006@cac155 [107]: python
Python 3.7.3 (default, Mar 27 2019, 22:31:02)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
2019-11-21 15:57:02.005669: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
>>>
</pre>
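A quick way to confirm that the bundled frameworks can actually see a P100 is to query one of them. A minimal sketch, assuming the PyTorch that ships with the frameworks is importable and that you run it where a GPU has been allocated to you (a short test on the login node, or inside a Slurm job):
<pre>
module load anaconda3/2019.3

# Ask PyTorch whether CUDA is usable and how many GPUs it can see
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

# Cross-check with the driver's view of the GPUs
nvidia-smi
</pre>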
== Miniconda: ==
It is recommended that users install their own choice of Miniconda in their home directory.

 e.g. Choose a version here: https://repo.continuum.io/miniconda/   Note: it must be a ppc64le version
<pre>
>wget https://repo.continuum.io/miniconda/Miniconda3-4.7.12.1-Linux-ppc64le.sh
>bash Miniconda3-4.7.12.1-Linux-ppc64le.sh
- Answer 'yes' to accept the agreement
- Choose an install directory; the default is $HOME/miniconda3, but you may choose a custom
  directory for Power PC to avoid conflicting with any x86 installs.
  e.g. mkdir $HOME/ppc and use $HOME/ppc/miniconda3 as your install directory
- Answer 'yes' to running conda init if you want the miniconda/bin directory in your $PATH
</pre>
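After the installer finishes, it is worth confirming that you picked up the right build for the architecture. A minimal sketch, assuming the $HOME/ppc/miniconda3 install directory from the example above:
<pre>
uname -m                                  # should print ppc64le on the Power 8 nodes
$HOME/ppc/miniconda3/bin/conda --version  # confirm the install works
$HOME/ppc/miniconda3/bin/python -c "import platform; print(platform.machine())"
</pre>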
== Submitting Jobs ==
The Power 8 GPU nodes use a resource manager called [https://slurm.schedmd.com/overview.html Slurm]. See the [https://cac.queensu.ca/wiki/index.php/SLURM CAC wiki SLURM page] for details. Note: currently, due to conflicts with the non-PowerPC architecture of the rest of the Frontenac cluster, your script must include:
 #SBATCH --export=none                               # clear your env vars
 #SBATCH --export LD_LIBRARY_PATH=/usr/lib64/nvidia  # set Nvidia library path
Here is a sample Slurm sbatch script:
<pre>
#!/bin/bash
#SBATCH --export=none              # clear your env vars
#SBATCH --partition=power-gpu      # specify the Power GPU nodes
#SBATCH --account=<your-account-here>
#SBATCH --qos=gpu                  # Quality of Service needed for using GPUs at CAC
#SBATCH --gres=gpu:1               # number of GPUs (per node)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=500M                 # memory (per node)
#SBATCH --time=0-01:00             # time (DD-HH:MM)
export LD_LIBRARY_PATH=/usr/lib64/nvidia   # set Nvidia library path
hostname                                   # show the node it ran on
source /etc/profile.d/z-20-lmod.sh         # load the Lmod environment
module spider cuda                         # search Lmod for CUDA modules
module load cuda/10.1                      # load the CUDA module
module list
cd /<your_directory>/nvcctest              # sample job directory
rm /<your_directory>/nvcctest/hello        # remove the previous run
nvcc hello.cu -o hello                     # see below for hello.cu
/<your_directory>/nvcctest/hello           # execute the hello binary
</pre>
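To use it, save the script on the cluster (power8_gpu_test.sh below is just a placeholder name) and submit it from cac151 with the standard Slurm client commands:
<pre>
sbatch power8_gpu_test.sh   # submit the job; Slurm prints the job ID
squeue -u $USER             # check the state of your queued and running jobs
scancel <jobid>             # cancel a job if needed
cat slurm-<jobid>.out       # output lands here by default once the job has run
</pre>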
'''hello.cu'''
<pre>
#include <stdio.h>

__global__ void print_kernel() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    print_kernel<<<10, 10>>>();
    cudaDeviceSynchronize();
}
</pre>
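For a quick interactive check on the Power 8 login node (short test runs only, as noted in the Logging In section), the same compile-and-run steps can be done by hand; a minimal sketch, assuming hello.cu is in your current directory:
<pre>
module load cuda/10.1    # make nvcc available
nvcc hello.cu -o hello   # compile the sample kernel
./hello                  # prints one "Hello from block ..." line per block/thread pair
</pre>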