Hardware:SW

From CAC Wiki
Revision as of 19:18, 8 September 2016 by Hasch (Talk | contribs)

The SW cluster is presently our main compute cluster. Note that it has undergone a major hardware upgrade, so large portions of these pages are subject to change. Please check back occasionally for updates.

The SW (Linux) Cluster

Software Linux Cluster

The Centre for Advanced Computing operates a cluster of x86-based multicore machines running Linux. This page explains the essential features of this cluster and is meant as a basic guide to its usage.

SW (Linux) Cluster Nodes (sw series)
Host CPU model Speed Cores Threads Memory
sw0013 Xeon X5675 3.07GHz 12 24 64 GB
sw0014 Xeon X5675 3.07GHz 12 24 64 GB
sw0015 Xeon X5675 3.07GHz 12 24 64 GB
sw0016 Xeon X5675 3.07GHz 12 24 64 GB
sw0017 Xeon X5675 3.07GHz 12 24 64 GB
sw0018 Xeon X5675 3.07GHz 12 24 64 GB
sw0019 Xeon X5675 3.07GHz 12 24 64 GB
sw0020 Xeon X5675 3.07GHz 12 24 64 GB
sw0021 Xeon X5675 3.07GHz 12 24 64 GB
sw0022 Xeon X5675 3.07GHz 12 24 64 GB
sw0023 Xeon X5675 3.07GHz 12 24 32 GB
sw0024 Xeon X5675 3.07GHz 12 24 32 GB
sw0025 Xeon X5675 3.07GHz 12 24 32 GB
sw0026 Xeon X5675 3.07GHz 12 24 32 GB
sw0027 Xeon X5675 3.07GHz 12 24 32 GB
sw0028 Xeon X5675 3.07GHz 12 24 32 GB
sw0029 Xeon X5675 3.07GHz 12 24 32 GB
sw0030 Xeon X5675 3.07GHz 12 24 32 GB
sw0031 Xeon X5675 3.07GHz 12 24 32 GB
sw0032 Xeon X5675 3.07GHz 12 24 32 GB
sw0033 Xeon X5675 3.07GHz 12 24 32 GB
sw0034 Xeon X5675 3.07GHz 12 24 32 GB
sw0035 Xeon X5670 2.93GHz 12 24 64 GB
sw0036 Xeon X5670 2.93GHz 12 24 64 GB
sw0037 Xeon X5670 2.93GHz 12 24 64 GB
sw0038 Xeon X5670 2.93GHz 12 24 64 GB
sw0039 Xeon X5670 2.93GHz 12 24 64 GB
sw0040 Xeon X5670 2.93GHz 12 24 64 GB
sw0041 Xeon E7- 4860 2.27GHz 40 80 256 GB
sw0042 Xeon E7- 4860 2.27GHz 40 80 256 GB
sw0043 Xeon E7- 4860 2.27GHz 40 80 256 GB
sw0044 Xeon E7- 4860 2.27GHz 40 80 256 GB
sw0045 Xeon E7- 4860 2.27GHz 40 80 256 GB
sw0046 Xeon E7- 4860 2.27GHz 40 80 256 GB
sw0047 Xeon E7- 4860 2.27GHz 40 80 256 GB
sw0048 Xeon E7- 4860 2.27GHz 40 80 256 GB
sw0049 Xeon E7- 4860 2.27GHz 40 80 256 GB
sw0050 Xeon E7- 4860 2.27GHz 40 80 1 TB
sw0051 Xeon E7- 4860 2.27GHz 40 80 1 TB
sw0052 Xeon E7- 8860 2.27GHz 80 160 512 GB
sw0053 Xeon E7- 8870 2.40GHz 80 160 512 GB
sw0054 Xeon E7- 8860 2.27GHz 80 160 512 GB
sw0055 Xeon X5680 3.33GHz 12 24 144 GB
sw0056 Xeon X5680 3.33GHz 12 24 144 GB
sw0057 Xeon X5680 3.33GHz 12 24 144 GB
SW (Linux) Cluster Nodes (cac series)
Host CPU model Speed Cores Threads Memory
cac011 E5-2650 2.2 GHz 24 48 256 GB
cac012 E5-2650 2.2 GHz 24 48 256 GB
cac013 E5-2650 2.2 GHz 24 48 256 GB
cac014 E5-2650 2.2 GHz 24 48 256 GB
cac015 E5-2650 2.2 GHz 24 48 256 GB
cac016 E5-2650 2.2 GHz 24 48 256 GB
cac017 E5-2650 2.2 GHz 24 48 256 GB
cac018 E5-2650 2.2 GHz 24 48 256 GB
cac019 E5-2650 2.2 GHz 24 48 256 GB
cac020 E5-2650 2.2 GHz 24 48 256 GB
cac021 E5-2650 2.2 GHz 24 48 256 GB
cac022 E5-2650 2.2 GHz 24 48 256 GB
cac023 E5-2650 2.2 GHz 24 48 256 GB
cac024 E5-2650 2.2 GHz 24 48 256 GB
cac025 E5-2650 2.2 GHz 24 48 256 GB
cac026 E5-2650 2.2 GHz 24 48 256 GB
cac027 E5-2650 2.2 GHz 24 48 256 GB
cac034 E5-2650 2.2 GHz 24 48 256 GB
cac035 E5-2650 2.2 GHz 24 48 256 GB
cac036 E5-2650 2.2 GHz 24 48 256 GB
cac037 E5-2650 2.2 GHz 24 48 256 GB
cac038 E5-2650 2.2 GHz 24 48 256 GB
cac039 E5-2650 2.2 GHz 24 48 256 GB
cac040 E5-2650 2.2 GHz 24 48 256 GB
cac041 E5-2650 2.2 GHz 24 48 256 GB
cac042 E5-2650 2.2 GHz 24 48 256 GB
cac043 E5-2650 2.2 GHz 24 48 256 GB
cac044 E5-2650 2.2 GHz 24 48 256 GB
cac045 E5-2650 2.2 GHz 24 48 256 GB
cac046 E5-2650 2.2 GHz 24 48 256 GB
cac047 E5-2650 2.2 GHz 24 48 256 GB
cac048 E5-2650 2.2 GHz 24 48 256 GB
cac049 E5-2650 2.2 GHz 24 48 256 GB
cac050 E5-2650 2.2 GHz 24 48 256 GB
cac051 E5-2650 2.2 GHz 24 48 256 GB
cac052 E5-2650 2.2 GHz 24 48 256 GB
cac053 E5-2650 2.2 GHz 24 48 256 GB
cac054 E5-2650 2.2 GHz 24 48 256 GB
cac055 E5-2650 2.2 GHz 24 48 256 GB
cac056 E5-2650 2.2 GHz 24 48 256 GB
cac057 E5-2650 2.2 GHz 24 48 256 GB
cac058 E5-2650 2.2 GHz 24 48 256 GB
cac059 E5-2650 2.2 GHz 24 48 256 GB
cac060 E5-2650 2.2 GHz 24 48 256 GB
cac061 E5-2650 2.2 GHz 24 48 256 GB
cac062 E5-2650 2.2 GHz 24 48 256 GB
cac063 E5-2650 2.2 GHz 24 48 256 GB
cac064 E5-2650 2.2 GHz 24 48 256 GB
cac065 E5-2650 2.2 GHz 24 48 256 GB
cac066 E5-2650 2.2 GHz 24 48 256 GB
cac067 E5-2650 2.2 GHz 24 48 256 GB
cac068 E5-2650 2.2 GHz 24 48 256 GB
cac069 E5-2650 2.2 GHz 24 48 256 GB

Type of Hardware

This cluster consists of X86 multicore nodes made by Dell, IBM, and Lenovo. All nodes run CentOS Linux and share a file system. Access is handled by Grid Engine. The server nodes are called sw0004...sw0059 and cac011...cac069.

  • Presently, the workup node of the HPCVL "Software Cluster" is swlogin1. This is a Dell PowerEdge R410 server with two sockets, each holding a 6-core Intel® Xeon® processor (Intel X5675) running at 3.1 GHz.
  • Some of the nodes in the SW cluster (sw0015-40) are Dell PowerEdge R410 servers with two sockets, each holding a 6-core Intel Xeon processor (Intel X5670 or X5675) running at 2.93 or 3.07 GHz, respectively. These nodes offer a total of 12 cores that are 2-fold hyperthreaded, i.e. they support up to 24 threads. The scheduler is configured such that only 12 threads are run at a time. These nodes have 64 Gbyte (sw0015-22, sw0035-40) or 32 Gbyte (sw0023-34) of physical memory.
  • Some nodes (sw0041-51) are IBM XServers 3850-X5 that are also based on the Intel® Xeon® processor (Intel E7-4860). These servers have a total of 40 cores per node and support up to 80 threads (hyperthreading). The clock speed of these machines is 2.27 GHz. Two of these servers (sw0050-51) have 1 TB of physical memory; the others have 256 GB.
  • Two of our nodes (sw0052, sw0054) are IBM servers based on Intel E7-8860 processors with 80 cores total (160 threads) running at 2.27 GHz, while another one (sw0053) with 80 cores (160 threads) uses the E7-8870 at 2.4 GHz. Each of the three has 512 GB of memory.
  • A further 5 nodes (cac028...33) were added at the same time (August 2016). These are of the Lenovo System x3950 x6 8-socket type with 8 x Intel E7-8867 v3 16-core processors at 2.5 GHz for a total of 128 cores (dually hyperthreaded). Each of these units has a total of 2 TB of memory. They are used for special applications that require high memory.
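
To compare the core, thread, and memory figures listed above with what a given node actually reports, standard Linux interfaces are enough. The following is a minimal sketch; the function names logical_cpus and total_mem_gb are illustrative and not part of the cluster software, and the memory value is what the kernel reports, which is usually slightly below the installed amount.

```shell
#!/bin/bash
# Count logical CPUs (hardware threads) visible to the operating system.
logical_cpus() {
    grep -c '^processor' /proc/cpuinfo
}

# Report the memory seen by the kernel in GB, rounded down.
# MemTotal in /proc/meminfo is given in kB.
total_mem_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

echo "Logical CPUs: $(logical_cpus)"
echo "Memory (GB):  $(total_mem_gb)"
```

On a hyperthreaded node, the first number would be expected to match the Threads column of the tables above rather than the Cores column.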

Why these Systems?

The main emphasis of these systems is high floating-point performance for a modest number of processes / threads. Since commercial software such as Fluent and Abaqus offer support for Linux only, this cluster was originally acquired to offer recent versions of these software packages. In addition, the higher single-core performance of these nodes allows for an efficient use of license seats, which are usually priced per core.

Who Should Use This Cluster?

The software cluster runs on the Linux operating system and should be used by anyone who wants to run applications that are available on that platform. Runs that require more than 32 Gbyte of memory need to request this explicitly to avoid mis-scheduling.

We suggest you use this cluster if:

  • Your application is floating-point intensive with modest amounts of memory.
  • Your application is commercial or public-domain software that supports Linux.
  • Your application is explicitly parallel (for instance, using MPI) and has low communication requirements, or is multi-threaded with a small number (typically no more than 12) of scaling threads.
  • Your application uses a commercial license that is scaled per process.

This cluster might not be suitable if:

  • You need to perform a large number of relatively short jobs, each serial or with very few threads.
  • Your application is very memory intensive. Long waiting times may be the consequence.
  • Your application is required to scale to a very large number of processes in a distributed-memory fashion and is communication intensive. Such jobs require a fast interconnect (Infiniband or similar) and should be run on a different system, for instance other Compute Canada installations.

If you think your application could run more efficiently on these machines, please contact us (help@hpcvl.org) to discuss any concerns and let us assist you in getting started.

Note that we have to enforce dedicated cores or CPUs to avoid sharing and context-switching overheads. "Overloading", i.e. running more processes or threads than the cores assigned to a job, is not allowed.

Using the Cluster

Access

ssh hpcXXXX@130.15.59.64
hpcXXXX@130.15.59.64's password: *****
hpcXXXX@sflogin0$ ssh swlogin1
hpcXXXX@swlogin1's password: ***** 

The file systems for all of our clusters are shared, so you will be using the same home directory as when you are using the M9000 servers or the standard login node sfnode0. swlogin1 can be used for compilation, program development, and testing only, not for production jobs.

Compiling Code

Intel Compiler Suite

The best compiler to use is the Intel Compiler Suite. This includes compilers for Fortran, C, and C++, as well as MPI and OpenMP support, debuggers, and a development suite. This software resides in /opt/ics. The versions are:

  • Fortran (ifort): Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1 Build 20110811
  • C (icc): Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1 Build 20110811
  • C++ (icpc): Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1 Build 20110811

This compiler suite needs to be activated before use. The command is

use icsmpi

GNU Compilers

In many cases, especially for public-domain software, the preferred compilers are GNU C/C++/Fortran. The system version of these is:

Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info 
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix 
--enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions 
--enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk 
--disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile 
--enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib 
--with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC)

No special activation is needed to use these, as they reside in a system directory. A newer version of this compiler set is available in /opt/gcc-4.8.3 and can be accessed using the command

use gcc-4.8.3

If MPI is required, it can be loaded through

use openmpi
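
Whether the system compiler is sufficient or the newer one needs to be activated depends on what a given build requires. As a rough sketch, the active gcc version can be compared against a minimum. The helper version_at_least and the value of required are illustrative assumptions, not part of the cluster setup, and the comparison relies on the -V (version sort) option of GNU sort.

```shell
#!/bin/bash
# Succeed if version $1 is greater than or equal to version $2.
# Uses GNU sort -V to order the two version strings.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical minimum version needed by the software being built.
required="4.8.3"

# Version of the gcc currently first in PATH; "0" if gcc is not found.
active="$(gcc -dumpversion 2>/dev/null || echo 0)"

if version_at_least "$active" "$required"; then
    echo "gcc $active is sufficient"
else
    echo "gcc $active is older than $required; consider: use gcc-4.8.3"
fi
```

With the system compiler described above (4.4.7) and this example minimum, the script would suggest activating the newer compiler set.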

For applications that cannot be re-compiled (for instance, because the source code is not accessible), a pre-compiled Linux version (x64 for Redhat will do the trick) needs to be obtained.

Running Jobs

As mentioned earlier, program runs for user and application software on the login node are allowed only for test purposes or if interactive use is unavoidable. In the latter case, please get in touch to let us know what you need. Production jobs must be submitted through the Grid Engine load scheduler.

You need to add the following two lines to your script for your job to be scheduled to the Linux SW cluster exclusively:

#$ -q abaqus.q 
#$ -l qname=abaqus.q

The queue name abaqus.q derives from Abaqus, the first software package that was (and still is) run on this cluster.

Note that your jobs will run on dedicated threads, i.e. typically up to 12 processes can be scheduled to a single node. The Grid Engine will do the scheduling, i.e. there is no way for the user to determine which processes run on which cores.
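
Putting the pieces together, a job script for this cluster might look like the sketch below. Only the two abaqus.q lines come from this page; the job name, the output file names, and the executable my_program are hypothetical placeholders. The other options (-S, -cwd, -N, -o, -e) are standard Grid Engine qsub options. Requesting more than one slot additionally needs a parallel environment (-pe), whose name is site-specific and is not given here.

```shell
#!/bin/bash
# Run the job under bash and start it in the submission directory.
#$ -S /bin/bash
#$ -cwd
# Hypothetical job name and output files.
#$ -N sw_test
#$ -o sw_test.out
#$ -e sw_test.err
# The two lines that restrict scheduling to the Linux SW cluster.
#$ -q abaqus.q
#$ -l qname=abaqus.q

# Grid Engine sets NSLOTS to the number of granted slots. Fall back to 1
# so the script also runs outside the scheduler, for instance in a short test.
slots="${NSLOTS:-1}"

# Never start more threads than slots granted, to avoid overloading a node.
export OMP_NUM_THREADS="$slots"
echo "Host: $(hostname), slots: $slots"

# Hypothetical executable; replace with your own program.
program="./my_program"
if [ -x "$program" ]; then
    "$program"
else
    echo "Executable $program not found" >&2
fi
```

Assuming the script is saved as sw_test.sh, it would be submitted with qsub sw_test.sh, and its status can be checked with qstat.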

Help?

General information about using HPCVL facilities can be found in our FAQ pages. We also supply user support (please send email to help@hpcvl.org or contact us directly), so if you experience problems, we can assist you.