Hardware:SW

From CAC Wiki
Revision as of 19:09, 10 May 2016 by Hasch (Talk | contribs)


The SW (Linux) Cluster

Software Linux Cluster

The Centre for Advanced Computing operates a cluster of x86-based multicore machines running Linux. This page explains the essential features of this cluster and serves as a basic guide to its usage.

Type of Hardware

Our cluster consists of multiple x86 multicore nodes made by Dell and IBM (based on Intel Xeon X5600-series or E7-series processors). All nodes run CentOS Linux and share a common file system. Access is handled by the Grid Engine scheduler. The server nodes are named sw0013 through sw0057.

  • Presently, the workup node of the HPCVL "Software Cluster" is swlogin1. This is a Dell PowerEdge R410 server with two sockets, each holding a 6-core Intel® Xeon® X5675 processor running at 3.07 GHz.
  • Most of the nodes of the SW cluster (sw0015-40) are Dell PowerEdge R410 servers with two sockets, each holding a 6-core Intel Xeon processor (X5670 or X5675) running at 2.93 or 3.07 GHz. These nodes offer a total of 12 cores that are 2-fold hyperthreaded, i.e. they support up to 24 threads. The scheduler is configured so that only 12 threads run at a time. These nodes have 64 GB (sw0015-22, sw0035-40) or 32 GB (sw0023-34) of physical memory.
  • A few nodes (sw0041-51) are IBM System x3850 X5 servers, also based on Intel® Xeon® processors (E7-4860). These servers have a total of 40 cores per node and support up to 80 threads through hyperthreading. The clock speed of these machines is 2.27 GHz. Two of these servers (sw0050-51) have 1 TB of physical memory; the others have 256 GB.
  • Finally, two of our nodes (sw0052, sw0054) are IBM servers based on the Intel E7-8860 processor, with 80 cores total (160 threads) running at 2.27 GHz, while another (sw0053), also with 80 cores (160 threads), uses the E7-8870 at 2.40 GHz. Each of the three has 512 GB of memory.
SW (Linux) Cluster Nodes
Host    CPU model     Speed     Cores  Threads  Memory
sw0013  Xeon X5675    3.07 GHz  12     24       64 GB
sw0014  Xeon X5675    3.07 GHz  12     24       64 GB
sw0015  Xeon X5675    3.07 GHz  12     24       64 GB
sw0016  Xeon X5675    3.07 GHz  12     24       64 GB
sw0017  Xeon X5675    3.07 GHz  12     24       64 GB
sw0018  Xeon X5675    3.07 GHz  12     24       64 GB
sw0019  Xeon X5675    3.07 GHz  12     24       64 GB
sw0020  Xeon X5675    3.07 GHz  12     24       64 GB
sw0021  Xeon X5675    3.07 GHz  12     24       64 GB
sw0022  Xeon X5675    3.07 GHz  12     24       64 GB
sw0023  Xeon X5675    3.07 GHz  12     24       32 GB
sw0024  Xeon X5675    3.07 GHz  12     24       32 GB
sw0025  Xeon X5675    3.07 GHz  12     24       32 GB
sw0026  Xeon X5675    3.07 GHz  12     24       32 GB
sw0027  Xeon X5675    3.07 GHz  12     24       32 GB
sw0028  Xeon X5675    3.07 GHz  12     24       32 GB
sw0029  Xeon X5675    3.07 GHz  12     24       32 GB
sw0030  Xeon X5675    3.07 GHz  12     24       32 GB
sw0031  Xeon X5675    3.07 GHz  12     24       32 GB
sw0032  Xeon X5675    3.07 GHz  12     24       32 GB
sw0033  Xeon X5675    3.07 GHz  12     24       32 GB
sw0034  Xeon X5675    3.07 GHz  12     24       32 GB
sw0035  Xeon X5670    2.93 GHz  12     24       64 GB
sw0036  Xeon X5670    2.93 GHz  12     24       64 GB
sw0037  Xeon X5670    2.93 GHz  12     24       64 GB
sw0038  Xeon X5670    2.93 GHz  12     24       64 GB
sw0039  Xeon X5670    2.93 GHz  12     24       64 GB
sw0040  Xeon X5670    2.93 GHz  12     24       64 GB
sw0041  Xeon E7-4860  2.27 GHz  40     80       256 GB
sw0042  Xeon E7-4860  2.27 GHz  40     80       256 GB
sw0043  Xeon E7-4860  2.27 GHz  40     80       256 GB
sw0044  Xeon E7-4860  2.27 GHz  40     80       256 GB
sw0045  Xeon E7-4860  2.27 GHz  40     80       256 GB
sw0046  Xeon E7-4860  2.27 GHz  40     80       256 GB
sw0047  Xeon E7-4860  2.27 GHz  40     80       256 GB
sw0048  Xeon E7-4860  2.27 GHz  40     80       256 GB
sw0049  Xeon E7-4860  2.27 GHz  40     80       256 GB
sw0050  Xeon E7-4860  2.27 GHz  40     80       1 TB
sw0051  Xeon E7-4860  2.27 GHz  40     80       1 TB
sw0052  Xeon E7-8860  2.27 GHz  80     160      512 GB
sw0053  Xeon E7-8870  2.40 GHz  80     160      512 GB
sw0054  Xeon E7-8860  2.27 GHz  80     160      512 GB
sw0055  Xeon X5680    3.33 GHz  12     24       144 GB
sw0056  Xeon X5680    3.33 GHz  12     24       144 GB
sw0057  Xeon X5680    3.33 GHz  12     24       144 GB

Why these Systems?

The main emphasis of these systems is high floating-point performance for a modest number of processes or threads. Since commercial software such as Fluent and Abaqus is increasingly focused on supporting Linux only, this cluster was acquired to continue offering recent versions of these software packages. In addition, the higher single-core performance of these nodes (compared with the Sparc/Solaris-based M9000 cluster, for instance) allows for more efficient use of license seats, which are usually priced per core.

Who Should Use This Cluster?

The software cluster runs on the Linux operating system, and should therefore only be used if the software cannot be compiled or run on the Sparc/Solaris platform. Runs that require more than 64 Gbyte of memory should be performed on the M9000 cluster unless the program is parallelized using MPI with distributed memory and very low communication requirements.

We suggest you consider using this compute cluster if

  • Your application is very floating-point intensive, has little else to do, and needs only modest amounts of memory. With few exceptions, no more than 64 GB is available in the form of shared memory.
  • Your application is commercial or public-domain software that supports only Linux or poses considerable challenges to port to a Solaris platform.
  • Your application is either explicitly parallel (for instance, using MPI) and has very low communication requirements, or is multi-threaded with a small number (typically no more than 12) of scaling threads.
  • Your application uses a commercial license that is scaled per process; in such cases it is favourable to use machines with the maximum per-process performance.

This cluster might not be suitable if

  • You need to perform a large number of relatively short jobs, each serial or with very few threads. Jobs like this should be sent to the "Victoria Falls" cluster.
  • Your application is memory intensive and/or compiles and runs well on the Solaris/Sparc platform. Such jobs should be sent to the default M9000 cluster.
  • Your application is required to scale to a very large number of processes in a distributed-memory fashion and is communication intensive. Such jobs require a fast interconnect (Infiniband or similar) and should be run on a different system, for instance other Compute Canada installations.

If you think your application could run more efficiently on these machines, please contact us (help@hpcvl.org) to discuss any concerns and let us assist you in getting started.

Note that on this cluster (as on the M9000s), we have to enforce dedicated cores or CPUs to avoid the overhead of sharing and context switching. No "overloading" can be allowed.

How Do I Use This Cluster?

... to access

The Secure Portal offers a direct link called xterm (linux login node). This link opens a terminal session on swlogin1, which is designated as the login/workup node for the cluster. If you encounter issues with the portal login, please let us know. Meanwhile, it is possible to "ssh" directly from sflogin0 to swlogin1 by typing

ssh swlogin1

and re-typing your system password.

The file systems for all of our clusters are shared, so you will be using the same home directory as when you are using the M9000 servers or the standard login node sfnode0. swlogin1 can be used for compilation, program development, and testing only, not for production jobs.

... to compile and link

Since the SW cluster has a completely different architecture from the M9000 servers, code must be re-compiled when migrating to this cluster. The compiler we use on this cluster is the Intel Compiler Suite, which includes compilers for Fortran, C, and C++, as well as MPI and OpenMP support, debuggers, and a development suite. This software resides in /opt/ics and is visible only to the Linux cluster. The versions are:

  • Fortran (ifort): Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1 Build 20110811
  • C (icc): Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1 Build 20110811
  • C++ (icpc): Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1 Build 20110811

This compiler suite needs to be activated before use. The command is

use ics
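
Once the suite has been activated, the compilers can be invoked directly. The following is a command sketch only, assuming Intel Compiler Suite 12.1 is active; the source and output file names are placeholders:

```shell
# Placeholders: replace myprog.* with your own source files.
ifort -O2 -o myprog myprog.f90    # Fortran
icc   -O2 -o myprog myprog.c      # C
icpc  -O2 -o myprog myprog.cpp    # C++

# OpenMP multithreading (Intel 12.1 uses the -openmp spelling):
icc -O2 -openmp -o myprog_omp myprog.c
```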

In many cases, especially when public-domain software is involved, the GNU C/C++/Fortran compilers are preferable. The system version of these (as reported by gcc -v) is:

Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info 
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix 
--enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions 
--enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk 
--disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile 
--enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib 
--with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC)

No special activation is needed to use these, as they reside in a system directory. A newer version of this compiler set is available in /opt/gcc-4.8.3 and can be accessed using the command

use gcc-4.8.3

For applications that cannot be re-compiled (for instance, because the source code is not accessible), a pre-compiled Linux version (an x86_64 build for Red Hat-based systems will work) needs to be obtained.

... to run jobs

As mentioned earlier, program runs for user and application software on the login node are allowed only for test purposes or if interactive use is unavoidable. In the latter case, please get in touch to let us know what you need. Production jobs must be submitted through the Grid Engine load scheduler. For a description of how to use Grid Engine, see the HPCVL GridEngine FAQ.

Grid Engine will schedule jobs to a default pool of machines unless otherwise stated. This default pool contains presently only the M9000 nodes m9k0001-8. Therefore, you need to add the following two lines to your script for your job to be scheduled to the Linux SW cluster exclusively:

#$ -q abaqus.q
#$ -l qname=abaqus.q

The queue name abaqus derives from Abaqus, the first software package that was (and still is) run on this cluster.
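
Putting these directives together, a minimal submission script might look like the sketch below. The job name and executable are placeholders, and any additional resource requests (memory, parallel slots) should be added as described in the GridEngine FAQ:

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd                # run the job from the submission directory
#$ -N myjob            # job name (placeholder)
#$ -q abaqus.q         # target the Linux SW cluster ...
#$ -l qname=abaqus.q   # ... and nothing else
./myprog               # your executable (placeholder)
```

The script would then be submitted with qsub and the job monitored with qstat.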

Note that your jobs will run on dedicated threads, i.e. typically up to 12 processes can be scheduled on a single node. Grid Engine does the scheduling; there is no way for the user to determine which processes run on which cores.

Help?

General information about using HPCVL facilities can be found in our FAQ pages. We also supply user support (please send email to help@hpcvl.org or contact us directly), so if you experience problems, we can assist you.