Difference between revisions of "Hardware:SW"

Latest revision as of 13:36, 19 January 2018

The SW cluster has been decommissioned. Please refer to the Frontenac Cluster.

The SW (Linux) Cluster

The Centre for Advanced Computing operates a cluster of x86-based multicore machines running Linux. This page explains the essential features of this cluster and is meant as a basic guide to its usage.

SW (Linux) Cluster Nodes ("old" sw series)
Host	CPU model	Speed	Cores	Threads	Memory
sw0044	Xeon E7-4860	2.3GHz	40	80	256 GB
sw0045	Xeon E7-4860	2.3GHz	40	80	256 GB
sw0046	Xeon E7-4860	2.3GHz	40	80	256 GB
sw0047	Xeon E7-4860	2.3GHz	40	80	256 GB
sw0048	Xeon E7-4860	2.3GHz	40	80	256 GB
sw0049	Xeon E7-4860	2.3GHz	40	80	256 GB
Software (SW) Linux Cluster

SW (Linux) Cluster Nodes ("new" cac series)
Host	CPU model	Speed	Cores	Threads	Memory
cac019	E7-4860	2.3 GHz	40	80	256 GB
cac020	E7-4830 v3	2.1 GHz	48	96	1.2 TB
cac021	E7-4830 v3	2.1 GHz	48	96	1.2 TB
cac022	E7-8860	2.3 GHz	80	160	512 GB
cac023	E7-8860	2.4 GHz	80	160	512 GB
cac024	E7-8860	2.4 GHz	80	160	512 GB
cac025	E7-4860	2.3 GHz	40	80	1 TB
cac026	E7-4860	2.3 GHz	40	80	1 TB
cac027	E7-8850 v2	2.3 GHz	48	96	256 GB
cac028	E7-8867 v3	2.5 GHz	128	256	2 TB
cac028	E7-8867 v3	2.5 GHz	128	256	2 TB
cac029	E7-8867 v3	2.5 GHz	128	256	2 TB
cac030	E7-8867 v3	2.5 GHz	128	256	2 TB
cac032	E7-8867 v3	2.5 GHz	128	256	2 TB
cac033	E7-8867 v3	2.5 GHz	128	256	2 TB
cac034	E5-2650 v4	2.2 GHz	24		256 GB
cac035	E5-2650 v4	2.2 GHz	24		256 GB
cac036	E5-2650 v4	2.2 GHz	24		256 GB
cac037	E5-2650 v4	2.2 GHz	24		256 GB
cac038	E5-2650 v4	2.2 GHz	24		256 GB
cac039	E5-2650 v4	2.2 GHz	24		256 GB
cac040	E5-2650 v4	2.2 GHz	24		256 GB
cac041	E5-2650 v4	2.2 GHz	24		256 GB
cac042	E5-2650 v4	2.2 GHz	24		256 GB
cac043	E5-2650 v4	2.2 GHz	24		256 GB
cac044	E5-2650 v4	2.2 GHz	24		256 GB
cac045	E5-2650 v4	2.2 GHz	24		256 GB
cac046	E5-2650 v4	2.2 GHz	24		256 GB
cac047	E5-2650 v4	2.2 GHz	24		256 GB
cac048	E5-2650 v4	2.2 GHz	24		256 GB
cac049	E5-2650 v4	2.2 GHz	24		256 GB
cac050	E5-2650 v4	2.2 GHz	24		256 GB
cac051	E5-2650 v4	2.2 GHz	24		256 GB
cac052	E5-2650 v4	2.2 GHz	24		256 GB
cac053	E5-2650 v4	2.2 GHz	24		256 GB
cac054	E5-2650 v4	2.2 GHz	24		256 GB
cac055	E5-2650 v4	2.2 GHz	24		256 GB
cac056	E5-2650 v4	2.2 GHz	24		256 GB
cac057	E5-2650 v4	2.2 GHz	24		256 GB
cac058	E5-2650 v4	2.2 GHz	24		256 GB
cac059	E5-2650 v4	2.2 GHz	24		256 GB
cac060	E5-2650 v4	2.2 GHz	24		256 GB
cac061	E5-2650 v4	2.2 GHz	24		256 GB
cac062	E5-2650 v4	2.2 GHz	24		256 GB
cac063	E5-2650 v4	2.2 GHz	24		256 GB
cac064	E5-2650 v4	2.2 GHz	24		256 GB
cac065	E5-2650 v4	2.2 GHz	24		256 GB
cac066	E5-2650 v4	2.2 GHz	24		256 GB
cac067	E5-2650 v4	2.2 GHz	24		256 GB
cac068	E5-2650 v4	2.2 GHz	24		256 GB
cac069	E5-2650 v4	2.2 GHz	24		256 GB
cac070	E5-2650 v4	2.2 GHz	24		256 GB
cac071	E5-2650 v4	2.2 GHz	24		256 GB
cac072	E5-2650 v4	2.2 GHz	24		256 GB
cac073	E5-2650 v4	2.2 GHz	24		256 GB
cac074	E5-2650 v4	2.2 GHz	24		256 GB
cac075	E5-2650 v4	2.2 GHz	24		256 GB
cac076	E5-2650 v4	2.2 GHz	24		256 GB
cac077	E5-2650 v4	2.2 GHz	24		256 GB
cac078	E5-2650 v4	2.2 GHz	24		256 GB
cac079	E5-2650 v4	2.2 GHz	24		256 GB
cac080	E5-2650 v4	2.2 GHz	24		256 GB
cac081	E5-2650 v4	2.2 GHz	24		256 GB
cac082	E5-2650 v4	2.2 GHz	24		256 GB
cac083	E5-2650 v4	2.2 GHz	24		256 GB
cac084	E5-2650 v4	2.2 GHz	24		256 GB
cac085	E5-2650 v4	2.2 GHz	24		256 GB
cac086	E5-2650 v4	2.2 GHz	24		256 GB
cac087	E5-2650 v4	2.2 GHz	24		256 GB
cac088	E5-2650 v4	2.2 GHz	24		256 GB
cac089	E5-2650 v4	2.2 GHz	24		256 GB
cac090	E5-2650 v4	2.2 GHz	24		256 GB
cac091	E5-2650 v4	2.2 GHz	24		256 GB
cac092	E5-2650 v4	2.2 GHz	24		256 GB
cac093	E5-2650 v4	2.2 GHz	24		256 GB

Type of Hardware

This cluster consists of X86 multicore nodes made by Lenovo and IBM. All nodes run CentOS Linux and share a file system. Access is handled by Grid Engine. The server nodes are called cac019...cac099.

Presently, the workup node of the "Software Cluster" is swlogin1. This is a Dell PowerEdge R410 server with 2 sockets, each holding a 6-core Intel® Xeon® processor (Intel X5675) running at 3.1 GHz.

Most nodes on the cluster were built by Lenovo and are of the Lenovo NeXtScale nx360 M5 type. These are based on two Intel Xeon E5-2650 v4 12-core CPUs that run at 2.2 GHz, for a total of 24 cores per node.

Larger high-memory nodes were added at the same time (August 2016). These are of the Lenovo System x3950 x6 8-socket type with 8 x Intel E7-8867 v3 16-core processors at 2.5 GHz for a total of 128 cores (dually hyperthreaded). Each of these units has a total of 2 TB of memory. They are used for special applications that require high memory.

Some older nodes are IBM XServers 3850-X5 that are also based on the Intel® Xeon® processor (Intel E7-4860). These servers have a total of 40 cores per node and support up to 80 threads (hyperthreading). The clock speed for these machines is 2.27 GHz. Two of these servers (sw0050-51) have 1 TB of physical memory; the others have 256 GB.

A few of our nodes are IBM servers based on the Intel E7-8860 processor with 80 cores total (160 threads) running at 2.27 GHz, while another one (sw0053) with 80 cores (160 threads) uses the E7-8870 at 2.4 GHz. Each of these has 512 GB of memory.
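
The core, thread and memory figures above can be checked on any Linux node you are logged into. The following is a minimal sketch that reads them from /proc, which is standard on Linux. It is not taken from this page, and the variable names are illustrative. The reported memory is usually slightly below the installed amount because the kernel reserves part of it.

```shell
#!/bin/sh
# Count logical CPUs (hardware threads) visible to the operating system.
LOGICAL_CPUS=$(grep -c '^processor' /proc/cpuinfo)

# Total physical memory in kB, taken from the MemTotal line.
MEM_KB=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)

echo "logical CPUs: $LOGICAL_CPUS"
# Integer division, so the result is rounded down to whole GB.
echo "memory: $((MEM_KB / 1024 / 1024)) GB"
```

On a hyperthreaded node the logical CPU count is twice the core count listed in the tables.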

Why these Systems?

The main emphasis in these systems is high floating-point performance for a modest number of processes or threads. Since commercial software such as Fluent and Abaqus offers support for Linux only, this cluster was originally acquired to offer recent versions of these software packages. In addition, the higher single-core performance of these nodes allows for efficient use of license seats, which are usually priced per core.

Who Should Use This Cluster?

The software cluster runs the Linux operating system and should be used by anyone who wants to run applications that are available on that platform. Runs that require more than 32 GB of memory need to request this explicitly to avoid mis-scheduling.

We suggest you use this cluster if:

Your application is floating-point intensive with modest amounts of memory.

Your application is commercial or public-domain software that supports Linux.

Your application is explicitly parallel (for instance, using MPI) and has low communication requirements, or is multi-threaded with a small number (typically no more than 12) of scaling threads.

Your application uses a commercial license that is scaled per process.

This cluster may not be suitable if:

Your application is very memory intensive. Long waiting times may be the consequence.

Your application is required to scale to a very large number of processes in a distributed-memory fashion and is communication intensive. Such jobs require a fast interconnect (Infiniband or similar) and should be run on a different system, for instance other Compute Canada installations.

If you think your application could run more efficiently on these machines, please contact us (cac.help@queensu.ca) to discuss any concerns and let us assist you in getting started.

Note that we have to enforce dedicated cores or CPUs to avoid sharing and context switching overheads. No "overloading" can be allowed.

Using the Cluster

Access

Indirectly through ssh from sflogin0:

ssh hpcXXXX@130.15.59.64
hpcXXXX@130.15.59.64's password: *****
hpcXXXX@sflogin0$ ssh swlogin1
hpcXXXX@swlogin1's password: *****
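
On clients with OpenSSH 7.3 or later, the two hops above can usually be combined with the -J (ProxyJump) option. The following is only a sketch: whether the gateway permitted jump-host forwarding is an assumption, the variable names are illustrative, and hpcXXXX stands for your own user name.

```shell
#!/bin/sh
# Hypothetical helper: build a single-command login via the sflogin0 gateway.
# Replace hpcXXXX with your own account name.
CAC_USER="hpcXXXX"
GATEWAY="130.15.59.64"   # sflogin0, as given above
TARGET="swlogin1"        # resolved by the gateway, not by the client

# -J (ProxyJump) is available in OpenSSH 7.3 and later.
SSH_CMD="ssh -J ${CAC_USER}@${GATEWAY} ${CAC_USER}@${TARGET}"

# Print the command rather than running it, so it can be inspected first.
echo "$SSH_CMD"
```

Both password prompts still appear, one for each host.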

The file systems for all of our clusters are shared, so you will be using the same home directory as when you are using the M9000 servers or the standard login node sfnode0. swlogin1 can be used for compilation, program development, and testing only, not for production jobs.

Compiling Code

Intel Compiler Suite

The best compiler to use is the Intel Compiler Suite. This includes compilers for Fortran, C, and C++, as well as MPI and OpenMP support, debuggers and development suite. This software resides in /opt/ics. The versions are:

Fortran (ifort): Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1 Build 20110811
C (icc): Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1 Build 20110811
C++ (icpc): Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1 Build 20110811

This compiler suite needs to be activated before use. The command is

use icsmpi

Gnu Compilers

In many cases, especially for public-domain software, the preferable compiler is GNU C/C++/Fortran. The system version of these is:

Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info 
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix 
--enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions 
--enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk 
--disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile 
--enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib 
--with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC)

No special activation is needed to use these, as they reside in a system directory. A newer version of this compiler set is available in /opt/gcc-4.8.3 and can be accessed using the command

use gcc-4.8.3
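
As a quick check that the selected compiler works, a small multi-threaded test program can be built and run. The sketch below is an illustration, not part of the original instructions: the file name hello_omp.c and the thread count of 4 are arbitrary choices, and -fopenmp is the standard GCC option for OpenMP support.

```shell
#!/bin/sh
# Sketch: write a tiny OpenMP test program and build it with gcc.
# On the cluster, run "use gcc-4.8.3" first if the newer compiler is wanted.
cat > hello_omp.c <<'EOF'
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Each thread in the parallel region reports its id. */
    #pragma omp parallel
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
EOF

# Compile only if gcc is on the PATH; -fopenmp enables OpenMP support.
if command -v gcc >/dev/null 2>&1; then
    gcc -O2 -fopenmp -o hello_omp hello_omp.c \
        && OMP_NUM_THREADS=4 ./hello_omp \
        || echo "build or run failed" >&2
else
    echo "gcc not found; activate a compiler first" >&2
fi
```

Short tests like this are the kind of run the login node is intended for; longer runs belong in the scheduler.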

If MPI is required, it can be loaded through

use openmpi

For applications that cannot be re-compiled (for instance, because the source code is not accessible), a pre-compiled Linux version (x64 for Redhat will do the trick) needs to be obtained.

Running Jobs

As mentioned earlier, program runs for user and application software on the login node are allowed only for test purposes or if interactive use is unavoidable. In the latter case, please get in touch to let us know what you need. Production jobs must be submitted through the Grid Engine load scheduler.

The name for the SGE queue that schedules to this cluster is abaqus.q. This does not have to be specified as it is the default. The abaqus name for the queue derives from the initial software Abaqus that was (and still is) run on this cluster.

Note that your jobs will run on dedicated threads, i.e. typically up to 12 processes can be scheduled to a single node. The Grid Engine will do the scheduling, i.e. there is no way for the user to determine which processes run on which cores.
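
A minimal Grid Engine job script for this cluster might look like the sketch below. The program name my_program and the file names myjob.sh, myjob.out and myjob.err are placeholders that do not come from this page. The -q line is optional because abaqus.q is the default queue. Parallel environment names and memory resource names are site-specific and not given here, so they appear only as commented hints.

```shell
#!/bin/sh
# Sketch: write a minimal Grid Engine job script to myjob.sh.
# my_program, myjob.out and myjob.err are placeholder names.
cat > myjob.sh <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -V
#$ -q abaqus.q
#$ -o myjob.out
#$ -e myjob.err
# Multi-process or multi-threaded runs also need a "#$ -pe <name> <slots>" line,
# and runs above 32 GB a "#$ -l <resource>=<value>" line; both names are
# site-specific and not stated here.
./my_program
EOF

# Submission is done with qsub on the login node.
echo "Submit with: qsub myjob.sh"
```

After submission, qstat shows the state of the job in the queue.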

Help?

General information about using CAC facilities can be found in our FAQ pages. We also supply user support (please send email to cac.help@queensu.ca or contact us directly), so if you experience problems, we can assist you.

@@ Line 1: / Line 1: @@
 {|  style="border-spacing: 8px;"
 | valign="top" width="50%" style="padding:1em; border:1px solid #fa5882; background-color:#f6eee3; border-radius:7px" |
-'''The SW cluster is presently our main compute cluster. Note that we have undergone a major hardware upgrade and that large portions of these pages are subject to change. Please re-visit occasionally to keep abreast of this.'''
+'''The SW cluster has been decomissioned. Please refer to the [[Hardware:Frontenac|Frontenac Cluster]]'''
 <center>
 |}
@@ Line 602: / Line 602: @@
 | 256 GB
 |-
-| cac094
-| E5-2650 v4
-| 2.2 GHz
-| 24
-|
-| 256 GB
-|-
-| cac095
-| E5-2650 v4
-| 2.2 GHz
-| 24
-|
-| 256 GB
-|-
-| cac096
-| E5-2650 v4
-| 2.2 GHz
-| 24
-|
-| 256 GB
-|-
-| cac097
-| E5-2650 v4
-| 2.2 GHz
-| 24
-|
-| 256 GB
-|-
-| cac098
-| E5-2650 v4
-| 2.2 GHz
-| 24
-|
-| 256 GB
-|-
-| cac099
-| E5-2650 v4
-| 2.2 GHz
-| 24
-|
-| 256 GB
 |}
@@ Line 682: / Line 641: @@
 * Your application is required to scale to a very large number of processes in a distributed-memory fashion and is communication intensive. Such jobs require a fast interconnect (Infiniband or similar) and should be run on a different system, for instance other Compute Canada installations.
-If you think your application could run more efficiently on these machines, please contact us (help@hpcvl.org) to discuss any concerns and let us assist you in getting started.
+If you think your application could run more efficiently on these machines, please contact us (cac.help@queensu.ca) to discuss any concerns and let us assist you in getting started.
 Note that we have to enforce dedicated cores or CPUs to avoid sharing and context switching overheads. No "overloading" can be allowed.
@@ Line 693: / Line 652: @@
 === Access ===
-* Directly through the '''xterm (linux login node)''' application from the [https://portal.hpcvl.queensu.ca Secure Global Desktop (portal)].
 * Indirectly through '''ssh from sflogin0''':
 <pre>ssh hpcXXXX@130.15.59.64
@@ Line 744: / Line 702: @@
 As mentioned earlier, program runs for user and application software on the login node are allowed only for test purposes or if interactive use is unavoidable. In the latter case, please get in touch to let us know what you need. Production jobs must be submitted through the [[HowTo:Scheduler|Grid Engine load scheduler]].
-You need to add the following two lines to your script for your job to be scheduled to the Linux SW cluster exclusively:
+The name for the SGE queue that schedules to this cluster is '''abaqus.q'''. This does not have to be specified as it is the default.
+The abaqus name for the queue derives from the initial software Abaqus that was (and still is) run on this cluster.
-<pre>
-#$ -q abaqus.q
-#$ -l qname=abaqus.q
-</pre>
-The abaqus name for the queue that is added here derives from the initial software Abaqus that was (and still is) run on this cluster.
 Note that your jobs will run on dedicated threads, i.e. typically up to 12 processes can be scheduled to a single node. The Grid Engine will do the scheduling, i.e. there is no way for the user to determine which processes run on which cores.
@@ Line 757: / Line 709: @@
 ===Help?===
-General information about using HPCVL facilities can be found in our FAQ pages. We also supply user support (please [mailto:help@hpcvl.org send email to help@hpcvl.org] or [[Contacts:UserSupport|contact us directly]]), so if you experience problems, we can assist you.
+General information about using CAC facilities can be found in our FAQ pages. We also supply user support (please [mailto:cac.help@queensu.ca send email to cac.help@queensu.ca] or [[Contacts:UserSupport|contact us directly]]), so if you experience problems, we can assist you.
