HowTo:gaussian:release
Gaussian 16 Revision A.03 Release Notes

Usage Notes:

1. Use of the current generation of NVIDIA GPUs is supported for Hartree-Fock and DFT calculations. Refer to the performance notes for details.

2. Parallel performance on larger numbers of processors has been improved. Refer to the performance notes for details and for how to get optimal performance on multiple CPUs, clusters, and GPUs.

3. The parameters which are specified in link-0 (%) input lines and/or in a Default.Route file can now also be specified via either command-line arguments or environment variables. Details are in the input notes.

4. There are new tools for interfacing Gaussian with other programs, both in compiled languages such as Fortran and C and in interpreted languages such as Python and Perl. Refer to the interfacing notes for details.

Changes in Defaults between Gaussian 09 and Gaussian 16:

1. The following defaults are different in Gaussian 16:

   a. Integral accuracy is 10^-12 rather than 10^-10 as in G09.
   b. The default DFT grid for general use is UltraFine rather than FineGrid as in G09; the default grid for CPHF is SG1 rather than CoarseGrid.
   c. SCRF defaults to the symmetric form of IEFPCM (not present in G09) rather than the non-symmetric version.
   d. Physical constants use the 2010 values rather than the 2006 values used in G09.

   The G09Defaults route keyword sets the defaults back to the G09 values for compatibility with previous calculations, but the new defaults are strongly recommended for new studies.

2. Gaussian 16 defaults memory usage to %mem=100mw. Even larger values are appropriate for calculations on larger molecules and when using many processors; refer to the performance notes for details.

3. TDDFT frequency calculations do analytic second derivatives by default, since these are much faster than the numerical derivatives which were the only choice in G09.

------------------------------------------------------------------------

Building Gaussian 16 from Source Code

1. By default, the build on x86 and x86_64 machines targets the current CPU type (or the closest supported CPU), uses the current level 3 cache size, and does not include support for GPUs. A specific CPU type can be specified as an argument to the build script, e.g.

      bsd/bldg16 all sandybridge

   to build executables compiled for Intel Sandy Bridge and later processors even if the current machine is a different type of x86_64. Similarly,

      bsd/bldg16 all gpu

   will build with NVIDIA K40 and K80 GPU support for the current type of x86_64 processor, or use

      bsd/bldg16 all gpu sandybridge

   to turn on both GPU support and a particular CPU type. By default, the build scripts check the size of the level 3 cache on each chip and the number of CPUs per chip, and set the parameter for the amount of cache each thread should try to use to the L3 size divided by the number of CPUs. If the executables are being built on one type of system but will be used primarily on a different machine, then before building one should set the CACHESIZE environment variable to the amount of cache in bytes that each CPU/thread should use. A complete build invocation is sketched after this list.

   On x86_64 machines the supported machine types are:

      ia32p4       (32-bit Pentium)
      amd64        (legacy AMD machines)
      em64t        (legacy Intel x86_64 machines)
      istanbul     (less old AMD machines)
      nehalem      (less old Intel machines)
      sandybridge  (Intel Sandy Bridge machines)
      haswell      (Intel Haswell and Broadwell machines)

   For IBM Power 8 under Linux the option is:

      ibmp8le      (little-endian Power 8 Linux; the default is big-endian Linux)

   For NEC the option is:

      necsxace     (Ace machine)

2. Building Gaussian 16 with Linda requires Linda version 9.0; the executables will build but will not function with previous versions of Linda. The Linda 9.0 file should be un-tarred into the g16 directory, then the bsd/fixlinda script should be run to set paths and soft links appropriately, and then the Linda executables can be built with "mg linda".

3. Building on Intel Macs requires a case-sensitive file system.
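For example, a build targeting sandybridge machines with GPU support might look like the following sketch. It assumes a bash shell and that the commands are issued from the top-level g16 directory; the CACHESIZE value is a hypothetical placeholder and is only needed when the build machine differs from the machines the executables will run on.

      export CACHESIZE=2621440        # hypothetical per-thread cache size in bytes; set only when targeting a different machine
      bsd/bldg16 all gpu sandybridge  # build for Sandy Bridge and later CPUs with K40/K80 GPU support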
------------------------------------------------------------------------

Input Notes

Most options that control how Gaussian 16 operates can be specified in any of four ways. From highest to lowest precedence these are:

1. As link0 input (%-lines). This is the usual method to control a specific job and the only way to control a specific step within a multi-step input file.

2. As options on the command line. This is useful when one wants aliases or other shortcuts for different common ways of running the program.

3. As environment variables. This is most useful in standard scripts, for example for generating and submitting jobs to batch queuing systems.

4. As lines in a Default.Route file. This is most useful when one wants to change the program defaults for all jobs. When searching for a Default.Route file the current default directory is checked first and then the directories in the path for G16 executables (environment variable GAUSS_EXEDIR, which normally points to $g16root/g16).

The parameters which control defaults for Gaussian jobs are:

  Input line         Command Line  Environment Variable  Default.Route  Meaning
  ----------         ------------  --------------------  -------------  -------
  %cpu=...           -c="..."      GAUSS_CDEF "..."      -C- ...        which CPUs to use
  %gpucpu=...        -g="..."      GAUSS_GDEF "..."      -G- ...        which GPUs to use and which CPUs to bind them to
  %usersh            -s="rsh"      GAUSS_SDEF "rsh"      -S- rsh        Linda should use rsh to start workers
  %usessh            -s="ssh"      GAUSS_SDEF "ssh"      -S- ssh        Linda should use ssh to start workers
  %lindaworkers=...  -w="..."      GAUSS_WDEF "..."      -W- ...        which nodes to use with Linda
                     -r="..."      GAUSS_RDEF "..."      -R- ...        defaults for the route
                     -h="..."      GAUSS_HDEF "..."      -H- ...        hostname for the archive entry
                     -o="..."      GAUSS_ODEF "..."      -O- ...        organization/site for the archive entry

The parameters which control defaults for Gaussian utility programs are:

                                   GAUSS_FDEF "..."      -F- ...        default type for formchk (-c, -3, etc.)
                                   GAUSS_UDEF "..."      -U- ...        default memory size for utilities

There are also parameters which are primarily useful when running Gaussian from scripts or external programs:

  #                  -x="..."      GAUSS_XDEF "..."                     complete route for the job (no route section will be read from the input file)
  %chk=              -y="..."      GAUSS_YDEF "..."                     checkpoint file for the job
  %rwf=              -z="..."      GAUSS_ZDEF "..."                     read-write file for the job

Note that the quotation marks are normally required for the command line and environment variables, to avoid modification of the parameter string by the shell. The deprecated parameters %nprocshared and %nproclinda can also be defaulted (flagged by P and L, respectively).
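As an illustration (the bash syntax, the exact g16 invocation, and the input file name job.gjf are assumptions, not taken from the notes above), the following are four equivalent ways of asking for CPUs 0-7:

      In the input file (link0):     %cpu=0-7
      On the command line:           g16 -c="0-7" job.gjf
      As an environment variable:    export GAUSS_CDEF="0-7"
      In a Default.Route file:       -C- 0-7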
------------------------------------------------------------------------

Parallel Usage and Performance Notes

I. Shared-memory parallelism

1. Calculations involving larger molecules and basis sets benefit from larger memory allocations. 4 GBytes or more per processor is recommended for calculations involving 50 or more atoms and/or 500 or more basis functions. The freqmem utility estimates the optimal memory size per thread for ground-state frequency calculations; the same value is reasonable for excited-state frequencies and is more than sufficient for ground- and excited-state optimizations. The amount of memory allowed should rise with the number of processors: if 4 GBytes is reasonable for one processor, then the same job using 8 CPUs would run well in 32 GBytes. Of course, the particular hardware may force smaller values, but scaling memory linearly with the number of CPUs should be the goal. In particular, increasing only the number of CPUs at a fixed memory size is unlikely to give good performance on large numbers of processors. For large frequency calculations and for large CCSD and EOM-CCSD energies, it is also desirable to leave enough memory to buffer the large disk files involved, so a Gaussian job should only be given 50-70% of the total memory on the system. For example, on a machine with a total of 128 GBytes, one should typically give 64-80 GBytes to a job which uses all the CPUs and leave the remaining memory for the operating system to use as disk cache.

2. Efficiency is lost when threads are moved from one CPU to another, which invalidates the cache and causes other overhead. On most machines Gaussian can tie threads to specific CPUs, and this is the recommended mode of operation, especially when using larger numbers of processors. The %cpu link0 line specifies the specific CPUs to be used. Thus on a machine with one 8-core chip one should use %cpu=0-7 rather than %nproc=8, because the former ties the first thread to CPU 0, the next to CPU 1, etc. On some older Intel processors (Nehalem and before) there is not enough memory bandwidth to keep all the CPUs on a chip busy, and it is often preferable to use half the CPUs, each with twice as much memory as if all were used. For example, on such a machine with 4 12-core chips and 128 GBytes of memory, with CPUs 0-11 on the first chip, 12-23 on the second, etc., it is better to run using 24 processors (6 on each chip) and give them 72/24 = 3 GBytes of memory each, rather than use all 48 with only 1.5 GBytes each. The input would be

      %mem=72GB
      %cpu=0-47/2

   where the /2 means to use every other core, i.e. cores 0, 2, 4, 6, 8, and 10 (on chip 0), 12, 14, 16, 18, 20, and 22 (on chip 1), etc. With the most recent generations of Intel processors (Haswell and later) the memory bandwidth is better and using all the cores on each chip works well; a sample input for such a machine is shown after this section.

3. As long as sufficient memory is available and threads are tied to specific cores, parallel efficiency on large molecules is good up to 64 or more cores.

4. Hyperthreading is not useful for Gaussian, since it effectively divides resources such as memory bandwidth among threads on the same physical CPU. If hyperthreading cannot be turned off, Gaussian jobs should use only one hyperthread on each physical CPU. Under Linux, hyperthreads on different processors are grouped together. That is, if a machine has 2 chips each with 8 cores and 3-way hyperthreading, then "CPUs" 0-7 are across the 8 cores on chip 0, 8-15 are across the 8 cores on chip 1, 16-23 are the second hyperthreads on the 8 cores of chip 0, etc. So a job would run best with %cpu=0-15. Under AIX, hyperthreads are grouped together with up to 8 hyperthread numbers for each CPU even if fewer hyperthreads are in use, so with 2 8-core chips and 4-way hyperthreading, "CPUs" 0-3 are all on core 0 of chip 0, 8-11 are on core 1 of chip 0, etc. So one would want to use %cpu=0-127/8 to select "CPUs" 0, 8, 16, etc., which each use a distinct core.
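As an illustrative single-node input (the machine size, the memory fraction, and the route line are assumptions chosen to match the guidelines above), on a Haswell node with two 8-core chips and 64 GBytes of memory one might pin one thread to each of the 16 cores and give the job roughly 60% of the memory:

      %mem=40GB
      %cpu=0-15
      #p b3lyp/6-31g* opt freq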
II. Cluster (Linda) parallelism

1. Hartree-Fock and DFT energies, gradients, and frequencies run in parallel across clusters, as do MP2 energies and gradients. MP2 frequencies and CCSD and EOM-CCSD energies and optimizations are SMP-parallel but not cluster-parallel. Numerical derivatives, such as DFT anharmonic frequencies and CCSD frequencies, are parallelized across the nodes of a cluster by doing a complete gradient or second-derivative calculation on each node, splitting the directions of differentiation across the workers in the cluster.

2. Shared-memory and cluster parallelism can be combined. Generally, one uses shared-memory parallelism across all CPUs in each node of the cluster. Note that %cpu and %mem apply to each node of the cluster. Thus if one has 3 nodes named a, b, and c, each with 2 chips which have 8 CPUs each, then one might specify

      %mem=64gb
      %cpu=0-15
      %lindaworkers=a,b,c
      #p b3lyp/6-31g* freq

   This would run 16 threads, each pinned to a CPU, on each of the 3 nodes, giving 4 GBytes to each of the 48 threads.

3. For the special case of numerical differentiation (Freq=Anharm, CCSD Freq, etc.) only, one extra worker is used to collect the results, so these jobs should be run with two workers on the master node (where g16 is started). For the above example, if the job was doing anharmonic frequencies one would use

      %mem=64gb
      %cpu=0-15
      %lindaworkers=a:2,b,c
      #p b3lyp/6-31g* freq=anharm

   where g16 is assumed to be started on node a. This will start 2 workers on node a, one of which just collects results, and will do the computational work using the other worker on a and those on b and c.
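When such jobs are generated by a batch queuing script, the same choices can instead be supplied through the environment variables described in the Input Notes above. The following bash fragment is only a sketch: the node names are reused from the example above, the g16 invocation is an assumed form, and it is assumed that GAUSS_WDEF accepts the same worker syntax as %lindaworkers.

      export GAUSS_SDEF="ssh"        # have Linda start workers with ssh
      export GAUSS_WDEF="a:2,b,c"    # Linda worker nodes (two on the master node, as for Freq=Anharm)
      g16 job.gjf                    # assumed invocation; the input file still carries %cpu and %mem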
III. Using GPUs

1. Gaussian 16 can use NVIDIA K40 and K80 GPUs under Linux. Earlier GPUs do not have the computational capabilities or memory size to run the algorithms in G16. Allowing larger amounts of memory is even more important when using GPUs than when using CPUs alone, since larger batches of work must be done at the same time in order to use the GPUs efficiently. The K40 and K80 can have up to 16 GBytes of memory, and one typically tries to have most of this available for Gaussian, which requires at least an equal amount of memory for the CPU thread which is running each GPU. 8 or 9 GBytes works well if there is 12 GBytes total on each GPU, or 11-12 GBytes for a 16 GByte GPU.

2. When using GPUs it is essential to have each GPU controlled by a specific CPU, and it is much preferable if that CPU is physically close to the GPU it is controlling. The hardware arrangement can be checked using the nvidia-smi utility. For example, this output is for a machine with 2 16-core Haswell CPU chips and 4 K80 boards, each of which has two GPUs:

              GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
      GPU0     X    PIX   SOC   SOC   SOC   SOC   SOC   SOC   0-15
      GPU1    PIX    X    SOC   SOC   SOC   SOC   SOC   SOC   0-15
      GPU2    SOC   SOC    X    PIX   PHB   PHB   PHB   PHB   16-31
      GPU3    SOC   SOC   PIX    X    PHB   PHB   PHB   PHB   16-31
      GPU4    SOC   SOC   PHB   PHB    X    PIX   PXB   PXB   16-31
      GPU5    SOC   SOC   PHB   PHB   PIX    X    PXB   PXB   16-31
      GPU6    SOC   SOC   PHB   PHB   PXB   PXB    X    PIX   16-31
      GPU7    SOC   SOC   PHB   PHB   PXB   PXB   PIX    X    16-31

   The important part is the CPU affinity. This shows that GPUs 0 and 1 (on the first K80 card) are connected to the CPUs on chip 0, while GPUs 2-7 (on the other three K80 cards) are connected to the CPUs on chip 1. So a job which uses all the CPUs (24 CPUs doing parts of the computation and 8 controlling GPUs) would use the input

      %cpu=0-31
      %gpucpu=0-7=0-1,16-21

   or, equivalently but more verbosely,

      %cpu=0-31
      %gpucpu=0,1,2,3,4,5,6,7=0,1,16,17,18,19,20,21

   This pins threads 0-31 to CPUs 0-31 and then uses GPU0 controlled by CPU 0, GPU1 controlled by CPU 1, GPU2 controlled by CPU 16, etc. Normally one uses consecutive numbering in the obvious way, but things can be associated differently in special cases. For example, suppose one job on this machine was already using 6 CPUs with %cpu=16-21. Then if one wanted to use the other 26 CPUs, with 8 of them controlling GPUs, one would specify

      %cpu=0-15,22-31
      %gpucpu=0-7=0-1,22-27

   This would create 26 threads, with the GPUs controlled by the threads on CPUs 0, 1, 22, 23, 24, 25, 26, and 27.

3. GPUs are not helpful for small jobs, but they are effective for larger molecules when doing DFT energies, gradients, and frequencies (for both ground and excited states). They are not used effectively by post-SCF calculations such as MP2 or CCSD. Each GPU is several times faster than a CPU, but since modern machines typically have many more CPUs than GPUs, it is important to use all the CPUs as well as the GPUs; the overall speedup from the GPUs is then smaller, because the CPUs already contribute effectively. For example, if the GPU is 5x faster than a CPU, then going from 1 CPU to 1 CPU + 1 GPU gives a 5x speedup, but going from 32 CPUs to 32 CPUs + 8 GPUs effectively means going from 32 compute CPUs to 24 compute CPUs + 8 GPUs (since 8 CPUs now control the GPUs), which is equivalent to 24 + 5x8 = 64 CPUs, for a speedup of 64/32 or 2x.

4. GPUs on the nodes of a cluster can be used. Since the %cpu and %gpucpu specifications are applied to each node in the cluster, the nodes must have identical configurations (number of GPUs and their affinity to CPUs); since most clusters are collections of identical nodes, this is not usually a problem.

IV. CCSD, CCSD(T) and EOM-CCSD calculations

These calculations can use memory to avoid I/O and will run much more efficiently if they are allowed enough memory to store the amplitudes and product vectors in memory. If there are O active occupied orbitals (NOA in the output) and V virtual orbitals (NVB in the output), then approximately 9*O^2*V^2 words of memory are required. This does not depend on the number of processors used.
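As a rough worked example (the orbital counts are hypothetical, and 8-byte words are assumed, as on 64-bit builds):

      O = 50 active occupied orbitals, V = 500 virtual orbitals
      9 * O^2 * V^2 = 9 * 2500 * 250000 = 5.625e9 words
      5.625e9 words * 8 bytes/word = 45 GBytes, i.e. roughly %mem=45GB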