The simulation performance obtained from NAMD depends on many factors.
The particular simulation protocol being run is one of the largest single
factors, since different simulation methods invoke different code paths
that can have substantially different performance costs, degrees of
parallel scalability, amounts of message passing activity, and use of
hardware acceleration through GPUs or CPU vectorization, all of which
contribute to overall NAMD performance.
When NAMD first starts running, it does significant I/O, FFT tuning,
GPU context setup, and other work that is unrelated to normal
simulation activity, so it is important to measure performance only
when NAMD has completed startup and all of the processing units are
running at full speed.
To measure NAMD performance accurately, run NAMD for at least 500 to
1,000 steps of normal dynamics (not minimization), so that load balancing
has a chance to take place several times and all of the CPUs and GPUs
have ramped up to 100% clock rate. NAMD provides ``Benchmark time:'' and
``TIMING:'' measurements in its output, which can be used for this purpose.
Here, we are only interested in the so-called wall clock time.
Modern GPUs are now so fast, especially in GPU-resident mode, that a run
much longer than 1,000 steps is needed for the hardware to reach a steady
thermal state and power utilization reflecting true long-term simulation
performance. The following configuration parameter can be used to run
dynamics for a specified number of seconds.
- benchmarkTime < perform benchmark for indicated number of seconds >
Acceptable Values: positive integer
Description: Use for benchmarking dynamics. Terminate simulation after
running for the indicated number of seconds, where the termination
condition is checked at the next ``TIMING:'' output.
A reasonable benchmark length might be in the range of 120 to 180 seconds
(two to three minutes) to provide enough time for the hardware to fully
warm up. If outputPerformance is left enabled (the default setting),
the final ``PERFORMANCE:'' line will show the average ns/day performance.
A good value for outputTiming might be 100, 500, or even 1000, depending
on the system size and hardware capability; as a rule of thumb,
performance monitoring should not be done more than about once per second
of wall time.
These practices are now the preferred way to benchmark GPU-accelerated
builds of NAMD, because the measurements shown by the
``Benchmark time:'' output lines are gathered too early in the simulation
to accurately reflect true performance.
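For example, a minimal benchmarking fragment of a NAMD configuration file
might contain settings along the following lines; the specific values
shown here are illustrative choices rather than recommendations for
every system:

    # run dynamics for about three minutes of wall time, then stop
    benchmarkTime      180
    # check timing (and the benchmark termination condition) every 500 steps
    outputTiming       500
    # keep energy output infrequent so it does not distort the measurement
    outputEnergies     500
    # leave enabled (default) so the final PERFORMANCE: line reports average ns/day
    outputPerformance  on
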
Aside from the choice of major simulation protocol and associated
methods in use, it is also important to consider the performance impacts
associated with routine NAMD configuration parameters such as those
that control the frequency of simulation informational outputs and
various types of I/O.
Simulation outputs such as energy information may require NAMD to perform
additional computations beyond the standard force evaluation.
We advise choosing simulation configuration parameters so that energies
(controlled by the outputEnergies parameter) are output only as often as
strictly necessary, since more frequent energy output slows down the
simulation due to the extra calculations it requires.
NAMD writes ``restart'' files so that simulations that were terminated
unexpectedly (for any reason) can be conveniently restarted from the
most recently written restart file. While it is desirable
to have a relatively recent restart point to continue from, writing
restart information costs NAMD extra network communication and disk I/O.
If restart files are written too frequently, this extra activity and I/O
will slow down the simulation. A reasonable approach is to choose the
restart frequency such that NAMD writes restart files
about once every ten minutes of wall clock time.
At such a rate, the extra work and I/O associated with writing
the restart files should remain an insignificant factor in NAMD performance.
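As a hedged worked example (the simulation rate and timestep below are
assumptions chosen only for illustration): a run achieving roughly
100 ns/day with a 2 fs timestep advances about 50 million steps per day,
or roughly 580 steps per second, so ten minutes of wall clock time
corresponds to approximately 350,000 steps, suggesting output settings
such as:

    # illustrative production-run settings, assuming ~100 ns/day at 2 fs/step
    outputEnergies    500
    # about one set of restart files every ten minutes of wall time at this rate
    restartfreq       350000
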
NAMD is provided in a variety of builds that support platform-specific
techniques such as CPU vectorization and GPU acceleration
to achieve higher arithmetic performance, thereby increasing
NAMD simulation throughput.
Whenever possible NAMD builds should be compiled such that
CPU vector instructions are enabled, and highly tuned
platform-specific NAMD code is employed for performance-critical
force computations.
The so-called ``SMP'' builds of NAMD benefit from reduced memory use
and can in many cases perform better overall, but one trade-off
is that the communication thread is unavailable for simulation work.
NAMD performance can be improved by explicitly setting CPU affinity
using the appropriate Charm++ command line flags, e.g.,
++ppn 7 +commap 0,8 +pemap 1-7,9-15.
It is often beneficial to reserve one CPU core for the
operating system, to prevent harmful operating system noise or ``jitter'',
particularly when running NAMD on large scale clusters or supercomputers.
The Cray aprun -r 1 command reserves the last CPU core and forces the
operating system to run on it.
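As a hedged illustration of such a launch on a single node with two
8-core sockets (the binary name, launcher, configuration file name, and
core numbering are assumptions that will vary by platform and build):
one SMP process per socket, with each communication thread pinned to the
first core of its socket and the worker threads to the remaining cores:

    # two processes, 7 worker (PE) threads each; comm threads on cores 0 and 8
    ./charmrun ++local ++n 2 ++ppn 7 ./namd3 +pemap 1-7,9-15 +commap 0,8 myconfig.namd
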
State-of-the-art compute-optimized GPU accelerators
can provide NAMD with simulation performance equivalent to
several CPU sockets (on the order of 100 CPU cores) when used to
greatest effect, e.g., when each GPU has sufficient work.
In general, effective GPU acceleration currently requires on the order
of 10,000 to 100,000 atoms per GPU, assuming a fast network interconnect.
NAMD GPU-offload mode requires several CPU cores to drive each GPU effectively,
ensuring that there is always work ready and available for the GPU.
For contemporary CPU and GPU hardware, the most productive ratios of
CPU core counts per GPU might range from 8:1 to 64:1 or higher depending on
the details of the hardware involved.
GPU-offload mode running on modern GPU hardware will be
bottlenecked by the CPU, which means that it is especially important
to follow NUMA domain mapping considerations and to generally avoid
running more than one hardware thread per core
(i.e., SMT 1, avoiding use of ``hyper-threading'').
An advantageous hardware configuration might have one NUMA domain
per GPU, which is used most effectively by scheduling one
SMP rank per GPU device / NUMA domain, again leaving at least
one core per NUMA domain available for the communication thread.
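A hedged sketch of such a mapping (the node topology, binary name,
launcher, and file name are assumed for illustration): on a node with two
16-core NUMA domains and one GPU attached to each, one might run one SMP
rank per GPU and NUMA domain, reserving the first core of each domain for
the communication thread:

    # one rank per NUMA domain/GPU: 15 worker threads each, comm threads on cores 0 and 16
    ./charmrun ++local ++n 2 ++ppn 15 ./namd3 +pemap 1-15,17-31 +commap 0,16 +devices 0,1 myconfig.namd
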
GPU-resident mode shifts almost all computational work to the GPU,
so it generally requires less CPU core support per device than
GPU-offload mode: the CPU is used primarily for GPU kernel management,
with infrequent bursts of activity related to atom migration and file I/O,
or possibly frequent bursts of activity if some host-based
computation such as Colvars is used.
GPU-resident mode uses shared-memory parallelism
but is capable of scaling across multiple GPUs on the same physical node
when the GPUs are connected by a high-speed, direct GPU-to-GPU interconnect,
such as NVLink or NVSwitch for NVIDIA or Infinity Fabric for AMD.
For standard simulations, CPU involvement is low enough
to be unaffected by NUMA domain issues.
However, heavier per-step use of the CPU cores,
such as using Colvars with GPU-resident mode,
might benefit from mapping device PEs to a shared NUMA domain.
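A hedged launch sketch for GPU-resident mode on a single node with four
GPUs (the binary name, thread count, device numbering, and file name are
assumptions, and the configuration file is assumed to enable GPU-resident
mode, e.g., with GPUresident on): a single-process multicore build is
used, with a modest number of CPU threads driving all four devices:

    # single process, 8 CPU threads, scaling across four GPUs on one node
    ./namd3 +p8 +setcpuaffinity +devices 0,1,2,3 myconfig.namd
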
When running NAMD on more than a single node, it is important to
use a NAMD version that is optimal for the underlying network hardware
and software you intend to run on. The Charm++ runtime system on which
NAMD is based supports a variety of underlying networks, so be sure to
select a NAMD/Charm++ build that is most directly suited for your
hardware platform. In general, we advise users to avoid MPI-based NAMD
builds, as they typically underperform compared with builds that target
a native network layer
such as InfiniBand IB verbs (often referred to as ``verbs''),
the UCX (Unified Communication X) framework (``ucx''),
the Cray-specific ``ofi-crayshasta'' (or ``ofi-crayshasta cxi'') layer
for Slingshot-11,
or the IBM PAMI message passing layer.
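As a hedged example of a multi-node launch with a verbs-based SMP build
(hostnames, core counts, binary and file names are placeholders): a
charmrun nodelist file listing the allocated hosts is used, with one
process per node:

    # nodelist file contents (one host per allocated node):
    #   group main
    #   host node001
    #   host node002
    ./charmrun ++nodelist nodelist ++n 2 ++ppn 15 ./namd3 +pemap 1-15 +commap 0 myconfig.namd

Sites using a batch scheduler typically generate such a nodelist from the
scheduler's environment or use the scheduler's own launcher (for example,
srun on Slurm-based systems), so the exact launch procedure should follow
your site's documentation.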